NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a developer, I want the COWRES to employ Redisson 3.19.0 or later in the tasker #100

Open epag opened 2 months ago

epag commented 2 months ago

Author Name: Hank (Hank) Original Redmine Issue: 111397, https://vlab.noaa.gov/redmine/issues/111397 Original Date: 2023-01-09


In our attempt to deploy 6.10, we attempted to use 3.19.0, but the Tasker failed to come up and reported this error:

Exception in thread "main" java.lang.ExceptionInInitializerError
        at wres.tasker.Tasker.main(Tasker.java:128)
Caused by: java.lang.ClassCastException: class java.lang.Float cannot be cast to class java.lang.String (java.lang.Float and java.lang.String are in module java.base of loader 'bootstrap')
        at wres.tasker.WresJob.<clinit>(WresJob.java:207)
        ... 1 more

The same problem occurred using 3.19.1. See #111117 for the discussion.

Hank


Redmine related issue(s): 118525, 121563


epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-02-23T11:45:20Z


Slipping this.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-04-18T18:40:00Z


Upping the priority of this one.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-04-18T19:05:55Z


Yeah, I guess the clue is here, they changed the default codec, so the new codec is probably not deserializing in the same way as the old codec.

https://github.com/redisson/redisson/releases/tag/redisson-3.19.0

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-04-18T19:13:01Z


As to why they didn't bump the major version, idk. edit: I mean, given that it's listed as a breaking change.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-04-18T19:19:26Z


The error occurs here:

206             RBucket<String> bucket = REDISSON_CLIENT.getBucket( "databaseName" );
207             if ( bucket.get() != null && !bucket.get().isBlank() )
208             {
209                 WresJob.activeDatabaseName = bucket.get();
210             }

I would have thought the problem would be with 206, but the exception indicates 207. Hmmm...

Anyway, a mitigation is going to have to be identified, and I hope it doesn't require removing the .aof, but I guess we'll see.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-04-18T19:23:43Z


The casting happens when an item is received from the bucket, I suppose, so it's the second part of L207.

Either way, the problem will be the codec, I would stake a small bet on it :-)

The quick fix will be to use whatever was the default codec before, else to find a better one that is not deprecated.

There's an opportunity to set it around L171, looks like.

For example, as an experiment, this would probably fix it:

redissonConfig.setCodec( new MarshallingCodec() )

But that is deprecated now (edit: so, to be clear, even if that is a quick fix, we shouldn't use it).

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T10:34:33Z


Arvin,

As you're working in this area, this might be a good one to look into when you're done with #108899 - just a thought, up to you, of course. Currently, it's blocking us from upgrading Redisson, which isn't a big deal right now, but it would become a bigger deal if CVEs emerge that require an upgrade (once a fixed version becomes available).

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-15T19:41:32Z


Looking into this and found the current version in the build.gradle we are using:

@implementation( 'org.redisson:redisson:3.18.1' )@

First thing I'm going to attempt to do is reproduce this error.

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-16T12:38:24Z


To clarify, if I upgrade the version to @3.19.0@ and deploy to staging, the tasker will throw the error immediately and it will show up in the docker logs? Nothing needs to be triggered for the error to occur, correct?

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-16T12:45:12Z


Correct, the tasker won't start, so you will see the error without any special test harness.

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-16T14:24:47Z


How do we know that this line is going to return a string value?

RBucket<String> bucket = REDISSON_CLIENT.getBucket( "databaseName" );

What I'm thinking is that this line is returning a Float, causing the casting error. But that goes back to what Hank said: why is the error not happening on line 206? It's reported on line 207, so I am assuming that it's happening at this point:

&& !bucket.get().isBlank()

This is where some sort of call to the String class is made, but I still don't understand why the error happens at 207 and not 206, or how we can assume the @bucket@ value is a string.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-16T14:43:26Z


Arvin wrote:

How do we know that this line is going to return a string value?

We don't, in the sense that it isn't guaranteed, and evidently it doesn't. We do know that the database name supplied to the cache is a string, so the mismatch is probably because they changed the codec. At least, that's what the documentation suggests.

I predict that, if you do the following, the error will no longer occur:

redissonConfig.setCodec( new MarshallingCodec() )

Arvin wrote:

why is the error not happening on line 206? It's reported on line 207, so I am assuming that it's happening at this point:

I think that's because the formal cast doesn't happen until an item is read from the @RBucket@ using @RBucket::get@, but you would need to find the implementation of that API to confirm.
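To illustrate where the cast happens, here is a minimal, self-contained sketch that does not use Redisson at all. @ErasureDemo@, @getBucket@ and @describe@ are hypothetical names, and @AtomicReference@ merely stands in for @RBucket@; the assumption is simply that the codec hands back a @Float@ where the calling code expects a @String@. It needs Java 11 or later for @String::isBlank@.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ErasureDemo {

    // Hypothetical stand-in for REDISSON_CLIENT.getBucket(...): the caller picks T,
    // but nothing checks that the stored value really is a T (generics are erased).
    @SuppressWarnings("unchecked")
    static <T> AtomicReference<T> getBucket(Object stored) {
        AtomicReference<?> holder = new AtomicReference<>(stored);
        return (AtomicReference<T>) holder; // unchecked cast, never fails at runtime
    }

    static String describe() {
        // Analogue of line 206: no exception, even though the holder contains a Float.
        AtomicReference<String> bucket = getBucket(1.0f);
        try {
            // Analogue of the second half of line 207: the compiler inserts a cast
            // to String before calling isBlank(), and that cast is what fails.
            boolean blank = bucket.get().isBlank();
            return "no exception, blank=" + blank;
        } catch (ClassCastException e) {
            return "ClassCastException on read";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Under these assumptions, the assignment succeeds and the failure only surfaces when the value is used as a @String@, which would be consistent with the stack trace pointing at 207 rather than 206.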

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-16T15:36:37Z


Do we know the message that is written in the logs when the tasker successfully starts?

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-16T16:10:05Z


I don't think there's any particular message like, say, the worker shim which says "waiting for work" on @INFO@, but there won't be any errors (e.g., @TaskerFailedToStartException@), and you will also get a bunch of detail on a standard stream from the jetty web server instance that is wrapped by the tasker, so it should be pretty clear.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-16T17:37:50Z


Plus, when the tasker starts, the @.../job@ and @.../api@ COWRES endpoints will become available (@.../job@ will return "Up").

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-16T17:45:54Z


Thank you James and Hank,

I am still doing my initial research into the matter. I think I may have a solution that involves using the @Object@ data type, considering it's the superclass of every other class. Still looking into it.

            RBucket<Object> bucket = REDISSON_CLIENT.getBucket( "databaseName" );
            Object bucketValue = bucket.get();
            if ( bucketValue != null && !bucketValue.toString().isBlank() )
            {
                WresJob.activeDatabaseName = bucketValue.toString();
            }

Something like this. We can invoke the @toString@ method and, in theory, whatever the type, the value will be converted to a string, and this won't change the logic too much.

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-20T17:06:33Z


Hello Hank,

So I've implemented a change to address the error we've been getting with the version upgrade. Can I push my changes up to master, and could you please create a Docker image for me so that I can cycle the containers? Or should I try the approach you posted in the other ticket: https://stackoverflow.com/questions/23935141/how-to-copy-docker-images-from-one-host-to-another-without-using-a-repository ?

Arvin

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-20T18:24:41Z


Arvin,

I welcome you to try the save/load approach, as a test. Let me know how it goes.

You can use @dockerize.sh@ to build the images after you push the code and Jenkins builds it. Since Redisson is used only by the tasker, you need only worry about the wres-tasker image being @saved@/@loaded@ for test purposes. In the stackoverflow, I think the "image name" refers to the "IMAGE ID" from the @docker image ls@ output:

[Hank@nwcal-dock-dev01 ~]$ docker image ls
REPOSITORY                                               TAG                    IMAGE ID       CREATED       SIZE
wres/wres-worker                                         20230609-6735acc-dev   365d0ed52b6c   4 days ago    611MB
nwcal-registry.[host]/wres/wres-worker         20230609-6735acc-dev   365d0ed52b6c   4 days ago    611MB
wres/wres-worker                                         20230616-a275e47       66cb84eb6148   4 days ago    611MB
nwcal-registry.[host]/wres/wres-worker         20230616-a275e47       66cb84eb6148   4 days ago    611MB
nwcal-registry.[host]/wres/wres-graphics       20230616-265322e       119a5bdfc23e   4 days ago    538MB
wres/wres-graphics                                       20230616-265322e       119a5bdfc23e   4 days ago    538MB
...

So something like, @docker save [IMAGE ID for wres-tasker]@ on the @nwcal-dock-dev01@, note the file generated, and @docker load [file name]@ on the @nwcal-wres-ti02@ should work. Recall that the @/home@ directories are visible across machines, so if you save the output in your home directory, that file should be available to staging. After loading the image, run @docker image ls@ to see what image was added, specifically noting the "REPOSITORY" and "TAG".

With the image loaded, modify this file in @nwcal-wres-ti02@:

@/mnt/wres_share/deployment/compose-entry-61488.yml@

Modify this line in the .yml to match the "REPOSITORY" and "TAG" for your loaded image:

image: "${DOCKER_REGISTRY}/wres/wres-tasker:20230614-eeb7b35"

The "REPOSITORY" name goes before the ':'; the "TAG" after.

Cycle the containers and you should be good. Note that Evan's changes only impact the wres-worker image, so you should be good so long as you stick to only the tasker. Let me know if you have any questions,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-20T18:44:31Z


I just did a test of saving and loading a @wres-worker@ image. First, the @-o@ option is required for the @save@; I ran this:

@docker save -o wres-worker.tar 365d0ed52b6c@

On the -dev03, I loaded the image:

@docker load -i /home/Hank/wres-worker.tar@

After loading, I noted that the image did not have a repository or tag:

[Hank@nwcal-wres-dev03 deployment]$ docker image ls
REPOSITORY                                               TAG                IMAGE ID       CREATED       SIZE
<none>                                                   <none>             365d0ed52b6c   4 days ago    611MB
...

In the .yml files, we refer to images by repository and tag, so we should add them. To do so, use the @docker tag@ command. For example,

@docker tag 365d0ed52b6c wres/wres-worker:testtag@

will yield this:

[Hank@nwcal-wres-dev03 deployment]$ docker tag 365d0ed52b6c wres/wres-worker:testtag
[Hank@nwcal-wres-dev03 deployment]$ docker image ls
REPOSITORY                                               TAG                IMAGE ID       CREATED       SIZE
wres/wres-worker                                         testtag            365d0ed52b6c   4 days ago    611MB
...

I can then refer to that image as, "wres/wres-worker:testtag", in the .yml.

In your case, I recommend using the tag, @wres/wres-tasker:[meaningful tag name here]@, where the meaningful tag name is either the revision tag, such as "20230614-eeb7b35", or perhaps something tied to your ticket, like, "111397_test". Ultimately, the choice is yours.

Thanks,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-20T18:56:28Z


Will push my changes now and begin trying this

Thank you!

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-20T19:54:57Z


I am now at the point where I have created the docker images and ran the command @docker image ls@

REPOSITORY                                               TAG                    IMAGE ID       CREATED          SIZE
wres/wres-graphics                                       20230620-2cffdf5       92bb73d332d1   44 seconds ago   538MB
wres/wres-tasker                                         20230620-2cffdf5       d1a0df401ee8   52 seconds ago   540MB
wres/wres-worker                                         20230620-2cffdf5       96f14c486602   58 seconds ago   611MB
wres/wres-worker                                         20230609-6735acc-dev   365d0ed52b6c   4 days ago       611MB
nwcal-registry.[host]/wres/wres-worker         20230609-6735acc-dev   365d0ed52b6c   4 days ago       611MB
wres/wres-worker                                         20230616-a275e47       66cb84eb6148   4 days ago       611MB
nwcal-registry.[host]/wres/wres-worker         20230616-a275e47       66cb84eb6148   4 days ago       611MB
wres/wres-graphics                                       20230616-265322e       119a5bdfc23e   4 days ago       538MB
nwcal-registry.[host]/wres/wres-graphics       20230616-265322e       119a5bdfc23e   4 days ago       538MB
wres/wres-tasker                                         20230614-eeb7b35       c7c6aa480e94   6 days ago       539MB
nwcal-registry.[host]/wres/wres-tasker         20230614-eeb7b35       c7c6aa480e94   6 days ago       539MB
wres/wres-eventsbroker                                   20230609-6735acc       e6d397b65812   2 weeks ago      496MB
wres/wres-eventsbroker                                   20230620-2cffdf5       e6d397b65812   2 weeks ago      496MB
nwcal-registry.[host]/wres/wres-eventsbroker   20230609-6735acc       e6d397b65812   2 weeks ago      496MB
wres/wres-redis                                          20230609-6735acc       6f11083bf2a1   2 weeks ago      27.2MB
wres/wres-redis                                          20230620-2cffdf5       6f11083bf2a1   2 weeks ago      27.2MB
nwcal-registry.[host]/wres/wres-redis          20230609-6735acc       6f11083bf2a1   2 weeks ago      27.2MB
wres/wres-broker                                         20230609-6735acc       b701fa9ee8a1   2 weeks ago      173MB
wres/wres-broker                                         20230620-2cffdf5       b701fa9ee8a1   2 weeks ago      173MB
nwcal-registry.[host]/wres/wres-broker         20230609-6735acc       b701fa9ee8a1   2 weeks ago      173MB

Here is what it looks like. Three images have been updated: @graphics@, @tasker@, and @worker@. What command should I run here?

Should it be @docker save -o wres-tasker.tar d1a0df401ee8@? I used the image ID for the tasker in this command; is it correct? If so, what about @graphics@ and @worker@?

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-20T20:05:38Z


Yes, that command looks right.

To the best of my knowledge, Redisson should only impact the component that talks with the Redis "persister", and that is the tasker. The other two, the worker and graphics clients, should not be impacted. James: Am I wrong about that?

The worker is what Evan is working on, currently, so you should avoid deploying that to staging so you don't overwrite his changes (which are being tested from a branch, not from a Jenkins build).

I would @save@ and @load@ the tasker image, only, to staging, and update the .yml I mentioned above, @compose-entry-61488.yml@, which can then be used to cycle the containers. You can then take a look to see if something goes awry.

Evan will hopefully be merging his branch into the trunk soon (I think he wanted to test a few things first), which will ease the coordination.

Thanks,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-20T20:08:01Z


Graphics client is independent of any service layer components, yes. Deploying an RC of the tasker should be enough to test this one.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-20T20:13:35Z


Another option could be to deploy your changes to the -dev COWRES:

entry machine: nwcal-wres-dev03
workers-only machine: nwcal-wres-dev02

In that case, you can @save@/@load@ all three of the impacted images, or, again, just the tasker since I think it's the only one really impacted. The latest .yml files in -dev are the two @/mnt/wres_share/deployment/modified.*@ files, which were modified to shrink their memory usage to fit in the -dev COWRES allocations. Edit those to point to the new images.

I'll be back tomorrow if you have any questions. Have a great evening!

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-20T20:15:14Z


The -dev COWRES API end point is @https://nwcal-wres-dev.[domain]/index.html@. You can post jobs from there and observe the behavior of the tasker/persister by looking at the status, output, and stdout for the jobs you post.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-20T20:29:03Z


Thank you Hank! Have a great rest of your day! I think I will deploy to @ti02@ for now. I'm going to run through this process to understand it a bit better and will let you know if I have any questions.

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T12:32:25Z


I believe I got it to the point of being able to cycle containers so I'm going to try that now.

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T12:37:43Z


On @ti03@ when trying to bring down the workers I received:

@error while removing network: network deployment_wres_net id 60238abb85fdaed4c242e192c3e5283be7a9617a4ec4a3495424ea15a08fab5f has active endpoints@

Should I try @docker volume prune@ or is there something else I can do?

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-21T12:45:47Z


That is a problem that has been coming up lately in the -ti03 and I'm not sure what the root cause is. The only solution is pruning everything and restarting Docker. However, it doesn't appear to interfere with deployment, so I suggest you just ignore the error for now, at least until we detect issues related to it.

Thanks,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T12:50:13Z


If I ignore it then I can't cycle the worker containers because I can't bring them down. Maybe I misinterpreted what you said?

Thank you,

Arvin

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-21T13:03:59Z


The containers should still come down: If you run @workers_down.sh [blah]@ and then @docker container ls@, you will see no containers running. At least, that's what happened for me last week when this started happening. If you think the containers are not coming down, let me know and I'll take a look.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T13:11:34Z


You are right I just checked and it does bring down the containers :)

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T13:12:23Z


I tried to bring up the @ti02@ and got this error:

[Arvin@nwcal-wres-ti02 deployment]$ ./entry_up.sh compose-entry-61488.yml
Creating the image pull container...
Running ... docker create --rm -e HOST_NAME=nwcal-wres-ti02.[host] -v /var/run/docker.sock:/var/run/docker.sock -v /mnt/wres_share/deployment:/mnt/wres_share/deployment -w /mnt/wres_share/deployment --cap-drop ALL --cpus 2 --memory 512M docker/compose:1.29.2 --file compose-entry-61488.yml pull.
The container id returned for the pull container is 7a03a2c0f185e5f481c3eca6b2eaf3087f298b06637f3b7d4536c8243d6bff94
Running ... docker cp ~/.docker 7a03a2c0f185e5f481c3eca6b2eaf3087f298b06637f3b7d4536c8243d6bff94:/root/.
Successfully copied 19.5kB to 7a03a2c0f185e5f481c3eca6b2eaf3087f298b06637f3b7d4536c8243d6bff94:/root/.
Start the container 7a03a2c0f185e5f481c3eca6b2eaf3087f298b06637f3b7d4536c8243d6bff94 to pull images. Observe the logging to see if problems occur.
Running ... docker container start -a 7a03a2c0f185e5f481c3eca6b2eaf3087f298b06637f3b7d4536c8243d6bff94
Pulling persister    ...
Pulling broker       ...
Pulling tasker       ...
Pulling eventsbroker ...
Pulling worker       ...
Pulling graphics     ...
Pulling broker       ... error
Pulling worker       ... error
Pulling eventsbroker ... error
Pulling tasker       ... error
Pulling graphics     ... error
Pulling persister    ... error

ERROR: for broker  Head "https://nwcal-registry.[host]/v2/wres/wres-broker/manifests/20230609-6735acc": no basic auth credentials

ERROR: for worker  Head "https://nwcal-registry.[host]/v2/wres/wres-worker/manifests/20230609-6735acc-dev": no basic auth credentials

ERROR: for eventsbroker  Head "https://nwcal-registry.[host]/v2/wres/wres-eventsbroker/manifests/20230609-6735acc": no basic auth credentials

ERROR: for tasker  Head "https://nwcal-registry.[host]/v2/wres/wres-tasker/manifests/20230614-eeb7b35": no basic auth credentials

ERROR: for graphics  Head "https://nwcal-registry.[host]/v2/wres/wres-graphics/manifests/20230616-265322e": no basic auth credentials

ERROR: for persister  Head "https://nwcal-registry.[host]/v2/wres/wres-redis/manifests/20230609-6735acc": no basic auth credentials
Head "https://nwcal-registry.[host]/v2/wres/wres-broker/manifests/20230609-6735acc": no basic auth credentials
Head "https://nwcal-registry.[host]/v2/wres/wres-worker/manifests/20230609-6735acc-dev": no basic auth credentials
Head "https://nwcal-registry.[host]/v2/wres/wres-eventsbroker/manifests/20230609-6735acc": no basic auth credentials
Head "https://nwcal-registry.[host]/v2/wres/wres-tasker/manifests/20230614-eeb7b35": no basic auth credentials
Head "https://nwcal-registry.[host]/v2/wres/wres-graphics/manifests/20230616-265322e": no basic auth credentials
Head "https://nwcal-registry.[host]/v2/wres/wres-redis/manifests/20230609-6735acc": no basic auth credentials
Docker command to start the pull image container failed.  See logging above.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-21T13:30:10Z


It appears as though the images you need to bring up the containers were not found locally. Checking,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-21T13:32:23Z


Oh right... Add "-n" to the *_up.sh calls so that you don't attempt to pull images from the registry. Example:

./entry_up.sh -n compose-entry-61488.yml

Does that work for you?

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T13:38:09Z


Yes, that worked! :)

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T13:47:21Z


Hank,

Quick question, when doing @docker image ls@ I see 2 images relating to the tasker:

wres/wres-tasker                                         20230614-eeb7b35       d1a0df401ee8   18 hours ago   540MB
nwcal-registry.[host]/wres/wres-tasker         20230614-eeb7b35       c7c6aa480e94   6 days ago     539MB

In the @compose-entry-61488.yml@ I kept this: @image: "${DOCKER_REGISTRY}/wres/wres-tasker:20230614-eeb7b35"@ because I changed the tag of the image I loaded to the revision tag, as we discussed. When the containers are brought up, will they use the latest @wres-tasker@ image, the one created 18 hours ago?

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-21T13:57:45Z


By including "${DOCKER_REGISTRY}", the image used will be the one from 6 days ago. To use the image from 18 hours ago (saved and loaded instead of using the registry), you need to remove "${DOCKER_REGISTRY}" from the image name in the .yml.

I think. I've never done this before, but it makes sense to me. Let me know if it works.

You may also want to remove the "nwcal-registry.[host]/wres/wres-tasker" image to ensure that the image is not used:

@docker image rm c7c6aa480e94@

If something goes awry, I can always hop onto the machine and pull the needed images.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T14:27:31Z


After I pushed my changes, the initial error was resolved, but now we are getting an error here:

tasker_1        | Exception in thread "main" java.lang.ExceptionInInitializerError
tasker_1        |       at wres.tasker.Tasker.main(Tasker.java:128)
tasker_1        | Caused by: java.lang.ClassCastException: class java.lang.Float cannot be cast to class java.lang.String (java.lang.Float and java.lang.String are in module java.base of loader 'bootstrap')
tasker_1        |       at wres.tasker.JobResults.<init>(JobResults.java:218)
tasker_1        |       at wres.tasker.WresJob.<clinit>(WresJob.java:273)
tasker_1        |       ... 1 more

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T14:29:11Z


Line 218:

        for ( Map.Entry<String, JobMetadata> nextMetadata : this.jobMetadataById.entrySet() )
        {
            String jobId = nextMetadata.getKey(); // line 218, where the exception is thrown
            JobMetadata metadata = nextMetadata.getValue();

Line 273:

    private static final JobResults JOB_RESULTS = new JobResults( CONNECTION_FACTORY,
                                                                  REDISSON_CLIENT );

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-21T14:30:53Z


Arvin,

I recommend you revert your change to the yaml and cycle the containers again (at least on nwcal-wres-ti02) so that the COWRES is available in staging.

FYI... This wiki discusses how to build individual tasker images during development, which can then be deployed to staging without having to push your revisions:

https://vlab.noaa.gov/redmine/projects/wres/wiki/Building_and_Deploying_Individual_Docker_Images

I can show-and-tell the process this afternoon if you like.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-21T14:33:18Z


Arvin,

Did you try to change the codec as suggested in #111397-8?

I think this issue will need to be addressed at the root cause, which is probably the codec...

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T14:39:11Z


James,

I have not, because of what you said here. I tried to work around the problem in the code, but because it's a codec issue, it seems to be transitive, meaning it will likely just continue to cascade to every other read from the cache. Do you suggest I change the codec?

James wrote:

The casting happens when an item is received from the bucket, I suppose, so it's the second part of L207.

Either way, the problem will be the codec, I would stake a small bet on it :-)

The quick fix will be to use whatever was the default codec before, else to find a better one that is not deprecated.

There's an opportunity to set it around L171, looks like.

For example, as an experiment, this would probably fix it:

[...]

But that is deprecated now (edit: so, to be clear, even if that is a quick fix, we shouldn't use it).

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T15:19:25Z


I went ahead and changed the codec to @MarshallingCodec@, saved and loaded the image, and cycled the containers, and the problem is fixed. The tasker starts with Redisson upgraded to @3.19.0@.

However, can we keep this solution considering @MarshallingCodec@ is deprecated?

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-21T15:25:40Z


No, we cannot, but it confirms that the root cause of the problem is the codec and that hopefully gives you a hook for finding a solution. Perhaps there are other non-deprecated (non-default) codecs that will work or perhaps you can find other information about the breaking change or post a ticket if not. The main reason for this suggestion was to establish the root cause, which we now know is the codec.

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T17:41:55Z


So I have tried multiple codecs to no success:

@KryoCodec@, @Kryo5Codec@ (the default codec from version 3.19+), and @JsonJacksonCodec@.

Does anyone have any other suggestions?

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T18:25:32Z


https://github.com/redisson/redisson/wiki/4.-data-serialization - List of Codecs available in Redisson.

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T20:13:52Z


James,

Do you have any suggestions? Should I try to use the default codec with 3.19.0 which will require code changes or try other codecs? I went back to the error I got earlier today regarding the original code changes I made and I was looking into that but what do you suggest should be the approach here?

Arvin

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-21T21:08:40Z


Perhaps you can create a ticket here, asking for advice:

https://github.com/redisson/redisson/issues

It seems odd to me that the codec would behave this way unless we were misusing it. Either way, they should be able to offer advice. If there isn't a non-deprecated codec that "just works", I think we're stuck. It isn't clear to me what we could/would change in our code, since a codec is merely a low-level implementation for en(co)ding and (dec)oding data.

I would probably start by attempting to create a unit test harness for this. It is suboptimal to have to deploy the service to test/reproduce. I would probably then cycle through the various codecs and see if one works, while also posting a support ticket, but it helps to have a reproducible example for a ticket (see the unit test harness).
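A rough sketch of the shape such a harness might take, using only the JDK so that it runs anywhere. @ValueCodec@, @JavaSerializationCodec@, @FloatPrefixCodec@ and @readsBackAsString@ are hypothetical stand-ins, not Redisson classes, and the "new" codec here is deliberately artificial. A real test would presumably plug Redisson's codec classes into the same write-then-read shape, writing with the pre-3.19 default and reading with each candidate.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.ByteBuffer;

public class CodecHarness {

    /** Hypothetical stand-in for a serialization codec (not Redisson's Codec interface). */
    interface ValueCodec {
        byte[] encode(Object value) throws IOException;
        Object decode(byte[] bytes) throws IOException, ClassNotFoundException;
    }

    /** Stand-in "old" codec: plain Java serialization. */
    static final class JavaSerializationCodec implements ValueCodec {
        public byte[] encode(Object value) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            return bytes.toByteArray();
        }

        public Object decode(byte[] bytes) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return in.readObject();
            }
        }
    }

    /** Artificial "new" codec with a different wire format: it treats the first four bytes as a float. */
    static final class FloatPrefixCodec implements ValueCodec {
        public byte[] encode(Object value) {
            return ByteBuffer.allocate(4).putFloat(((Number) value).floatValue()).array();
        }

        public Object decode(byte[] bytes) {
            return ByteBuffer.wrap(bytes).getFloat();
        }
    }

    /** Writes with one codec, reads with another, and reports whether the same String came back. */
    static boolean readsBackAsString(ValueCodec writer, ValueCodec reader, String value) {
        try {
            Object decoded = reader.decode(writer.encode(value));
            return value.equals(decoded);
        } catch (IOException | ClassNotFoundException | RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ValueCodec oldCodec = new JavaSerializationCodec();
        ValueCodec newCodec = new FloatPrefixCodec();
        // "exampleDb" is a made-up value standing in for the cached database name.
        System.out.println("old -> old: " + readsBackAsString(oldCodec, oldCodec, "exampleDb"));
        System.out.println("old -> new: " + readsBackAsString(oldCodec, newCodec, "exampleDb"));
    }
}
```

The point of the write-with-one, read-with-another shape is that it mimics data already persisted by the old codec, which a plain same-codec round trip would not exercise.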

epag commented 2 months ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-21T22:06:55Z


I have created a ticket with them. It doesn't seem like they are very responsive, but hopefully we get some sort of response. https://github.com/redisson/redisson/issues/5119

I completely agree with you that it is suboptimal to need to redeploy to see whether the changes fix the issue. However, the error occurs in @WresJob.java@ in a place that seems very difficult to test (if it is even possible for what we want).

Maybe I can set up a quick call with you tomorrow morning James as I have some more questions, and we can discuss further if that's ok with you?