NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a developer, I don't want posted data to be removed after evaluation completes if that evaluation failed #92

Open epag opened 4 weeks ago

epag commented 4 weeks ago

Author Name: Hank (Hank) Original Redmine Issue: 108899, https://vlab.noaa.gov/redmine/issues/108899 Original Date: 2022-10-05 Original Assignee: Arvin


I thought I already had a ticket for this, but I can't find it. This may be dangerous if COWRES becomes popular and lots of users use direct posting. Saving the data allows for reproducing errors, but most errors are likely declaration errors, not data errors, so we may end up saving quite a few files for no reason.

I guess I'm on the fence about this and willing to reject if we decide it's too risky.

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-09T19:35:20Z


Just got off the call with Hank and we went through the deployment process to staging. James, I now understand what you meant by changing the YML files.

We did find an interesting question. In some initial testing, Hank made an evaluation fail on purpose, which triggered a @FAILED_BEFORE_IN_QUEUE@ when checking the status. Because it failed before being assigned a worker, the input data was not stored. However, Hank ran another execution that failed after the evaluation began, and the input files were kept :D so initial testing on that seems to be working! Further testing will come next week, covering the threshold and deleting until there is space on the mounted disk.

The question is: what do we do with the input files from evaluations that were never assigned a worker? Do we keep those files, or can we go ahead and delete them?

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-09T19:44:56Z


My two cents...

The FAILED_BEFORE_IN_QUEUE occurred due to a YAML validation error. Such an error has nothing to do with the posted data, which isn't even examined until an evaluation ingests it. Hence, I think we should remove the data.

Just remember that I brought up the idea of keeping the data in all circumstances and clearing out old data in the same way we clear out old heap dumps: during deployment or when otherwise needed. That would be another ticket, and likely for a later time. For now, I don't see a reason to keep the data for a failure that occurs before a worker is handed the evaluation.

Thanks and have a great weekend!

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-12T13:32:49Z


Arvin,

I set the status to Feedback, since you are looking for opinions, and I assigned this ticket to you to reflect reality.

Note that I'm deploying a new revision now to test a change James made to address another ticket. This is where it may be possible for me to step on your toes. If you need to make a change to the .yml file in staging to support your own testing, just note it here, and, if I have to deploy a new revision, I'll update the .yml to account for your change. If I deploy while one of your evaluations is ongoing, that evaluation will be halted and restarted after the deployment, which may delay your testing.

Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-12T13:44:07Z


Hank wrote:

Arvin,

I set the status to Feedback, since you are looking for opinions, and I assigned this ticket to you to reflect reality.

Note that I'm deploying a new revision now to test a change James made to address another ticket. This is where it may be possible for me to step on your toes. If you need to make a change to the .yml file in staging to support your own testing, just note it here, and, if I have to deploy a new revision, I'll update the .yml to account for your change. If I deploy while one of your evaluations is ongoing, that evaluation will be halted and restarted after the deployment, which may delay your testing.

Thanks,

Sounds good, Hank! I've been going over the deployment documentation again to understand the process better, and I'm noting some questions to ask you. I do have to make changes to the YML file, but I will do that after running some large-scale evaluations to fill the disk space up to a point suitable for testing. Do you have availability today to assist me with this and to go over some of my deployment questions?

Thank you,

Arvin

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-12T13:45:44Z


Sure. Today is pretty open, so just block some time on my Calendar.

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-12T17:37:20Z


Hank,

I've been running an evaluation for quite some time now and it has not failed. I am running the same evaluation a second time (the first one failed as expected), but the second one has been running for a while and seems to be hung up. Why is this?

Job ID: 5264750713954335779

Arvin

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-12T17:46:42Z


Arvin,

It appears that using the invalid USGS NWIS URL is not the way to go. The WRES started ingesting data for the other sources. As soon as it attempts USGS NWIS, it will fail, but, until then, you are in a holding pattern.

James: When is the USGS NWIS "left" source processed relative to the other sides? Is it a sorta random ordering?

You will also note several warnings:

2023-06-12T16:51:10.988+0000 INFO SourceLoader Loading the declared datasets. Depending on many factors (including dataset size, dataset design, data service implementation, service availability, network bandwidth, network latency, storage bandwidth, storage latency, concurrent evaluations on shared resources, concurrent computation on shared resources) this can take a while...
2023-06-12T17:17:20.628+0000 WARN IncompleteIngest Another task started to ingest a source but did not complete it. This source will now be removed from the database. The source to be removed is: Source: { path: file:///mnt/wres_share/input_data/3923312112907549474_12650194114371711620/null/s2021073012_RDRP1SCH_hefs_export.xml, Lead: null, Hash: C14C3DF949B1C08CDE30154788EA3450 }.
2023-06-12T17:19:19.392+0000 WARN IncompleteIngest Another task started to ingest a source but did not complete it. This source will now be removed from the database. The source to be removed is: Source: { path: file:///mnt/wres_share/input_data/3923312112907549474_4473017180005842086/null/2022021712_NAEFS_export.xml, Lead: null, Hash: 5A737A59BA160C0529174A01E94E8FB7 }.
2023-06-12T17:19:19.400+0000 WARN IncompleteIngest Another task started to ingest a source but did not complete it. This source will now be removed from the database. The source to be removed is: Source: { path: file:///mnt/wres_share/input_data/3923312112907549474_4473017180005842086/null/2022021712_NAEFS_export.xml, Lead: null, Hash: 3083FF82B70025895CB404BC32A62A5E }.
2023-06-12T17:30:35.493+0000 WARN IncompleteIngest Another task started to ingest a source but did not complete it. This source will now be removed from the database. The source to be removed is: Source: { path: file:///mnt/wres_share/input_data/3923312112907549474_10042952606280425473/null/s2019081612_MONP1TOW_hefs_export.xml, Lead: null, Hash: BFAD3DF73006C277F2C8A88E088F83F7 }.
2023-06-12T17:30:35.501+0000 WARN IncompleteIngest Another task started to ingest a source but did not complete it. This source will now be removed from the database. The source to be removed is: Source: { path: file:///mnt/wres_share/input_data/3923312112907549474_10042952606280425473/null/s2019081612_MNVN4RTN_hefs_export.xml, Lead: null, Hash: FF319F9B3A168C0E051F23AD7F602502 }.
2023-06-12T17:39:35.619+0000 WARN IncompleteIngest Another task started to ingest a source but did not complete it. This source will now be removed from the database. The source to be removed is: Source: { path: file:///mnt/wres_share/input_data/3923312112907549474_7954888972248282245/null/2018111812_NAEFS_export.xml, Lead: null, Hash: AE46BAF5FC10A8D932065291A06769B0 }.
2023-06-12T17:39:37.179+0000 WARN IncompleteIngest Another task started to ingest a source but did not complete it. This source will now be removed from the database. The source to be removed is: Source: { path: file:///mnt/wres_share/input_data/3923312112907549474_7954888972248282245/null/2018111812_NAEFS_export.xml, Lead: null, Hash: AEE8484EE869E94D2501C484B8D304A1 }.
2023-06-12T17:39:37.187+0000 WARN IncompleteIngest Another task started to ingest a source but did not complete it. This source will now be removed from the database. The source to be removed is: Source: { path: file:///mnt/wres_share/input_data/3923312112907549474_7954888972248282245/null/2018111812_NAEFS_export.xml, Lead: null, Hash: F94EFB8FBA3F21959798098DC223F2C4 }.

I believe the WRES is finding sources that were only partially loaded by your previous failed evaluation. Those sources are being removed since they are partial.

Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-12T17:53:02Z


The reason it appears frozen is probably (not certain) that it's reading in the data, and there is a lot of data to read.

James:

We are using Seann's 12 GB evaluation to grow the input files directory and test Arvin's changes for this ticket. We want to come up with a declaration that allows the evaluation and data to be posted, but then fails quickly upon execution, and it should not fail while in the queue. The tricky part is that validation now occurs upon posting @inputDone@ and before handing the evaluation off to a worker. So we need an evaluation that will pass that validation, but is guaranteed to fail before data ingest.

Can you recommend a declaration issue to insert into the evaluation that guarantees that? Relying on a bad USGS NWIS URL apparently is not good, because some data may be ingested, which will waste time.

Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-12T18:02:01Z


Those warning messages are normal if you terminated an evaluation mid-ingest. Basically, the wres is clearing up after you at the next available opportunity.

Regarding ingest, there are no promises about the order of reading or ingest. Basically, a read/ingest pipeline is created for each identified data source and submitted for execution, and the order in which those mini-pipeline tasks complete is undefined. However, the ingest activities are executed against an ingest executor with a fixed number of threads, and the reading activities against a reading executor with a fixed number of threads.

All that said, I would've thought the best way to test this would be to put the threshold below the current disk capacity and then trigger a failure, rather than attempting to increase it, in which case you should see nothing deleted. But perhaps you want to test the threshold-crossing event too? Either way, I would start with the simplest case.
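
As a minimal sketch of checking where the threshold would need to sit (the mount path is taken from the log messages earlier in this thread):

# Current usage of the filesystem that holds posted input data
df -h /mnt/wres_share/input_data

# Size of each evaluation's posted input, largest last
du -sh /mnt/wres_share/input_data/* | sort -h | tail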

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-12T18:05:44Z


James,

The idea of setting the threshold below the current usage could work, but the @input_directory@ directory was empty before being populated with some evaluation data, meaning there would be nothing to delete to free up space.

Arvin

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-12T18:06:58Z


Hank wrote:

Can you recommend a declaration issue to insert into the evaluation that guarantees that? Relying on a bad USGS NWIS URL apparently is not good, because some data may be ingested, which will waste time.

I would start by dropping the threshold below the available space, rather than trying to engineer a threshold-crossing event, which seems a lot harder. If you want a data-posted declaration that passes validation and fails shortly thereafter, you should look to reading as the source of failure because that is the next step, so I think you have the right idea. Try posting a corrupted data source. But, again, I would start with the simplest case for reproducing expected behavior, not some massive evaluation that fills up disk space and triggers a threshold-crossing event.
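
As a sketch of the corrupted-source idea (the source file name is borrowed from the warnings quoted above; the output name is hypothetical): truncating an otherwise valid file should let it post and pass declaration validation, then fail as soon as the reader touches it.

# Keep only the first kilobyte so the XML is no longer well-formed
head -c 1024 s2021073012_RDRP1SCH_hefs_export.xml > corrupted_export.xml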

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-12T18:07:26Z


Arvin wrote:

the @input_directory@ directory was empty before being populated with some evaluation data, meaning there would be nothing to delete to free up space.

Can't you add some (edit: I mean, manually)?
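
For example, a quick way to occupy space in the input directory manually (a sketch only; the path is taken from the log messages above and the sizes are arbitrary):

# Create a few large placeholder files; fallocate reserves the space without writing data
for i in 1 2 3; do
    fallocate -l 2G /mnt/wres_share/input_data/filler_${i}.dat
done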

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-12T18:18:48Z


Arvin:

The simplest case we have is the one I demonstrated using @/home/ISED/wresTestData/issue103897/commands.txt@ (the second set of commands is designed for a staging run using YAML). The easiest way to cause it to fail is to modify the observations to use an invalid USGS NWIS URL. With so little data to ingest, it should fail quickly.

If you then redeploy the service forcing the threshold below the current disk size, you should see those files removed. I agree with James that that would be a good first test.

However, we still have the question about a threshold-crossing event: whether that requires testing and how to do so. Seann's evaluation will make it a bit easier to ensure a threshold is crossed, but there may not be a good way to cause it to fail quickly. I note that the evaluation you mentioned before did finally fail after 1h 11m, as soon as it tried to obtain data from USGS NWIS.

Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-12T18:25:43Z


I don't believe the threshold crossing invokes a different code pathway, right? A data-direct evaluation proceeds: data is copied, it is determined whether the remaining space is too little, and disk space recovery is attempted. There is no special code pathway for threshold crossing within an evaluation; it is merely that the recent data posting caused the threshold to be crossed. The data is copied in advance of these checks, so it's all the same at that point. Personally, I don't think "threshold crossing" is a special case.

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-12T19:04:54Z


I went ahead and manually copied the largest files, and disk usage is now at 46%. I created a copy of @compose-entry-20230612-d1ce2da@, named it @compose-entry-20230612-d1ce2da.arvin.yml@, and changed the threshold to @45@.

I will now execute these commands:

@entry_down.sh compose-entry-20230612-d1ce2da.yml@
@entry_up.sh compose-entry-20230612-d1ce2da.arvin.yml@

Will not be bringing down the workers because there is no need for them. Am I allowed to go forward with this?

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-12T20:09:19Z


Sure. Go ahead.

Sorry I wasn't able to reply earlier: I was told at the last second that I needed to take my kid to get shots.

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-13T00:56:19Z


I attempted the command @entry_down.sh compose-entry-20230612-d1ce2da.yml@ and ended up getting this error:

@error while removing network: network deployment_wres_net id 921b00728a7cf4c136846f6d1bdc88dcc23d64e4476dbdb45dcc886b38499e03 has active endpoints@

Hank, we got this error on Friday; did we ever figure out the reason for it? We tried @prune@, but that did not solve the issue.

Arvin

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T09:54:57Z


There will be a container still attached to the network. You can do a @docker network inspect@ to find out which one and then remove it:

https://docs.docker.com/engine/reference/commandline/network_inspect/
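
For example, the attached containers (if any) can be listed directly (network name from this thread):

# List only the containers attached to the network
docker network inspect -f '{{range .Containers}}{{.Name}} {{end}}' deployment_wres_net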

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T11:06:46Z


Thanks for that command reminder, James. I had forgotten about @docker network ls@ when I encountered this problem on Friday.

Arvin:

In your case, the cause of the issue is pretty straightforward: you had brought down the entry machine (-ti02) containers before bringing down the workers-only (-ti03) containers. I can still see the containers running on the -ti03 which are connected to the WRES docker network, @deployment_wres_net@.

I need to deploy a change from James related to another ticket this morning, so I'll cycle the containers. However, I encourage you to bring down the service and bring it back up yourself later today if only to get some more practice. Just remember: workers down on the -ti03, entry down on the -ti02, entry up on the -ti02, workers up on the -ti03.

I bet I could write a script that does the entire process, including putting the .yml files where they need to go, and make it runnable from @nwcal-ised-dev1@. It would just require @ssh@ and lots of passwords, or getting @ssh@ keys set up, which I had problems doing a couple of months ago.
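
A rough sketch of what that wrapper might look like (host names as used in this thread; the deployment directory is the one referenced elsewhere in this ticket; @workers_down.sh@ is assumed to exist alongside @workers_up.sh@):

#!/bin/bash
# Hypothetical wrapper: cycle the WRES containers in the required order.
set -e
ENTRY_YML=$1     # e.g. compose-entry-20230612-d1ce2da.yml
WORKERS_YML=$2   # e.g. compose-workers-20230613-1629ad1.yml
DEPLOY_DIR=/mnt/wres_share/deployment

ssh nwcal-wres-ti03 "cd ${DEPLOY_DIR} && ./workers_down.sh ${WORKERS_YML}"
ssh nwcal-wres-ti02 "cd ${DEPLOY_DIR} && ./entry_down.sh ${ENTRY_YML}"
ssh nwcal-wres-ti02 "cd ${DEPLOY_DIR} && ./entry_up.sh ${ENTRY_YML}"
ssh nwcal-wres-ti03 "cd ${DEPLOY_DIR} && ./workers_up.sh ${WORKERS_YML}"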

Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T11:17:09Z


I take that back. Even after bringing down all containers on both machines, it still thinks something is attached to the network. The current network is the same one that was running on Friday, so it has not been brought down since I first encountered the issues on Friday.

@docker network inspect@ does not seem to reveal anything about what is connected to the network:

[Hank@nwcal-wres-ti03 deployment]$ docker network inspect deployment_wres_net
[
    {
        "Name": "deployment_wres_net",
        "Id": "8bd1384084b29e4364ef3c1342f7e0433adc4712fc1c30c0549cc5c262f4dabc",
        "Created": "2023-06-09T12:44:51.789805755Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "[IP omitted in case it matters]/26",
                    "Gateway": "[IP omitted]"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {},
        "Options": {},
        "Labels": {
            "com.docker.compose.network": "wres_net",
            "com.docker.compose.project": "deployment",
            "com.docker.compose.version": "1.29.2"
        }
    }
]

But I can't remove it:

[Hank@nwcal-wres-ti02 deployment]$ docker network rm deployment_wres_net
Error response from daemon: error while removing network: network deployment_wres_net id 921b00728a7cf4c136846f6d1bdc88dcc23d64e4476dbdb45dcc886b38499e03 has active endpoints

Maybe I am misinterpreting the inspect output.

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T11:22:58Z


In that case, a container was probably removed but not fully removed. In other words, it's in an inconsistent state.

You can use @docker network disconnect --force network container@ to forcibly disconnect a container (edit: @network@ and @container@ being your names), but you will need to find the container that is connected. As a starting point, try it with the containers you instantiated earlier, assuming you have a record of them.

Failing all that, I think you have sudo privileges now to restart the daemon.
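
A sketch of that sequence (the container name is whatever the inspect output or your own records identify):

# Forcibly disconnect the lingering container, then retry the network removal
docker network disconnect --force deployment_wres_net <container_name>
docker network rm deployment_wres_net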

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T11:24:41Z


(And, for the avoidance of doubt, you are not interpreting that @network inspect@ output incorrectly; there are no listed containers.)

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T11:26:51Z


What does @docker ps -a@ show?

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T11:30:11Z


Had to drop my kid off at school... Back to this...

There are containers on both machines, but they are the standard docker-compose containers. For example:

[Hank@nwcal-wres-ti02 deployment]$ docker container ls -a
CONTAINER ID   IMAGE                   COMMAND                  CREATED          STATUS                      PORTS     NAMES
876b9cda4ceb   docker/compose:1.29.2   "sh /usr/local/bin/d…"   15 minutes ago   Exited (1) 15 minutes ago             hungry_hofstadter
4092805bdbfb   docker/compose:1.29.2   "sh /usr/local/bin/d…"   15 minutes ago   Exited (1) 15 minutes ago             magical_bartik
dbd19f052bfb   docker/compose:1.29.2   "sh /usr/local/bin/d…"   11 hours ago     Exited (1) 11 hours ago               strange_sanderson
486e91401255   docker/compose:1.29.2   "sh /usr/local/bin/d…"   22 hours ago     Exited (0) 11 hours ago               exciting_faraday
17167312581c   docker/compose:1.29.2   "sh /usr/local/bin/d…"   22 hours ago     Exited (1) 22 hours ago               naughty_noyce
fda01582bb73   docker/compose:1.29.2   "sh /usr/local/bin/d…"   3 days ago       Exited (0) 22 hours ago               awesome_rosalind

One of those containers is created every time I cycle the containers to deploy a new image. I'm going to prune them all, since they are no longer needed. I'll then run @docker ps -a@.

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T11:34:56Z


After running @docker system prune -a@ to clear everything out, there are no containers running on either machine. Here is the output from @docker ps -a@; it's the same on both machines:

[Hank@nwcal-wres-ti03 deployment]$ docker ps -a
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
[Hank@nwcal-wres-ti03 deployment]$ 

Still can't remove the network:

[Hank@nwcal-wres-ti02 deployment]$ docker network rm deployment_wres_net
Error response from daemon: error while removing network: network deployment_wres_net id 921b00728a7cf4c136846f6d1bdc88dcc23d64e4476dbdb45dcc886b38499e03 has active endpoints

I'd rather find a solution that does not require restarting the Docker service; that will be a last resort. Let me see what I can find online,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T11:38:30Z


That is a very odd scenario. I predict that you will need to restart the daemon. In any case, I've got nothin' if there are no hooks to any containers to forcibly remove from the network :-)

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T11:40:27Z


Manual execution with @--remove-orphans@ did not do the trick. Online, use of that flag and restarting docker appear to be the only solutions mentioned. I guess I'll restart Docker and cross my fingers.

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T11:41:15Z


Isn't this part of a long line of issues with docker networks on those machines? I know we've repeatedly been unable to remove the networks themselves in the past and resorted to instantiating a growing list of new network names. It could be the same underlying problem. But we didn't have sudo until recently.

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T11:43:48Z


Hank wrote:

Manual execution with @--remove-orphans@ did not do the trick.

I guess that would only work for running containers and we've already shown that there are neither running nor stopped containers according to @docker ps -a@.
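
(For reference, the flag belongs to the compose @up@/@down@ commands; a sketch, assuming the entry compose file from earlier in the thread was the one being brought down:)

# Also remove containers for services not defined in the compose file,
# if compose can still see them
docker-compose -f compose-entry-20230612-d1ce2da.arvin.yml down --remove-orphans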

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T11:47:08Z


Yes, this is something that happens from time to time. In fact, we have network issues (not related to this one) in production that require a special version of the YAMLs created using a different network IP for the @deployment_wres_net@. That reminds me that we should probably address it for 6.14, because needing to modify the YAML from staging to production is not good; we want the YAMLs to be identical.

Anyway, instructions for restarting Docker are found here:

https://vlab.noaa.gov/redmine/projects/wres/wiki/Troubleshooting_WRES_under_Docker#Restarting-Docker-and-Viewing-Logs

I executed the three commands and was then able to remove the network. Here is what I did on the -ti02:

[Hank@nwcal-wres-ti02 deployment]$ sudo systemctl restart containerd.service
[sudo] password for Hank: 
[Hank@nwcal-wres-ti02 deployment]$ sudo systemctl restart docker.socket
[Hank@nwcal-wres-ti02 deployment]$ sudo systemctl restart docker.service
[Hank@nwcal-wres-ti02 deployment]$ docker network ls
NETWORK ID     NAME                  DRIVER    SCOPE
fb38ca4b8120   bridge                bridge    local
921b00728a7c   deployment_wres_net   bridge    local
d7d08ba4436c   host                  host      local
d85e530f4b0b   none                  null      local
[Hank@nwcal-wres-ti02 deployment]$ docker network rm deployment_wres_net
deployment_wres_net
[Hank@nwcal-wres-ti02 deployment]$ docker network ls
NETWORK ID     NAME      DRIVER    SCOPE
fb38ca4b8120   bridge    bridge    local
d7d08ba4436c   host      host      local
d85e530f4b0b   none      null      local

Now let me see if I can bring up the container properly,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T11:53:54Z


I know docker is supposed to be super reliable 'n all, but I see a lot of these weird issues, fwiw. I had largely put that down to using a Windows machine and your issues down to your crappy IT, but I'm not so sure. Or perhaps it's @compose@ that causes these weird issues, rather than the docker engine. Anyway, maintaining dockerized apps is far from seamless.

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T11:59:02Z


For the benefit of Evan and Arvin, quite a few HEFS folks who are employed by Lynker took some docker training courses recently, and that may be an option for y'all if you ask Jason W. I'm not sure whether you have much/any experience with docker, but I think you said not, Arvin. It's probably worth looking into. It's fine when it "just works", but troubleshooting can be a pita.

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-13T12:22:51Z


Hello James and Hank,

So, going through the recent messages in the ticket, I see that the only way we were able to remove the network was by restarting Docker itself. James, I would love to take the Docker course, as it will help in situations like this. I will see if I can reach out to Jason.

Hank, I see that in the @/mnt/wres_share/deployment@ directory there are 2 new YML files. I will get the latest entry YML, edit the threshold, and cycle the containers. I will bring down the workers on -ti03 and the entry on -ti02 and then bring them back up.

Thank you for the support and help,

Arvin

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T12:36:37Z


No problem.

In general, you should not expect this network issue to recur. This is the first time I've seen it in a long time. I think it happened more often for Jesse as he was working out the kinks in the initial deployment process.

Anyway, just make your change to the new .yml file and start testing. I have a couple of evaluations to push, but they are quick, so I shouldn't step on your toes (or vice versa).

Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-13T12:46:29Z


Hank,

I was able to take down the workers and the entry containers, but I was not allowed to start them back up, as I still do not have permission to do so. Do you think you can bring the containers back up, please?

-ti03: @./workers_up.sh compose-workers-20230613-1629ad1.yml@
-ti02: @./entry_up.sh compose-entry-20230613-1629ad1.arvin.yml@

Thank you!

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T12:50:39Z


Arvin,

Can you please copy-and-paste the exact error you are receiving so that I can include it in the ServiceNow ticket? I need to put some pressure on them to resolve the permissions problem.

Cycling the containers now,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T12:52:11Z


Done. The containers are up and the broker sees the 5 worker connections.

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-13T12:52:57Z


docker: Error response from daemon: error while creating mount source path '/home/Arvin/.docker/config.json': mkdir /home/Arvin/.docker: permission denied.

Error response from daemon: Get "https://nwcal-registry.nwc.nws.gov/v2/": dial tcp: lookup nwcal-registry.nwc.nws.gov on 10.3.2.3:53: no such host

[Arvin@nwcal-wres-ti02 deployment]$ docker login nwcal-registry.[host]
Username: Arvin
Password:
Error response from daemon: login attempt to https://nwcal-registry.[host]/v2/ failed with status: 401 Unauthorized

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-13T12:53:23Z


Hank wrote:

Done. The containers are up and the broker sees the 5 worker connections.

Hank

Thank you, Hank!

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T12:57:48Z


Error response from daemon: Get "https://nwcal-registry.nwc.nws.gov/v2/": dial tcp: lookup nwcal-registry.nwc.nws.gov on 10.3.2.3:53: no such host

That's the wrong host. I need to look up why that's happening.

The @~/.docker@ directory is created once you are able to successfully @docker login nwcal-registry.[host]@. You are then supposed to modify the permissions:

chmod 705 ~/.docker                  # This is only required the first time this is done.
chmod 604 ~/.docker/config.json      # This is only required the first time this is done.

Since you can't log in, the directories don't exist, and @docker@ then complains when it tries to create directories in your home directory, to which it obviously does not have write access.

But, again, the error response at the top indicates that the wrong URL is being used. Let's see if we can resolve that and then get a more accurate error message. Taking a look,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T13:00:40Z


I see no reference to "nwc.nws.gov" in anything we have in @/mnt/wres_share/deployment@.

Arvin: When you have a chance, please check your environment:

@env | grep "nwc.nws.gov"@

Does it show anything?

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-13T13:02:26Z


Sorry, Hank, I pasted the entire error, including the wrong-host one. The actual error is at the bottom of the message. Here it is:

[Arvin@nwcal-wres-ti02 deployment]$ docker login nwcal-registry.[host]
Username: Arvin
Password:
Error response from daemon: login attempt to https://nwcal-registry.[host]/v2/ failed with status: 401 Unauthorized

Edit: (When trying to run the entry_up script): docker: Error response from daemon: error while creating mount source path '/home/Arvin/.docker/config.json': mkdir /home/Arvin/.docker: permission denied.

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-13T13:04:33Z


Hank wrote:

I see no reference to "nwc.nws.gov" in anything we have in @/mnt/wres_share/deployment@.

Arvin: When you have a chance, please check your environment:

@env | grep "nwc.nws.gov"@

Does it show anything?

Hank

[Arvin@nwcal-wres-ti02 deployment]$ env | grep "nwc.nws.gov"
[Arvin@nwcal-wres-ti02 deployment]$

Shows nothing

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T13:08:23Z


Oh. Got it. Then it was just a misunderstanding.

The docker command in @entry_up.sh@ references the @~/.docker@ directory, which does not exist and is normally created via @docker login@. Just to see if you can get past that step, please create the folder and then run the @entry_down@ and @entry_up@ commands on the -ti02 (don't worry about -ti03):

mkdir ~/.docker
chmod 705 ~/.docker

I'm guessing it will then complain about the lack of a @config.json@, but let's see,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-13T13:10:12Z


I am running an evaluation right now (expecting a failure soon); once that happens, I will try this and let you know :)

Thank you,

Arvin

epag commented 4 weeks ago

Original Redmine Comment Author Name: Arvin (Arvin) Original Date: 2023-06-13T14:29:20Z


Hank or James,

Do you know where I can find the log files to see the log statements?

Arvin

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T14:33:51Z


If you want a view of the logs printed to standard out as seen by docker, you can do @docker logs@. You will get a view across all containers. Otherwise, the application-specific logs will be inside the containers. We use logback for logging and our @logback.xml@ says this:

          <fileNamePattern>${user.home}/wres_logs/wres.%d{yyyy-MM-dd}.log</fileNamePattern>
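
A sketch of both approaches (container names other than @deployment_tasker_1@, and the in-container home directory, are assumptions):

# Follow the console logs that docker captured for a container
docker logs -f --tail 200 deployment_tasker_1

# List the logback file logs inside a container that writes them
# (container name is a placeholder; ~ resolves to that container user's home)
docker exec <some_worker_container> sh -c 'ls ~/wres_logs'
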
epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T14:35:20Z


See James's response. Since you are probably looking for tasker logs, just run

@docker logs deployment_tasker_1@

I should have covered that in our meeting last week; my bad. Thanks,

Hank

epag commented 4 weeks ago

Original Redmine Comment Author Name: James (James) Original Date: 2023-06-13T14:35:37Z


That said, it looks like the tasker (I assume you want to see those logs) only uses a console logger, so you will need to use @docker logs@ to see that.

<configuration>
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>

            <!-- %exception{full}: full stacktrace in log file. -->
            <pattern>%d{yyyy-MM-dd'T'HH:mm:ss.SSSZ} [%thread] %level %logger - %msg%n%exception{full}</pattern>
        </encoder>
    </appender>

    <!-- Keep eclipse/jetty libraries at info regardless of wres level -->
    <logger name="org.eclipse" level="info" />

    <!-- Allow -Dwres.logLevel to set logging level, otherwise info. -->
    <root level="${wres.logLevel:-info}">
        <appender-ref ref="STDOUT" />
    </root>
</configuration>

epag commented 4 weeks ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2023-06-13T19:42:07Z


Arvin,

I assume you are still working through the immutable List issue we spotted looking at the tasker logs. Can you let me know if and when you plan to deploy a fix to staging?

I would like to start what we call a Test A run tonight and don't want to step on your toes. Thanks,

Hank

P.S. Test A refers to a performance test we have used in the past that combines 40 HEFS evaluations with one very large WPOD NWM evaluation and three evaluations from MARFC. I've used it at various times to test the service and look for dramatic (unexpected) changes in performance. For example, see #113228-127 for a run of Test A when deploying 6.12.