NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service
Other
2 stars 1 forks source link

As a user, when using "data direct", I want the posting of data to complete before an evaluation begins #91

Open epag opened 2 months ago

epag commented 2 months ago

Author Name: James (James) Original Redmine Issue: 108884, https://vlab.noaa.gov/redmine/issues/108884 Original Date: 2022-10-04


Given a client that behaves properly and issues sequential commands to post data, signal completion and then start an evaluation When the posting of data is complete Then the evaluation should begin after the posting is complete and not before So that the evaluation does not fail on attempting to read a partially written file


Related issue(s): #92 Redmine related issue(s): 108899


epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-10-04T21:33:50Z


See the discussion from #103897-101 onwards, notably #103897-135, which confirms that the evaluation succeeds when there is a large delay between posting and starting the evaluation.

My current speculation is that the server is failing to fully flush/close the stream before sending the "post done" (edit: i.e., http 200), but I vaguely recall looking into this before and seeing no evidence of a stream that is not fully flushed/closed. But it could be that the response code is sent from a separate thread that does not wait for the writing to be fully done, something like that, i.e., some kind of internal async that needs to be sync or coordinated.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-05T12:21:15Z


From Abram's last comment, it may not be a timing issue. He waited 10 minutes between posting data and posting @inputDone@, and it still failed.

Here is the code that puts the temporary file in place:

        try
        {
            // Content type detection is done in core WRES, not based on name.
            if ( Objects.nonNull( posixAttributes  ) )
            {
                temp = Files.createTempFile( jobId + "_", "", posixAttributes );
            }
            else
            {
                temp = Files.createTempFile( jobId + "_", "" );
            }

            Files.copy( data, temp, REPLACE_EXISTING );

            // After the copy, the permissions may not have stuck. Re-apply.
            if ( Objects.nonNull( permissions ) )
            {
                Files.setPosixFilePermissions( temp, permissions );
            }

            //I'm going to compute the md5 within the same try, since the inability
            //to compute the md5 is likely a sign that the file was not put in
            //place properly, so the catch, below, is appropriate.
            InputStream is = Files.newInputStream( temp );
            md5 = DigestUtils.md5Hex(is);

            LOGGER.debug( "Data in: {} (md5 = {})", temp, md5 );
        }
        catch ( IOException ioe )
        {
            LOGGER.warn( "While reading InputStream:", ioe );

            if ( Objects.nonNull( temp ) )
            {
                try
                {
                    Files.deleteIfExists( temp );
                }
                catch ( IOException ioe2 )
                {
                    LOGGER.warn( "Failed to delete {}", temp );
                }
            }

            return Response.status( Response.Status.INTERNAL_SERVER_ERROR )
                           .entity( "Failed to read from stream: " + ioe.getMessage() )
                           .build();
        }
</code>

The md5 code near the end of the @try@ is new for 6.8. @data@ is provided as an @InputStream@ in the arguments to the method.

Its just a @Files.copy@ for the passed in @data@. This still seems to me like some sort of corrupted transfer. Could it be that the one time we asked him to wait before posting @inputDone@ was the one time the transfer was good?

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-05T12:24:39Z


I don't have access to the production machines. Posting a ServiceNow ticket,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-05T12:30:03Z


My investigation is dead in the water until I have access to the prod machines.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-05T12:32:08Z


I can get to the proxy. Let me see what those logs tell me, if they are even working at the moment.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-05T12:35:34Z


The proxy logs tell me nothing useful. They only cover about the past half-hour and are filled with a bunch of meaningless nonsense.

I'll check the tasker logs once I have access to production.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-10-05T12:56:12Z


Yes, it does look more like packets dropped rather than a premature response from the server and I think we'd need to ask Abram to repeat the experiment several times to test that. I would try to reproduce in the same way Abram is doing this, via sequential commands and without a script. I predict you won't be able to reproduce. I asked Abram whether this always happens at home via vpn but I don't think he answered that yet.

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-10-05T13:02:16Z


One thing I think we should do is to add a log message here:

long bytesWritten = Files.copy( data, temp, REPLACE_EXISTING );
LOGGER.debug( "Wrote {} bytes to {}.", bytesWritten, temp );
</code>
epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-05T14:02:33Z


This is what I see in the tasker logs:

2022-10-04T21:25:36.235+0000 [qtp1504154691-18071] INFO org.eclipse.jetty.server.RequestLog - 192.168.0.21 - - "DELETE /job/73763217119662479/input HTTP/1.1" 405 0 "-" "curl/7.29.0"
2022-10-04T21:25:51.735+0000 [qtp1504154691-18071] INFO org.eclipse.jetty.server.RequestLog - 192.168.0.21 - - "POST /job/ HTTP/1.1" 201 266 "-" "curl/7.29.0"
2022-10-04T21:26:39.087+0000 [qtp1504154691-18071] INFO org.eclipse.jetty.server.RequestLog - 192.168.0.21 - - "POST /job/ HTTP/1.1" 201 266 "-" "curl/7.29.0"
2022-10-04T21:27:16.655+0000 [qtp1504154691-18071] INFO org.eclipse.jetty.server.RequestLog - 192.168.0.21 - - "POST /job/7594825360705465141/input/right HTTP/1.1" 200 0 "-" "curl/7.29.0"
2022-10-04T21:29:13.238+0000 [qtp1504154691-17986] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.137:5671
2022-10-04T21:29:13.243+0000 [qtp1504154691-17986] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1664918953241
2022-10-04T21:29:13.243+0000 [qtp1504154691-17986] INFO org.eclipse.jetty.server.RequestLog - 10.3.2.34 - - "GET /job HTTP/1.1" 200 2 "-" "check_http/v2.2 (monitoring-plugins 2.2)"
2022-10-04T21:34:13.093+0000 [qtp1504154691-17986] INFO wres.tasker.WresJob - Successfully connected to broker at broker/172.19.254.137:5671
2022-10-04T21:34:13.098+0000 [qtp1504154691-17986] INFO wres.tasker.WresJob - Successfully used live object service via persister:6379, got id dummyObjectId1664919253096
2022-10-04T21:34:13.099+0000 [qtp1504154691-17986] INFO org.eclipse.jetty.server.RequestLog - 10.3.2.34 - - "GET /job HTTP/1.1" 200 2 "-" "check_http/v2.2 (monitoring-plugins 2.2)"
2022-10-04T21:34:28.778+0000 [qtp1504154691-18071] INFO wres.config.ProjectConfigPlus - Unmarshalled project configuration from web post
2022-10-04T21:34:28.806+0000 [pool-2-thread-274] INFO wres.tasker.JobResults - Looking for exit code on topic job.7594825360705465141.exitCode
2022-10-04T21:34:28.809+0000 [qtp1504154691-18071] INFO wres.tasker.WresJob - Sent a message to queue 'wres.job' with properties '#contentHeader<basic>(content-type=null, content-encoding=null, headers=null, delivery-mode=2, priority=null, correlation-id=7594825360705465141, reply-to=wres.job.status, expiration=null, message-id=null, timestamp=null, type=null, user-id=null, app-id=null, cluster-id=null)'
2022-10-04T21:34:28.810+0000 [qtp1504154691-18071] INFO org.eclipse.jetty.server.RequestLog - 192.168.0.21 - - "POST /job/7594825360705465141/input HTTP/1.1" 200 0 "-" "curl/7.29.0"
2022-10-04T21:34:34.630+0000 [pool-1-thread-8] INFO wres.tasker.JobResultConsumer - Heard a message, consumerTag: amq.ctag-StwxwRVTm-Da1kwsBN8xpA, envelope: Envelope(deliveryTag=1, redeliver=false, exchange=wres.job.status, routingKey=job.7594825360705465141.exitCode), properties: #contentHeader<basic>(content-type=null, content-encoding=null, headers=null, delivery-mode=null, priority=null, correlation-id=7594825360705465141, reply-to=null, expiration=null, message-id=null, timestamp=null, type=null, user-id=null, app-id=null, cluster-id=null), message: [8, 5]
2022-10-04T21:34:34.820+0000 [pool-2-thread-221] INFO wres.tasker.JobStatusSharer - Ignoring report DEAD for job 7594825360705465141 because other mechanisms already handled it.
2022-10-04T21:34:43.807+0000 [pool-2-thread-222] INFO wres.tasker.JobMessageHelper - Finished waiting for job 7594825360705465141 output
2022-10-04T21:34:43.852+0000 [pool-2-thread-223] INFO wres.tasker.JobMessageHelper - Finished waiting for job 7594825360705465141 STDERR
2022-10-04T21:34:49.506+0000 [pool-2-thread-225] INFO wres.tasker.JobMessageHelper - Finished waiting for job 7594825360705465141 STDOUT
2022-10-04T21:34:49.820+0000 [pool-2-thread-221] INFO wres.tasker.JobMessageHelper - Finished waiting for job 7594825360705465141 status
2022-10-04T21:34:57.038+0000 [qtp1504154691-18071] INFO org.eclipse.jetty.server.RequestLog - 192.168.0.21 - - "GET /job/7594825360705465141/status HTTP/1.1" 200 26 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
2022-10-04T21:35:00.882+0000 [qtp1504154691-18006] INFO org.eclipse.jetty.server.RequestLog - 192.168.0.21 - - "GET /job/7594825360705465141/status HTTP/1.1" 200 26 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
2022-10-04T21:35:04.370+0000 [qtp1504154691-18056] INFO org.eclipse.jetty.server.RequestLog - 192.168.0.21 - - "GET /job/7594825360705465141/status HTTP/1.1" 200 26 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
2022-10-04T21:35:28.280+0000 [qtp1504154691-17986] INFO org.eclipse.jetty.server.RequestLog - 192.168.0.21 - - "GET /job/7594825360705465141/stdout HTTP/1.1" 200 11734 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"

Examining the sequence of events,

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-05T17:53:06Z


James wrote:

One thing I think we should do is to add a log message here:

[...]

Change made and pushed in commit:5cb7eef986beb734a8ecf2ea495d05eb57269c78.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-05T17:55:02Z


Its unclear if that will be the only change pushed associated with this ticket. If the problem is packets lost, then there isn't much we can do beyond share the md5 with the user so that they can compare it. We could ask that they provide the md5 with the posted data, but then the API becomes harder. I'm not sure we want to go that route.

Anyway, we are still investigating if its packets lost or something related to timing. Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-10-14T11:02:16Z


More in 6.9.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-24T13:00:27Z


Abram reported the issue, once again, though I note that the error message has changed for the better as of 6.8:

Caused by: java.lang.IllegalArgumentException: There is no reader implementation available for the prescribed data source disposition: (Disposition: UNKNOWN; URI: file:///mnt/wres_share/input_data/4126837376073498015_7441023230095525161; Type: ENSEMBLE_FORECASTS; Orientation: RIGHT).
    at wres.io.reading.TimeSeriesReaderFactory.getReader(TimeSeriesReaderFactory.java:145)
    at wres.io.ingesting.SourceLoader2.readAndIngestData(SourceLoader2.java:267)
    at wres.io.ingesting.SourceLoader2.loadFileSource(SourceLoader2.java:213)
    at wres.io.ingesting.SourceLoader2.loadSource(SourceLoader2.java:168)
    at wres.io.ingesting.SourceLoader2.load(SourceLoader2.java:145)
    at wres.io.Operations.doIngestWork(Operations.java:231)
    at wres.io.Operations.ingest(Operations.java:82)
    at wres.pipeline.ProcessorHelper.processProjectConfig(ProcessorHelper.java:425)
    ... 6 common frames omitted

I've tried to reproduce it three times, the first slowly (checking the MD5 before posting @inputDone@), the second two quickly (posting @inputDone@ immediately after posting the data with both commands separated only by a semicolon on the same command line). Of course, I failed to reproduce it before, so its not surprising I can't reproduce it now.

Those runs were via -ti01. I'm going to attempt to reproduce it from my laptop (slower network connection, I assume).

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-24T13:08:45Z


Definitely a much slower connection from my laptop, but the job still succeeded.

Could there be something with the LDAD machine or its version of tar leading to instability in the package created? Let me look at what Abram said in the other ticket; specifically, did he confirm the Content-MD5 for the most recent failed @right@ only evaluation?

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-10-24T13:10:51Z


He did. We need to confirm that the files you see match those matching md5s, but it points to the problem being unrelated to upload (or one of the problems). We probably need more debug logging on the file properties at evaluation time, both from the tasker side and the worker shim/worker side. It may be a timing issue afterall, although we've reviewed the code and it's hard to see how that happens.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-24T13:20:16Z


James,

I'm running the evaluation more quickly than he is. I entered each step from the command line one at a time, whereas I executed the final two steps with a semicolon in the middle. Surely, I'm allowing for less time between posting and execution.

I don't see his mentioning of the MD5 for that specific evaluation. The images he shared were for the baseline evaluation, I think. Perhaps I'm overlooking a comment. I'll scan it more closely.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-24T13:25:42Z


Nice...

This file sitting on the system from 10/18 has the exact same hash as what Abram reported in his ticket:

-rwxrwx---. 1       499 wres 40484448 Oct 18 17:10 8591703374375734598_2197063682475596338

I think this is for an evaluation he forgot to delete. Point is, I do have a version of his file, packaged on LDAD, that I can work with.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-10-24T13:26:09Z


Yes, the md5s were for the baseline evaluation. You'll need to get the separate md5 for the right-only evaluation if that file is different (I don't see why it would be, but it might be).

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-10-24T13:31:09Z


But since the files are uploaded each time, I do think you need to correlate one of those file-related error messages with an evaluation that shows a correct checksum on both ends. Only then can we be sure it is unrelated to file upload and related to something internal within cowres.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-10-24T13:40:35Z


Evaluation using that file succeeds.

What happens if I post @inputDone@ before the data finished posting. I think I've tested this before, but I'll do it again...

The evaluation fails because of no @source@ tag in the @right@ side. The posting of data reports a failure due to the evaluation no longer waiting for input:

4487679937420455528 is no longer awaiting data: COMPLETED_REPORTED_FAILURE

I guess Jesse anticipated this possibility.

I've asked Abram to look at the Content-MD5 for the failed, @right@-only evaluation.

I've done about all I can do today. I'll pick it up when I return to work tomorrow.

Hank

epag commented 2 months ago

Original Redmine Comment Author Name: James (James) Original Date: 2022-12-09T13:50:29Z


Not sure where we are with this one, but I haven't heard about issues recently. No doubt, a problem still exists, though.

Backlog until we have time to debug further.

epag commented 2 months ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2022-12-09T13:59:32Z


Right. The problem still likely exists, though I doubt it has to do with the Subject/Description of this ticket (evaluation starting before data done posting/flushing). It appears most likely to be dropped packets. Still, we need an investigation ticket, and this appears to serve that purpose.

I think I had asked Abram to let us know if he encounters the problem again, i.e., an unmatching MD5, so that we can take a look at the proxy and tasker logs to see if they say anything. If its dropped packets, then the logs likely won't tell us anything useful, but I at least wanted to check.

Hank