anand-bhat opened this issue 4 years ago (status: Open)
Project: 13421 (Run 1543, Clone 4, Gen 0)
@jchodera, @jcoffland: This WU was first completed and returned on 27/6 and was reissued after 1 day (deadline expiration). I've also seen this behaviour for other projects, so it's not specific to this project or this range. The other observations are a lot older.
Project: 13421 (Run 676, Clone 25, Gen 0)
I'll stop reporting any new ones unless there's a need to get more instances of this happening.
@jcoffland -- @jchodera asked me to post a few new cases where this has happened for projects hosted on aws3.
25th August: P13422 R4478 C92 G1
More reports have been observed in the forum. See [1], [2] and [3] for some possible leads. The issue appears to have gone away after John restarted the server.
27th->31st August: P13422 R4960 C23 G2
May not be the same cause as earlier.
Seen again on 2020-09-01: P13423 R1211 C27 G1
aws3.foldingathome.org appears to have core-dumped and restarted around the time of the duplication:
Log Started 2020-09-02T01:43:18Z
@jchodera @jcoffland Another occurrence on 2020-09-14.
I didn't see a core file dumped this time, and the server hasn't restarted since 2020-09-14T13:01:04Z.
Could this instance possibly be due to the first successful return being sent to the CS but not relayed to the WS? (I believe aws3 was down yesterday around the time these results were uploaded.) serverstats reports this WS as not having its CS connected, but my result (project:13422 run:6841 clone:51 gen:2) was uploaded to the CS at 03:02:31 UTC yesterday while the WS was down.
@jchodera, I've just received P13427 R2539 C1 G0, but it appears to have been completed successfully around 5 hours ago. Can you please check?
The previous one I completed was also duplicated: P13427 R2509 C12 G1
Here's one I completed around 9AM UTC: P13427 R2305 C16 G1, which was subsequently reissued. Unfortunately, in this case, the reassigns resulted in faulty returns. I'm not sure what the retry limit is set to, but if it's the default (5), I suspect this trajectory is now incorrectly marked as failed.
09:59:37:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
09:59:37:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:13427 run:2305 clone:16 gen:1 core:0x22 unit:0x0000000412bc7d9a5f586ce7a572c716
09:59:37:WU00:FS01:Uploading 437.88KiB to 18.188.125.154
09:59:37:WU00:FS01:Connecting to 18.188.125.154:8080
09:59:38:WU00:FS01:Upload complete
09:59:38:WU00:FS01:Server responded WORK_ACK (400)
09:59:38:WU00:FS01:Final credit estimate, 14498.00 points
09:59:38:WU00:FS01:Cleaning up
I haven't noticed any core dumps associated with these recent re-issues.
@jcoffland can you please assist in troubleshooting this?
We've now seen 6 different sets of duplications/incorrect reassigns for projects on aws3, some occurring around the time there was a core dump, others, like this one, without. While this issue isn't limited to aws3, it's the server I'm watching most closely as it hosts the Moonshot projects.
Some more from the same time period. I won't report other cases as I think these ought to be sufficient samples for troubleshooting.
@jcoffland : Since the server was restarted at 2020-09-16T13:59:18Z (about 13 hours ago), it looks like over 9000 WU results have been replaced:
$ grep Replacing ~/server2/log.txt | wc -l
9348
That's ~16% of the WUs serviced during that period. Any chance you could take a look at aws3?
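For context, here is a rough re-creation of that check as a short Python sketch. The "Replacing" marker and the ~/server2/log.txt path come from the grep above; the per-serviced-WU marker is a placeholder, since the exact log line the WS writes per WU isn't shown here.

```python
#!/usr/bin/env python3
# Rough sketch only. "Replacing" and ~/server2/log.txt are taken from the grep
# in this comment; SERVED_MARKER is a hypothetical placeholder.
from pathlib import Path

LOG = Path.home() / "server2" / "log.txt"
REPLACED_MARKER = "Replacing"
SERVED_MARKER = "<per-serviced-WU log line>"   # hypothetical; depends on the WS log format

replaced = served = 0
with LOG.open(errors="replace") as fh:
    for line in fh:
        replaced += REPLACED_MARKER in line
        served += SERVED_MARKER in line

if served:
    print(f"{replaced}/{served} WUs replaced ({100 * replaced / served:.1f}%)")
else:
    print(f"{replaced} 'Replacing' lines found; no serviced-WU marker matched")
```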
I can confirm that ~16% of the WUs serviced by aws3 are still being reassigned despite successful completion.
This is most likely because the timeout is set too short.
@jcoffland, the timeout is set to 1 day; however, the reassigns are happening even though the work is being returned successfully within 1 day (sometimes within an hour). The reassigns also sometimes happen on the same day the first assignment was made. The last occurrence checks both of these boxes, so I suspect the cause is unlikely to be WUs timing out due to the project timeout setting.
@jcoffland @jchodera , there appears to be another occurrence around 25th September:
I figured out what was wrong here. There was another problem on aws3.foldingathome.org that was causing a huge backlog in processing WU results. This left returned WUs in the processing queue for a very long time. During that time, the assignment could time out and the WU would get reassigned. I've solved the backlog problem and modified the WS code so that it clears the assignment timeout as soon as the WU is returned, instead of after it is processed.
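To illustrate the fix described above, here is a minimal Python sketch of a hypothetical server model (not the actual work-server code): if the deadline is effectively cleared as soon as the result is uploaded, a backlog in result processing can no longer push a returned WU past its timeout.

```python
# Hypothetical model only -- not the actual fah work-server code. It contrasts
# the old behaviour (assignment timeout cleared only after the result is
# *processed*) with the fix (timeout cleared as soon as the result is *returned*).
import time
from queue import Queue

class Assignment:
    def __init__(self, wu_id: str, timeout_s: int = 86400):  # 1-day project timeout
        self.wu_id = wu_id
        self.deadline = time.time() + timeout_s
        self.returned = False    # set at upload time under the new behaviour

processing_backlog: "Queue[Assignment]" = Queue()

def on_result_uploaded(assignment: Assignment) -> None:
    """Runs when the client's upload completes (WORK_ACK)."""
    assignment.returned = True            # fix: timeout effectively cleared here
    processing_backlog.put(assignment)    # result may still sit in the backlog

def should_reassign(assignment: Assignment) -> bool:
    """Periodic timeout sweep on the server."""
    if assignment.returned:               # fix: a returned WU is never reassigned
        return False
    return time.time() > assignment.deadline
    # Old behaviour: 'returned' was only set once the result left the backlog
    # and was processed, so a WU stuck in the backlog past its deadline would
    # still be reassigned -- producing exactly the duplicates reported above.
```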
Thanks @jcoffland, that is good to hear. Would this also have caused the cases where WUs were reassigned even though they appear to have been processed earlier?
For instance, the entire trajectory for P13427 R3818 C25 (up to and including Gen 5) was completed on 2020-09-25, indicating all previous Gens were processed. However, when the reissues started, reprocessing began from Gen 2 and caused reassigns and reprocessing for Gens 3, 4, and 5. If Gen 2 had been processed earlier to create Gen 3, I wouldn't have expected it to be resent, since it wouldn't have been sitting in the backlog of WU results awaiting processing.
Another example of this:
P13426 R6586 C3 G1: I'm currently assigned this WU and am working on it.
P13426 R6586 C3 G2, P13426 R6586 C3 G3, P13426 R6586 C3 G4, and P13426 R6586 C3 G5 have all been successfully completed, but I suspect they'll be reassigned once I return Gen 1.
Once the gen has moved forward it shouldn't ever move back. So no, that doesn't make sense. I do see in the log file that the gen somehow got reset to 2 after the WS was restarted. That may mean it failed to write to its DB. I'm looking into it.
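For what it's worth, here is a minimal sketch of the invariant described above, using hypothetical names rather than the actual WS persistence code: once a trajectory's gen has advanced, a restart or a failed DB write should never be able to wind it back.

```python
# Hypothetical sketch -- names (TrajectoryStore, save_gen) are illustrative,
# not real work-server APIs. Enforces the "gen never moves back" invariant even
# if stale state is replayed after a restart or a failed DB write.
class TrajectoryStore:
    def __init__(self) -> None:
        self._gens: dict[tuple, int] = {}   # (project, run, clone) -> highest gen seen

    def save_gen(self, key: tuple, gen: int) -> None:
        current = self._gens.get(key, -1)
        if gen < current:
            raise ValueError(f"refusing to move {key} back from gen {current} to {gen}")
        self._gens[key] = gen

store = TrajectoryStore()
store.save_gen((13426, 6586, 3), 5)     # trajectory has reached Gen 5
# store.save_gen((13426, 6586, 3), 2)   # would raise: a restart must not regress the gen
```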
Thanks. If you need another example, P13426 R6601 C28 -- all 5 Gens were completed but it has started reassignments from Gen 0.
@jcoffland, hopefully you've found what's causing this. I just returned P13426 R5601 C27 G3; this was the 4th successful return, all of them made well before the timeout, including two reassigns today alone.
@jcoffland I returned P13426 R5990 C18 G4 and noticed that Gen 5 was already done before I was assigned Gen 4. Earlier Gens were completed several times. (Table of per-Gen return history, Gen 0 through Gen 5, omitted.)
Received P13426 R5308 C24 G4, which already has 2 successful returns; Gen 5 has already been returned twice. Does the work server check for the presence of a result before generating the next work unit from a late return?
(Table of Gen 4 and Gen 5 returns omitted.)
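To make that question concrete, here is a minimal sketch, with hypothetical names, of the kind of check being asked about: before a late return generates the next WU, look for a result that was already recorded for that gen.

```python
# Hypothetical sketch -- not actual work-server code. Illustrates the check
# asked about above: a late return should not generate the next WU if a result
# for that gen was already recorded.
class ResultStore:
    def __init__(self) -> None:
        self.results: dict[tuple, bytes] = {}   # (project, run, clone, gen) -> result

def handle_return(store: ResultStore, key: tuple, gen: int, result: bytes) -> None:
    if (*key, gen) in store.results:
        # Late duplicate of an already-processed gen: credit the donor,
        # but do not advance the trajectory or hand out gen+1 again.
        print(f"duplicate result for {key} gen {gen}; not generating gen {gen + 1}")
        return
    store.results[(*key, gen)] = result
    print(f"recorded {key} gen {gen}; generating gen {gen + 1}")

store = ResultStore()
handle_return(store, (13426, 5308, 24), 4, b"...")   # first return advances to Gen 5
handle_return(store, (13426, 5308, 24), 4, b"...")   # late return of a reassigned copy is absorbed
```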
Following up on P13426 R5990 C18: there's been another assignment and completion of Gen 4. (Table of Gen 4 and Gen 5 returns omitted.)
Just a quick note to point out that the reassigns are not artifacts from the 25th/26th. Even freshly completed WUs are continuing to be reassigned.
Currently working on P13426 R6015 C3 G5, which was already completed 3 days ago. Gen 0 is the worst I've seen for reassigns: the first successful OK came after the 1-day deadline, so 2 of the 7 completions are justified.
@jcoffland, some new cases of duplicated assignments -- seen for p13429 being run from aws3
P13429 R11287 C9 G1: It also appears that a return classified as faulty was granted full credit. The return from the 2nd assignment arrived at 04:13:16 and was late (it came after the 1-day reassign deadline, by which point the WU had already been reassigned a 3rd time); that late return appears to have triggered the 4th reassignment at 04:13:17, which then failed and caused the other two assignments. If the 2nd return really was faulty, it shouldn't have been given full credit (though that may be a different issue).
P13429 R12394 C7 G0: In this case, the reassignment was definitely not due to a deadline timeout, as the WU was returned within the 1-day window.
Successfully completed WU is reissued. Observed for project 13411 for some WUs.
Your Environment
Expected Behavior
A WU, once successfully completed and returned, should not be reassigned.
Current Behavior
Some cases of WU reissues despite successful result uploads were observed for project 13411.
For example, 13411 (45, 0, 0): The first result was uploaded to the WS and not the CS, which rules out any potential scenario where the result didn't flow back to the WS in time before the deadline.
Startup info for the first WU (log excerpt omitted).
13411 (4, 0, 1): Similar to the previous example, the first result was uploaded to the WS and not the CS.
Possible Solution (Optional)
Unknown. There was a bug in an older version of this core where the returned result did not contain sufficient info to create the next Gen, causing the WU to be reissued. However, this was fixed in a subsequent version of the core, and the core version used by the client for this WU (0.0.10) has not been reported to have this issue.
Steps To Reproduce
Unknown.
Context
A case where failed WUs are reissued to the same client was reported in #1531. It may be that whatever is causing the reissue of successful WUs (i.e., results not being recorded) is not what's causing #1531, if the behaviour described there is expected by design.