FoldingAtHome / fah-issues

Successfully completed WU is reissued #1532

Open anand-bhat opened 4 years ago

anand-bhat commented 4 years ago

A successfully completed WU is reissued. Observed for some WUs in project 13411.

Your Environment


Expected Behavior

A WU, once successfully completed and returned, should not be reassigned.


Current Behavior

Some cases of WU reissue after successful result uploads were observed for project 13411.

For example, 13411 (45, 0, 0): the first result was uploaded to the WS and not the CS, which rules out any potential scenario where the result didn't flow back to the WS in time before the deadline.

05:11:38:WU01:FS00:Sending unit results: id:01 state:SEND error:NO_ERROR project:13411 run:45 clone:0 gen:0 core:0x22 unit:0x0000000212bc7d9a5ee407ed538d849b
05:11:39:WU01:FS00:Uploading 554.72KiB to 18.188.125.154
05:11:39:WU01:FS00:Connecting to 18.188.125.154:8080
05:11:39:WU01:FS00:Upload complete
05:11:39:WU01:FS00:Server responded WORK_ACK (400)
05:11:39:WU01:FS00:Final credit estimate, 14933.00 points

Startup info for first WU:

04:11:44:WU01:FS00:0x22:*********************** Log Started 2020-06-25T04:11:43Z ***********************
04:11:44:WU01:FS00:0x22:*************************** Core22 Folding@home Core ***************************
04:11:44:WU01:FS00:0x22:       Core: Core22
04:11:44:WU01:FS00:0x22:       Type: 0x22
04:11:44:WU01:FS00:0x22:    Version: 0.0.10
04:11:44:WU01:FS00:0x22:     Author: Joseph Coffland <joseph@cauldrondevelopment.com>
04:11:44:WU01:FS00:0x22:  Copyright: 2020 foldingathome.org
04:11:44:WU01:FS00:0x22:   Homepage: https://foldingathome.org/
04:11:44:WU01:FS00:0x22:       Date: Jun 16 2020
04:11:44:WU01:FS00:0x22:       Time: 15:55:31
04:11:44:WU01:FS00:0x22:   Revision: 147051aad40bcbec7d4b25105bbedfab425f1dc2
04:11:44:WU01:FS00:0x22:     Branch: core22-0.0.10
04:11:44:WU01:FS00:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
04:11:44:WU01:FS00:0x22:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
04:11:44:WU01:FS00:0x22:   Platform: linux2 4.19.76-linuxkit
04:11:44:WU01:FS00:0x22:       Bits: 64
04:11:44:WU01:FS00:0x22:       Mode: Release
04:11:44:WU01:FS00:0x22:Maintainers: John Chodera <john.chodera@choderalab.org> and Peter Eastman
04:11:44:WU01:FS00:0x22:             <peastman@stanford.edu>
04:11:44:WU01:FS00:0x22:       Args: -dir 01 -suffix 01 -version 706 -lifeline 4492 -checkpoint 3
04:11:44:WU01:FS00:0x22:             -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device
04:11:44:WU01:FS00:0x22:             0 -gpu 0
04:11:44:WU01:FS00:0x22:************************************ libFAH ************************************
04:11:44:WU01:FS00:0x22:       Date: Jun 2 2020
04:11:44:WU01:FS00:0x22:       Time: 00:07:31
04:11:44:WU01:FS00:0x22:   Revision: 2b383f4f04f38511dff592885d7c0400e72bdf43
04:11:44:WU01:FS00:0x22:     Branch: HEAD
04:11:44:WU01:FS00:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
04:11:44:WU01:FS00:0x22:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
04:11:44:WU01:FS00:0x22:   Platform: linux2 4.19.76-linuxkit
04:11:44:WU01:FS00:0x22:       Bits: 64
04:11:44:WU01:FS00:0x22:       Mode: Release
04:11:44:WU01:FS00:0x22:************************************ CBang *************************************
04:11:44:WU01:FS00:0x22:       Date: May 31 2020
04:11:44:WU01:FS00:0x22:       Time: 20:16:34
04:11:44:WU01:FS00:0x22:   Revision: 75fcee0b8e713cb47f5191a3689d5f4f07244c7f
04:11:44:WU01:FS00:0x22:     Branch: HEAD
04:11:44:WU01:FS00:0x22:   Compiler: GNU 4.8.2 20140120 (Red Hat 4.8.2-15)
04:11:44:WU01:FS00:0x22:    Options: -std=c++11 -ffunction-sections -fdata-sections -O3 -funroll-loops
04:11:44:WU01:FS00:0x22:             -fPIC
04:11:44:WU01:FS00:0x22:   Platform: linux2 4.19.76-linuxkit
04:11:44:WU01:FS00:0x22:       Bits: 64
04:11:44:WU01:FS00:0x22:       Mode: Release
04:11:44:WU01:FS00:0x22:************************************ System ************************************
04:11:44:WU01:FS00:0x22:        CPU: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
04:11:44:WU01:FS00:0x22:     CPU ID: GenuineIntel Family 6 Model 63 Stepping 2
04:11:44:WU01:FS00:0x22:       CPUs: 6
04:11:44:WU01:FS00:0x22:     Memory: 54.92GiB
04:11:44:WU01:FS00:0x22:Free Memory: 52.81GiB
04:11:44:WU01:FS00:0x22:    Threads: POSIX_THREADS
04:11:44:WU01:FS00:0x22: OS Version: 5.3
04:11:44:WU01:FS00:0x22:Has Battery: false
04:11:44:WU01:FS00:0x22: On Battery: false
04:11:44:WU01:FS00:0x22: UTC Offset: 0
04:11:44:WU01:FS00:0x22:        PID: 4496
04:11:44:WU01:FS00:0x22:        CWD: /var/lib/fahclient/work
04:11:44:WU01:FS00:0x22:********************************************************************************
04:11:44:WU01:FS00:0x22:Project: 13411 (Run 45, Clone 0, Gen 0)
04:11:44:WU01:FS00:0x22:Unit: 0x0000000212bc7d9a5ee407ed538d849b

13411 (4, 0, 1): similar to the previous example, the first result was uploaded to the WS and not the CS.

04:28:14:WU01:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
04:28:14:WU01:FS01:Sending unit results: id:01 state:SEND error:NO_ERROR project:13411 run:4 clone:0 gen:0 core:0x$
04:28:14:WU01:FS01:Uploading 552.16KiB to 18.188.125.154
04:28:14:WU01:FS01:Connecting to 18.188.125.154:8080
04:28:18:WU01:FS01:Upload complete
04:28:18:WU01:FS01:Server responded WORK_ACK (400)
04:28:18:WU01:FS01:Final credit estimate, 22823.00 points

Possible Solution (Optional)

Unknown. There was a bug in an older version of this core where the returned result did not contain sufficient info to create the next Gen, causing the WU to be reissued. However, this was fixed in a subsequent version of the core, and the version used by the client for this WU (0.0.10) has not been reported to have this issue.


Steps To Reproduce

Unknown.


Context

A case where failed WUs are reissued to the same client was reported in #1531. It may be possible that whatever is causing the reissue of successful WUs (i.e., results not being recorded) could not be causing #1531, if that behaviour is expected to work by design.


anand-bhat commented 4 years ago

Project: 13421 (Run 1543, Clone 4, Gen 0)

@jchodera, @jcoffland: This WU was first completed and returned on 27/6 and reissued after 1 day (deadline expiration). I've also seen this behaviour for other projects, so it's not specific to this one or this range. The other observations are a lot older.

anand-bhat commented 4 years ago

Project: 13421 (Run 676, Clone 25, Gen 0)

I'll stop reporting any new ones unless there's a need to get more instances of this happening.

anand-bhat commented 4 years ago

@jcoffland -- @jchodera asked me to post a few new cases where this has happened for projects hosted on aws3.

25th August: P13422 R4478 C92 G1

More reports have been observed in the forum; see [1], [2], and [3] for some possible leads. The issue appears to have gone away after John restarted the server.

27th->31st August: P13422 R4960 C23 G2

May not be the same cause as earlier.

anand-bhat commented 4 years ago

Seen again on 2020-09-01: P13423 R1211 C27 G1

P13423 R1207 C33 G1

jchodera commented 4 years ago

aws3.foldingathome.org appears to have core-dumped and restarted around the time of the duplication:

Log Started 2020-09-02T01:43:18Z

anand-bhat commented 4 years ago

Seen again on 2020-09-08:

P13423 R1350 C14 G1

P13423 R1350 C34 G1

anand-bhat commented 4 years ago

@jchodera @jcoffland Another occurrence on 2020-09-14.

P13427 R1800 C22 G1

P13427 R2155 C4 G5

jchodera commented 4 years ago

> @jchodera @jcoffland Another occurrence on 2020-09-14.

I didn't see a core file dumped this time, and the server hasn't restarted since 2020-09-14T13:01:04Z

anand-bhat commented 4 years ago

Could this instance possibly be due to the first successful return being sent to the CS but not relayed to the WS? (I believe aws3 was down yesterday around the time these results were uploaded.) serverstats reports this WS as not having its CS connected, but my result (project:13422 run:6841 clone:51 gen:2) was uploaded to the CS at 03:02:31 UTC when the WS was down yesterday.

anand-bhat commented 4 years ago

@jchodera: I've just received P13427 R2539 C1 G0, but it appears to have been completed successfully around 5 hours ago. Can you please check?

The previous one I completed was also duplicated: P13427 R2509 C12 G1

anand-bhat commented 4 years ago

Here's one I completed around 9 AM UTC, P13427 R2305 C16 G1, that was subsequently reissued. Unfortunately, in this case, the reassigns resulted in faulty returns. I'm not sure what the retry limit is set to, but if it's the default (5), I suspect this trajectory is now incorrectly marked as failed.

09:59:37:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
09:59:37:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:13427 run:2305 clone:16 gen:1 core:0x22 unit:0x0000000412bc7d9a5f586ce7a572c716
09:59:37:WU00:FS01:Uploading 437.88KiB to 18.188.125.154
09:59:37:WU00:FS01:Connecting to 18.188.125.154:8080
09:59:38:WU00:FS01:Upload complete
09:59:38:WU00:FS01:Server responded WORK_ACK (400)
09:59:38:WU00:FS01:Final credit estimate, 14498.00 points
09:59:38:WU00:FS01:Cleaning up

jchodera commented 4 years ago

I haven't noticed any core dumps associated with these recent re-issues.

anand-bhat commented 4 years ago

@jcoffland can you please assist in troubleshooting this?

We've now seen 6 different sets of duplications/incorrect reassigns for projects on aws3, some occurring around the time there was a core dump, others, like this one, without. While this issue isn't limited to aws3, it's the server I'm watching closely as it hosts the Moonshot projects.

anand-bhat commented 4 years ago

Some more from the same time period. I won't report other cases as I think these ought to be sufficient samples for troubleshooting.

P13427 R2349 C18 G2

P13427 R2550 C8 G2

jchodera commented 4 years ago

@jcoffland : Since the server was restarted at 2020-09-16T13:59:18Z (about 13 hours ago), it looks like over 9000 WU results have been replaced:

$ grep Replacing ~/server2/log.txt  | wc -l
9348

That's ~16% of the WUs serviced during that period. Any chance you could take a look at aws3?

jchodera commented 4 years ago

I can confirm that ~16% of WUs serviced by aws3 are still being reassigned despite successful completion.

jcoffland commented 4 years ago

This is most likely because the timeout is set too short.

anand-bhat commented 4 years ago

@jcoffland, the timeout is set to 1 day; however, the reassigns are happening despite the work being returned successfully within 1 day (sometimes within an hour). The reassigns are also sometimes happening the same day the first assignment was made. The last occurrence checks both these boxes, so I suspect the cause is unlikely to be WUs timing out due to the project timeout setting.

anand-bhat commented 4 years ago

@jcoffland @jchodera , there appears to be another occurrence around 25th September:

P13427 R3818 C25 G2

P13427 R3818 C25 G3

P13427 R3818 C25 G4

P13426 R4979 C5 G1

jcoffland commented 4 years ago

I figured out what was wrong here. There was another problem on aws3.foldingathome.org that was causing a huge backlog in processing WU results. This left returned WUs in the processing queue for a very long time. During that time the assignment could time out and the WU would get reassigned. I've solved the backlog problem and modified the WS code so that it clears the assignment timeout as soon as the WU is returned, instead of after it is processed.
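In rough terms (a hypothetical C++ sketch with made-up names, not the actual WS source), the change is to mark the assignment as satisfied when the result arrives rather than when it is finally processed:

```cpp
// Hypothetical sketch of the timeout-clearing change; names are illustrative.
#include <chrono>
#include <iostream>
#include <queue>
#include <string>

using Clock = std::chrono::steady_clock;

struct Result { std::string unitID; };            // payload omitted

struct Assignment {
  Clock::time_point deadline;                     // project timeout (e.g. 1 day)
  bool resultReceived = false;
};

struct WorkServer {
  std::queue<Result> backlog;                     // results waiting to be processed

  // Old behaviour: the timeout was effectively cleared only here, after the
  // result had worked its way through the processing backlog.
  void processResult(Assignment &a, const Result &) {
    // ...credit the donor, generate the next Gen, write to the DB...
    a.resultReceived = true;
  }

  // New behaviour: mark the assignment as satisfied immediately on receipt,
  // so a long backlog can no longer let the assignment expire and be reissued.
  void onResultReceived(Assignment &a, Result r) {
    a.resultReceived = true;                      // clear the assignment timeout now
    backlog.push(std::move(r));                   // heavy processing happens later
  }

  // The reassignment sweep only reissues WUs whose timeout is still live.
  bool shouldReassign(const Assignment &a) const {
    return !a.resultReceived && Clock::now() > a.deadline;
  }
};

int main() {
  WorkServer ws;
  Assignment a{Clock::now() + std::chrono::hours(24)};
  ws.onResultReceived(a, Result{"0x0000000212bc7d9a5ee407ed538d849b"});
  std::cout << "reassign? " << std::boolalpha << ws.shouldReassign(a) << '\n';
}
```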

anand-bhat commented 4 years ago

Thanks @jcoffland, that is good to hear. Would this also have caused the cases where WUs were reassigned even though they appear to have been processed earlier?

For instance, the entire trajectory for P13427 R3818 C25 (up to and including Gen 5) was completed on 2020-09-25, indicating all previous Gens were processed. However, when the reissues started, reprocessing began from Gen 2 and caused reassigns and reprocessing for Gens 3, 4, and 5. If Gen 2 had been processed earlier to create Gen 3, I wouldn't have expected it to be resent, as it wouldn't have been in the WU results processing backlog.

anand-bhat commented 4 years ago

Another example of this:

P13426 R6586 C3 G0

P13426 R6586 C3 G1: I'm currently assigned this WU and am working on it.

P13426 R6586 C3 G2, P13426 R6586 C3 G3, P13426 R6586 C3 G4, P13426 R6586 C3 G5 have all been successfully completed but I suspect they'll be reassigned once I return Gen 1.

jcoffland commented 4 years ago

Once the gen has moved forward it shouldn't ever move back. So no, that doesn't make sense. I do see in the log file that the gen somehow got reset to 2 after the WS was restarted. That may mean it failed to write to its DB. I'm looking into it.
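Roughly speaking (a hypothetical sketch assuming a simple file-backed store in place of the real DB), the invariant is that a Gen advance only counts once it has been durably written, and it never moves backwards:

```cpp
// Hypothetical sketch of a durable, monotonic per-trajectory Gen counter.
#include <fstream>
#include <iostream>
#include <map>
#include <string>

class GenStore {
  std::map<std::string, int> highestGen;        // key: "project/run/clone"
  std::string path;

public:
  explicit GenStore(std::string p) : path(std::move(p)) { load(); }

  // Advance only monotonically, and report whether the change was persisted.
  bool advance(const std::string &traj, int gen) {
    auto it = highestGen.find(traj);
    if (it != highestGen.end() && it->second >= gen) return true;  // never move back
    highestGen[traj] = gen;
    return save();  // a failed save means the advance may be lost on restart
  }

  int current(const std::string &traj) const {
    auto it = highestGen.find(traj);
    return it == highestGen.end() ? -1 : it->second;
  }

private:
  void load() {
    std::ifstream in(path);
    std::string traj;
    int gen;
    while (in >> traj >> gen) highestGen[traj] = gen;
  }

  bool save() const {
    std::ofstream out(path, std::ios::trunc);
    for (const auto &[traj, gen] : highestGen) out << traj << ' ' << gen << '\n';
    return static_cast<bool>(out);                // false if the write failed
  }
};

int main() {
  GenStore store("gen_store.txt");
  if (!store.advance("13427/3818/25", 5))
    std::cerr << "warning: gen advance not persisted\n";
  std::cout << "current gen: " << store.current("13427/3818/25") << '\n';
}
```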

anand-bhat commented 4 years ago

Thanks. If you need another example, P13426 R6601 C28 -- all 5 Gens were completed but it has started reassignments from Gen 0.

anand-bhat commented 4 years ago

@jcoffland, hopefully you've found what's causing this. I just returned P13426 R5601 C27 G3 and this was the 4th successful return, all well before the timeout, including two reassigns today alone.

bdenhollander commented 4 years ago

@jcoffland I returned P13426 R5990 C18 G4 and noticed that Gen 5 was already done before I was assigned Gen 4. Earlier Gens were completed several times (screenshots for Gens 0 through 5).

bdenhollander commented 4 years ago

Received P13426 R5308 C24 G4, which already has 2 successful returns; Gen 5 has also already been returned twice. Does the work server check for the presence of a result before generating the next work unit from a late return?

(screenshots for Gen 4 and Gen 5)
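To illustrate the kind of check I mean (a hypothetical sketch with made-up names, not the actual WS code): a late or duplicate return for a Gen that already has an accepted result would simply be absorbed instead of triggering new work.

```cpp
// Hypothetical sketch: ignore duplicate/late returns so they can't spawn new Gens.
#include <iostream>
#include <set>
#include <string>
#include <tuple>

struct WU { int project, run, clone, gen; };

using WUKey = std::tuple<int, int, int, int>;

class ResultLedger {
  std::set<WUKey> completed;                    // Gens with an accepted result

public:
  bool hasResult(const WU &wu) const {
    return completed.count({wu.project, wu.run, wu.clone, wu.gen}) > 0;
  }

  // Handle a returned result: only the first accepted result for a Gen should
  // trigger generation of the next Gen; late duplicates are ignored.
  bool onReturn(const WU &wu) {
    if (hasResult(wu)) return false;            // duplicate/late return: no new work
    completed.insert({wu.project, wu.run, wu.clone, wu.gen});
    return true;                                // caller may generate Gen wu.gen + 1
  }
};

int main() {
  ResultLedger ledger;
  WU gen4{13426, 5308, 24, 4};
  std::cout << std::boolalpha
            << ledger.onReturn(gen4) << '\n'    // true: first return, generate Gen 5
            << ledger.onReturn(gen4) << '\n';   // false: late duplicate, do nothing
}
```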

bdenhollander commented 4 years ago

> @jcoffland I returned P13426 R5990 C18 G4 and noticed that Gen 5 was already done before I was assigned Gen 4. Earlier Gens were completed several times.

There's another assignment and completion of Gen 4 (screenshots for Gen 4 and Gen 5).

anand-bhat commented 4 years ago

Just a quick note to point out that the reassigns are not artifacts from the 25th/26th. Even freshly completed WUs are continuing to be reassigned.

P13427 R5301 C18 G0

bdenhollander commented 4 years ago

Currently working on P13426 R6015 C3 G5, which was already completed 3 days ago. Gen 0 is the worst I've seen for reassigns. The first successful OK was past the 1-day deadline, so 2 of the 7 completions are justified.

P13426 R6015 C3 G0 (screenshot)

anand-bhat commented 4 years ago

@jcoffland, some new cases of duplicated assignments, seen for P13429 being run from aws3:

P13429 R11287 C9 G1: It also appears that a return classified as faulty was granted full credit. It looks like the return from the 2nd assignment, which arrived at 04:13:16 and was late (it came after the reassign deadline [1 day], after the WU had been reassigned a 3rd time), triggered the 4th reassignment at 04:13:17, which then failed and caused the other two assignments. The 2nd return, if it was faulty, shouldn't have been granted full credit (maybe a different issue).

P13429 R12394 C7 G0: In this case, the reassignment was definitely not due to a deadline timeout, as the WU was returned within the window (1 day).