0x21-produced WU results fail to be accepted (WORK_QUIT) on v9.0.8 CS. #1182

Closed: bb30994 closed this issue 7 years ago

bb30994 commented 7 years ago

When communication fails after a WU has started uploading to the WS, the client retries the upload to the CS. Unfortunately, the anti-duplication flag has already been set, preventing the CS from accepting the WU, and the result is dumped.

This is essentially a high-order race condition. Whichever server issues the first "Upload complete" ACK should award credit and initiate the client's cleanup process. The server holding the incomplete copy (in this case the WS) should then discard that copy and clear the WU-busy flag.
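For context, a minimal sketch of the client-side fallback visible in the log below: the result is uploaded to the work server first and, only if that transfer fails, re-sent to the collection server. The function names and the simulated upload are illustrative assumptions, not the actual FAHClient code; only the response codes (WORK_ACK = 400, WORK_QUIT = 404) come from the logs in this thread.

```cpp
// Hypothetical sketch of the WS -> CS fallback seen in the log below;
// names, signatures and the fake upload are illustrative, not FAHClient code.
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

const int WORK_ACK  = 400;   // results accepted (see the success log near the end)
const int WORK_QUIT = 404;   // results rejected; the client dumps the WU

// `upload` stands in for the real HTTP transfer: it returns the server's
// response code, or throws if the connection drops mid-upload.
int sendUnitResults(const std::function<int (const std::string &)> &upload,
                    const std::string &ws, const std::string &cs) {
  try {
    return upload(ws);                    // first attempt: the work server

  } catch (const std::exception &e) {
    std::cerr << "Failed to send results to work server: " << e.what() << '\n';
    // The WS may already hold a partial copy of this WU; this retry is where
    // the question in this issue arises: will the CS still accept the result?
    return upload(cs);                    // fallback: the collection server
  }
}

int main() {
  // Simulate the failure in the log below: the WS transfer dies part way
  // through and the CS then answers WORK_QUIT (404).
  auto upload = [](const std::string &server) -> int {
    if (server == "171.67.108.157") throw std::runtime_error("Transfer failed");
    return WORK_QUIT;
  };
  std::cout << sendUnitResults(upload, "171.67.108.157", "171.67.108.46") << '\n';
}
```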

10:47:38:WU00:FS01:0x21:Completed 6250000 out of 6250000 steps (100%)
10:47:40:WU00:FS01:0x21:Saving result file logfile_01.txt
10:47:40:WU00:FS01:0x21:Saving result file checkpointState.xml
10:47:40:WU00:FS01:0x21:Saving result file checkpt.crc
10:47:40:WU00:FS01:0x21:Saving result file log.txt
10:47:40:WU00:FS01:0x21:Saving result file positions.xtc
10:47:40:WU00:FS01:0x21:Folding@home Core Shutdown: FINISHED_UNIT
10:47:40:WU00:FS01:FahCore returned: FINISHED_UNIT (100 = 0x64)
10:47:40:WU00:FS01:Sending unit results: id:00 state:SEND error:NO_ERROR project:9431 run:684 clone:0 gen:13 core:0x21 unit:0x00000014ab436c9d586fdd39dc8c25ac
10:47:40:WU00:FS01:Uploading 13.61MiB to 171.67.108.157
10:47:40:WU00:FS01:Connecting to 171.67.108.157:8080
10:47:46:WU00:FS01:Upload 3.67%
10:47:52:WU00:FS01:Upload 8.26%
10:47:58:WU00:FS01:Upload 12.85%
10:48:04:WU00:FS01:Upload 17.45%
10:48:10:WU00:FS01:Upload 22.04%
10:48:16:WU00:FS01:Upload 26.17%
10:48:48:WU00:FS01:Upload 30.30%
10:48:48:WARNING:WU00:FS01:Exception: Failed to send results to work server: Transfer failed
10:48:48:WU00:FS01:Trying to send results to collection server
10:48:48:WU00:FS01:Uploading 13.61MiB to 171.67.108.46
10:48:48:WU00:FS01:Connecting to 171.67.108.46:8080
10:48:54:WU00:FS01:Upload 3.67%
10:49:00:WU00:FS01:Upload 8.26%
10:49:06:WU00:FS01:Upload 12.40%
10:49:12:WU00:FS01:Upload 16.07%
10:49:18:WU00:FS01:Upload 20.20%
10:49:24:WU00:FS01:Upload 24.79%
10:49:30:WU00:FS01:Upload 29.38%
10:49:36:WU00:FS01:Upload 33.97%
10:49:42:WU00:FS01:Upload 38.56%
10:49:48:WU00:FS01:Upload 43.15%
10:49:54:WU00:FS01:Upload 47.29%
10:50:00:WU00:FS01:Upload 51.88%
10:50:06:WU00:FS01:Upload 56.47%
10:50:12:WU00:FS01:Upload 61.06%
10:50:18:WU00:FS01:Upload 65.65%
10:50:24:WU00:FS01:Upload 70.24%
10:50:30:WU00:FS01:Upload 74.83%
10:50:36:WU00:FS01:Upload 78.96%
10:50:42:WU00:FS01:Upload 83.56%
10:50:48:WU00:FS01:Upload 88.15%
10:50:54:WU00:FS01:Upload 92.74%
10:51:00:WU00:FS01:Upload 97.33%
10:51:05:WU00:FS01:Upload complete
10:51:05:WU00:FS01:Server responded WORK_QUIT (404)
10:51:05:WARNING:WU00:FS01:Server did not like results, dumping
10:51:05:WU00:FS01:Cleaning up
jcoffland commented 7 years ago

There is no such thing as an anti-duplication flag or WU-busy flag on the CS. The CS dumped the WU for other reasons. I found this in the CS log:

10:51:06:I2:#332796:Request WORK_RESULTS
10:51:06:I2:#332796:Client: 'Allan_Smith=0xf27c5a0657af9180@92.7.3.74' WORK_RESULTS
10:51:06:W :#332796:Job 0x586fdd39dc8c25ac-13 from client Allan_Smith=0xf27c5a0657af9180@92.7.3.74 not accepted due to exception: Tar file read failed: Success
10:51:06:I3:#332796:Response WORK_QUIT

This error indicates that the WU results were corrupt. To get to this point the WU must have had a payload, but for some reason that payload did not contain a valid tar header. WU results are supposed to contain a tar archive holding all the result files. Looking at the source code (TarHeader.cpp), the only way it can fail to read the tar header is if there are fewer than 512 bytes in the WU results after the F@H headers. However, since the WU took over two minutes to upload, it must have been much larger than that.
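To illustrate that failure mode, here is a minimal sketch of a 512-byte tar-header read. This is not the actual TarHeader.cpp, just an assumption about how such an error string could be produced: on a clean end-of-stream no system call has failed, so errno is still 0 and strerror() reports "Success", which would match the odd "Tar file read failed: Success" message in the CS log.

```cpp
// Minimal sketch -- NOT the actual TarHeader.cpp -- of how a short read of
// the 512-byte tar header could surface as "Tar file read failed: Success".
#include <cerrno>
#include <cstring>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

const std::streamsize TAR_HEADER_SIZE = 512;   // POSIX ustar block size

void readTarHeader(std::istream &in) {
  char block[TAR_HEADER_SIZE];

  in.read(block, TAR_HEADER_SIZE);
  if (in.gcount() != TAR_HEADER_SIZE)
    // On a clean end-of-stream no system call failed, so errno is still 0
    // and strerror(0) is "Success" -- hence the confusing log message.
    throw std::runtime_error(std::string("Tar file read failed: ") +
                             std::strerror(errno));

  // ... header checksum verification and field parsing would follow here ...
}

int main() {
  std::istringstream tooShort("fewer than 512 bytes after the F@H headers");
  try { readTarHeader(tooShort); }
  catch (const std::exception &e) { std::cerr << e.what() << '\n'; }
}
```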

To get to the point it did, the CS must have validated the checksums. This indicates that the core created and signed the WU results and that the data arrived at the CS intact.

Since project 9431 is an OpenMM Core 0x21 project, there are two likely possibilities:

  1. Core 0x21 has a bug.
  2. The CS (v9.0.8) has a bug.

If someone could capture one of these wuresults_01.dat files I could analyze the problem further.

bb30994 commented 7 years ago

OK. I don't have access to that server message, and I have no way to collect results containing whatever error would be required to reproduce it (the client normally deletes them).

The previous occurrences I've seen happened only when an upload to a WS started and failed, the result was later sent to a CS, and the WU was dumped with a WORK_QUIT status. Wouldn't the best place to collect such occurrences (which are rare) be on the server, after a WORK_QUIT?

Some of the previous occurrences were on cores other than Core 0x21. That suggests a bug in either the wrapper or the CS.

(From the log above: "Uploading 13.61MiB".)

AllanJM-Smith commented 7 years ago

"There is no such think as an anti-duplication flag or WU busy flag on the CS. The CS dumped the WU for other reasons. I found this in the CS log:"

Just out of interest, how would the CS respond if a WU that had already been sent to the WS were also sent to the CS? Would it differ from what we see?

BTW, if you look at the original thread ( https://foldingforum.org/viewtopic.php?f=19&t=29682&p=293197#p293197 , "Rejection due to transmission failure?") you will see that this is not a unique occurrence.

It is in the interests of both the researchers (who want the results) and the volunteers (who get the points) that, if this is a real problem, it gets sorted out.

IMHO that section of the CS log does not tell the full story; there are also relevant entries in the WS log. Additionally, it does not add anything to the original log entries.

It would also be nice to know what proportion of WUs resent to the CS after a partial transmission to the WS fail on the resend.

jcoffland commented 7 years ago

@AllanJM-Smith, I can fix this if someone can capture one of the failing wuresults_01.dat files before the client deletes it.

bb30994 commented 7 years ago

Is there a client option that would allow it to retain the uploads (e.g. move them to some kind of history file so they can be sent to you or deleted manually)?

jcoffland commented 7 years ago

There's no easy or obvious way that I know of. Normally the client deletes the WU results right after the WS returns WORK_QUIT. You could try either disconnecting the network or disabling it by configuring an invalid proxy after the WU has started. This would allow you to copy the WU results before the client attempts to upload.
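For example, a hypothetical config.xml fragment might look like the one below. This is only a guess: the proxy and proxy-enable option names are assumptions, and the fragment is untested.

```xml
<!-- Hypothetical config.xml fragment (option names are an assumption):
     point the client at a proxy that nothing is listening on, so the
     finished WU stays in the work/ directory long enough to copy
     wuresults_01.dat before the upload can start. -->
<config>
  <proxy-enable v="true"/>
  <proxy v="127.0.0.1:1"/>  <!-- no proxy listens here, so uploads stall -->
</config>
```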

I have not tried this myself so let me know if it works.

bb30994 commented 7 years ago

That would capture a WU, but we have no way to predict whether the current WU is going to produce a WORK_QUIT scenario, and a very large percentage of WUs upload successfully. You're suggesting that we capture EVERY WU we process: allow it to upload, observe whether it fails, and if it uploads successfully, discard the captured copy and move on. Repeat that process many, many, many times until one fails.
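For what it's worth, that capture-and-discard loop could be automated by a small watcher that copies every wuresults file before the client can delete it. A minimal sketch, with assumed paths and polling interval (adjust to your installation); this is not part of any F@H tooling:

```cpp
// Minimal sketch of an automated capture loop: keep a copy of every
// wuresults_*.dat so one survives if the upload ends in WORK_QUIT.
// The directories and polling interval are assumptions; adjust as needed.
#include <chrono>
#include <filesystem>
#include <string>
#include <thread>

namespace fs = std::filesystem;

int main() {
  const fs::path workDir = "/var/lib/fahclient/work";      // client work dir (assumed)
  const fs::path saveDir = "/var/lib/fahclient/captured";  // where copies are kept
  fs::create_directories(saveDir);

  for (;;) {
    try {
      for (const auto &entry : fs::directory_iterator(workDir)) {
        const std::string name = entry.path().filename().string();
        if (name.rfind("wuresults_", 0) == 0)
          // Re-copy only when the source is newer than the saved copy.
          fs::copy_file(entry.path(), saveDir / name,
                        fs::copy_options::update_existing);
      }
    } catch (const fs::filesystem_error &) {
      // The client may delete a file mid-scan; just try again next pass.
    }
    std::this_thread::sleep_for(std::chrono::seconds(5));
  }
}
```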

jay-mitchell commented 7 years ago

This has been happening to me on a few of my machines, so I've tried to reproduce it. First, on a hunch that this is not even related to the connection interruptions, I blocked the work server with iptables so that fahclient would upload the result to the collection server, which responded WORK_QUIT. It seems to me that collection server 171.67.108.46 is rejecting most if not all results assigned by work server 171.67.108.157. I managed to collect a wuresults_01.dat by bringing my network interface down until after the WU finished so I could make a copy. I have uploaded it at:

https://www.dropbox.com/s/vufsr5j5qorenab/wuresults_01.dat?dl=0

Here are the relevant log entries for the work unit results I have uploaded:

12:41:09:WU01:FS00:Sending unit results: id:01 state:SEND error:NO_ERROR project:9414 run:1627 clone:0 gen:91 core:0x21 unit:0x0000006dab436c9d585e069c073aef0e
12:41:09:WU01:FS00:Uploading 7.77MiB to 171.67.108.157
12:41:09:WU01:FS00:Connecting to 171.67.108.157:8080
12:41:10:WARNING:WU01:FS00:WorkServer connection failed on port 8080 trying 80
12:41:10:WU01:FS00:Connecting to 171.67.108.157:80
12:41:11:WARNING:WU01:FS00:Exception: Failed to send results to work server: Failed to connect to 171.67.108.157:80: Connection refused
12:41:11:WU01:FS00:Trying to send results to collection server
12:41:11:WU01:FS00:Uploading 7.77MiB to 171.67.108.46
12:41:11:WU01:FS00:Connecting to 171.67.108.46:8080
12:41:17:WU01:FS00:Upload 21.72%
12:41:23:WU01:FS00:Upload 41.03%
12:41:29:WU01:FS00:Upload 85.27%
12:41:31:WU01:FS00:Upload complete
12:41:31:WU01:FS00:Server responded WORK_QUIT (404)
12:41:31:WARNING:WU01:FS00:Server did not like results, dumping
12:41:31:WU01:FS00:Cleaning up

I've only seen this happen to WUs from work servers 171.67.108.157 and 140.163.4.23*. WUs from work servers 171.67.108.160 and 171.64.65.84 have always successfully returned to their collection servers when the work server was offline or the connection was interrupted.

If you would like any other wuresults_01.dat files captured, or even a Wireshark capture, just let me know.

If any of the affected work servers went down for an extended period of time, it could result in a huge amount of lost work.

bb30994 commented 7 years ago

My understanding of what happens "behind the scenes" is limited. At the time of the previous log, we do know that the CS was operational and that jay-mitchell's client had an open channel to that CS. At that point, I can only guess what might have happened, but I can suggest what might be worth looking at.

We can guess that the corresponding WS was probably operational, but the channel between jay-mitchell's client and the WS was intentionally blocked. We do NOT know the state of the communications channel between the CS and the WS. Was that channel open? Down? In some other strange state? And would the answer to that question somehow influence the WORK_QUIT result?

If those are not useful questions, forget it.

jay-mitchell commented 7 years ago

I don't think the problem is in the communication between the WS and the CS, as the CS will accept a report of the work being dumped (I tested). The first question on my mind is whether my WU produced the same server error as AllanJM-Smith's WU (not accepted due to exception: Tar file read failed: Success); I guess jcoffland would know. If so, the issue can be reproduced easily and should not be too hard to fix.

This seems like a serious issue to me: whether the connection to the WS was interrupted or the WS was blocked in a firewall (which, for the purposes of this argument, I'm assuming is equivalent to the WS being offline), the client tries to upload the WU to the CS, and WUs from WS 171.67.108.157 (and others) will be rejected by the CS. So if that's true and an affected WS were offline for a period of time, all the WUs completed in that period would be dumped by the CS, wasting lots of work.

Sorry about all the assumptions, but since I don't have access to the source, it's all I can do.

bb30994 commented 7 years ago

Ditto.

The purpose of the CS is to accept WUs when the WS happens to be offline and pass them on whenever it can. That function usually works. The question is: Why was this event different? For that we need Joseph's in-depth analysis.

jcoffland commented 7 years ago

@jay-mitchell, I would like to look this failure up in the CS log files, but I need two pieces of information: 1) your F@H user name, and 2) the date when this occurred.

jcoffland commented 7 years ago

@jay-mitchell, there is something wrong with the WU you posted on Dropbox. Either I'm doing something wrong, it was tampered with, or the data was corrupted. This could explain why the WS and CS rejected the WU.

jay-mitchell commented 7 years ago

My F@H username was Anonymous and the date was Sep 30; the time was around 12:41:31 GMT. I never did any tampering; all I did was block the WS in my firewall in an attempt to reproduce an error I and others had experienced, where after failing to submit a WU to the WS (e.g. due to a connection interruption) it would be rejected by the CS with WORK_QUIT. I downloaded the wuresults file I uploaded to Dropbox and it matches the copy on my hard drive (SSD), so I can't see how it could have been corrupted.

shasum : 59459ce3bd04af5dcb7ed67ebc7c88e8ed10e225

This is reproducible: just set up a fahclient, get a WU from 171.67.108.157, then block 171.67.108.157 in your firewall and watch the CS reject the result.

I also forgot to mention my platform earlier: 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u5 (2017-09-19) x86_64 GNU/Linux

jcoffland commented 7 years ago

I get the same shasum. I found the log entry; it has the same Tar file read failed: Success error. I was able to unpack your WU, so it appears to be OK. I see where the error occurs in the code, but I'm not quite sure why. I've upgraded the CS to the latest code and the error has not occurred again, so far. I'll keep an eye on it.

bb30994 commented 7 years ago

(The error not occurring again doesn't mean much; these errors have been relatively rare. But finding where it occurs in the code is a big step forward. Recovery after a timeout was/is always problematic without more information.)

jay-mitchell commented 7 years ago

Updating the CS appears to have solved this issue, as I am no longer able to reproduce it. Below is a log of a WU from WS 171.67.108.157 being accepted by CS 171.67.108.46:

23:44:16:WU00:FS00:Sending unit results: id:00 state:SEND error:NO_ERROR project:9414 run:2024 clone:0 gen:247 core:0x21 unit:0x00000121ab436c9d585e069f681e8622
23:44:16:WU00:FS00:Uploading 7.75MiB to 171.67.108.157
23:44:16:WU00:FS00:Connecting to 171.67.108.157:8080
23:44:17:WARNING:WU00:FS00:WorkServer connection failed on port 8080 trying 80
23:44:17:WU00:FS00:Connecting to 171.67.108.157:80
23:44:19:WARNING:WU00:FS00:Exception: Failed to send results to work server: Failed to connect to 171.67.108.157:80: Connection refused
23:44:19:WU00:FS00:Trying to send results to collection server
23:44:19:WU00:FS00:Uploading 7.75MiB to 171.67.108.46
23:44:19:WU00:FS00:Connecting to 171.67.108.46:8080
23:44:25:WU00:FS00:Upload 4.84%
23:44:31:WU00:FS00:Upload 23.39%
23:44:37:WU00:FS00:Upload 83.88%
23:44:39:WU00:FS00:Upload complete
23:44:39:WU00:FS00:Server responded WORK_ACK (400)
23:44:39:WU00:FS00:Final credit estimate, 35037.00 points
23:44:39:WU00:FS00:Cleaning up
jcoffland commented 7 years ago

I vaguely remember fixing this issue some time ago. The CS was still running older code.