FoldingAtHome / fah-issues

Client does not like uploading on a congested network even when there is bandwidth avail. #335

Closed jcoffland closed 9 years ago

jcoffland commented 13 years ago
Trac Data
Ticket 335
Reported by @P5-133XL
Status closed
Component FAHClient
Priority 3 (major)
Milestone v7.1.5
Version 7.1

I was running a BitTorrent application on a different machine on the same network as the folding machine. I found that the folding client would consistently error out when uploading WUs. I kept decreasing the outbound bandwidth used by the BitTorrent application. As the available bandwidth increased, the client would take longer to fail, but it would still fail. Eventually, I just suspended the torrent application, and then the folding client had no problems uploading.

13:07:48:

13:07:48:

13:07:48:Enabled computation slot 00: READY smp:4

13:07:48:Enabled computation slot 01: READY gpu:0:NVIDIA_G92

13:07:48:Enabled computation slot 02: READY gpu:1:NVIDIA_G92

13:07:48:WARNING: Unit 00 missing data files, dumping

13:07:48:Started thread 1 on PID 4200

13:07:48:Sending unit results: ID:02 State:SEND Project:6701 Run:80 Clone:16 Gen:69 Core:0xa3 Unit:0x0118c6c04cfabc890045001000501a2d

13:07:48:Uploading 41.65MiB13:07:48:Connecting to 171.64.65.56:8080

......................................................................................................................................................................................................................................................................................done

13:07:51:Server connection id=1 on 0.0.0.0:36330 from 127.0.0.1:64391

13:07:51:Started thread 3 on PID 4200

13:08:14:WARNING: Exception: Failed to send results to work server: 0: Upload failed

13:08:14:Server connection id=2 on 0.0.0.0:36330 from 127.0.0.1:64392

13:08:14:Started thread 4 on PID 4200

13:08:14:Trying to send results to collection server

13:08:14:Uploading 41.65MiB13:08:14:Connecting to 171.67.108.25:8080

...done

13:08:15:ERROR: Exception: 0: Upload failed

13:08:15:Starting Unit: 05

13:08:15:Running core: C:/ProgramData/FAHClient/cores/www.stanford.edu/~pande/Win32/x86/Core_a3.fah/FahCore_a3.exe -dir 05 -suffix 01 -lifeline 4200 -version 459012 -checkpoint 15 -np 4

13:08:15:Started core on PID 1744

13:08:15:Started thread 5 on PID 4200

13:08:15:Core 0xa3 started

13:08:15:Starting Unit: 03

13:08:15:Running core: C:/ProgramData/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/G80/Core_11.fah/FahCore_11.exe -dir 03 -suffix 01 -lifeline 4200 -version 459012 -checkpoint 15 -gpu 1

13:08:15:Started core on PID 4172

13:08:15:Core 0x11 started

13:08:15:Started thread 6 on PID 4200

13:08:16:Starting Unit: 01

13:08:16:Running core: C:/ProgramData/FAHClient/cores/www.stanford.edu/~pande/Win32/AMD64/NVIDIA/G80/Core_11.fah/FahCore_11.exe -dir 01 -suffix 01 -lifeline 4200 -version 459012 -checkpoint 15 -gpu 0

13:08:16:Started core on PID 4820

13:08:16:Started thread 7 on PID 4200

13:08:16:Core 0x11 started

13:08:16:Sending unit results: ID:00 State:SEND Project:6702 Run:10 Clone:54 Gen:73 Core:0xa3 Unit:0x779503824cfddc3000490036000a1a2e

13:08:16:Uploading 41.66MiB13:08:16:Connecting to 171.64.65.56:8080

...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................done

13:08:23:Unit 05:

13:08:23:Unit 03:

13:08:24:Unit 01:

13:10:41:WARNING: Exception: Failed to send results to work server: 0: Upload failed

13:10:41:Unit 05:------------------------------

13:10:41:Unit 03:------------------------------

13:10:41:Unit 01:------------------------------

13:10:41:Trying to send results to collection server

13:10:41:Unit 05:Folding@Home Gromacs SMP Core

13:10:41:Unit 03:Folding@Home GPU Core

13:10:41:Unit 01:Folding@Home GPU Core

13:10:41:Unit 05:Version 2.22 (Mar 12, 2010)

13:10:41:Unit 03:Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)

13:10:41:Unit 01:Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)

13:10:41:Unit 05:

13:10:41:Unit 03:

13:10:41:Unit 01:

13:10:41:Unit 05:Preparing to commence simulation

13:10:41:Unit 03:Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86

13:10:41:Unit 01:Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86

13:10:41:Unit 05:- Looking at optimizations...

13:10:41:Unit 03:Build host: amoeba

13:10:41:Unit 01:Build host: amoeba

13:10:41:Unit 05:- Created dyn

13:10:41:Unit 03:Board Type: Nvidia

13:10:41:Unit 01:Board Type: Nvidia

13:10:41:Unit 05:- Files status OK

13:10:41:Unit 03:Core :

13:10:41:Unit 01:Core :

13:10:41:Unit 05:- Expanded 1765688 -> 2254597 (decompressed 127.6 percent)

13:10:41:Unit 03:Preparing to commence simulation

13:10:41:Unit 01:Preparing to commence simulation

13:10:41:Unit 05:Called DecompressByteArray: compressed_data_size=1765688 data_size=2254597, decompressed_data_size=2254597 diff=0

13:10:41:Unit 03:- Looking at optimizations...

13:10:41:Unit 01:- Looking at optimizations...

13:10:41:Unit 05:- Digital signature verified

13:10:41:Unit 03:- Files status OK

13:10:41:Unit 01:- Files status OK

13:10:41:Unit 05:

13:10:41:Unit 03:- Expanded 63073 -> 336988 (decompressed 534.2 percent)

13:10:41:Unit 01:- Expanded 65402 -> 344335 (decompressed 526.4 percent)

13:10:41:Unit 05:Project: 6061 (Run 0, Clone 124, Gen 104)

13:10:41:Unit 03:Called DecompressByteArray: compressed_data_size=63073 data_size=336988, decompressed_data_size=336988 diff=0

13:10:41:Unit 01:Called DecompressByteArray: compressed_data_size=65402 data_size=344335, decompressed_data_size=344335 diff=0

13:10:41:Unit 05:

13:10:41:Unit 03:- Digital signature verified

13:10:41:Unit 01:- Digital signature verified

13:10:41:Unit 05:Assembly optimizations on if available.

13:10:41:Unit 03:

13:10:41:Unit 01:

13:10:41:Unit 05:Entering M.D.

13:10:41:Unit 03:Project: 10506 (Run 3, Clone 214, Gen 84)

13:10:41:Unit 01:Project: 5782 (Run 10, Clone 63, Gen 338)

13:10:41:Unit 05:Completed 0 out of 500000 steps (0%)

13:10:41:Unit 03:

13:10:41:Unit 01:

13:10:41:Unit 03:Assembly optimizations on if available.

13:10:41:Unit 01:Assembly optimizations on if available.

13:10:41:Unit 03:Entering M.D.

13:10:41:Unit 01:Entering M.D.

13:10:41:Unit 03:Will resume from checkpoint file

13:10:41:Unit 01:Will resume from checkpoint file

13:10:41:Unit 03:Tpr hash 03/wudata_01.tpr: 4001527577 1601609012 3738066044 3073432570 2790243872

13:10:41:Unit 01:Tpr hash 01/wudata_01.tpr: 886299789 2904344194 3281963922 1888798687 227880410

13:10:41:Unit 03:

13:10:41:Unit 01:

13:10:41:Unit 03:Calling fah_main args: 14 usage=100

13:10:41:Unit 01:Calling fah_main args: 14 usage=100

13:10:41:Unit 03:

13:10:41:Unit 01:

13:10:41:Unit 03:Working on Protein

13:10:41:Unit 01:Working on Giving Russians Opium May Alter Current Situation

13:10:41:Unit 03:Client config unavailable.

13:10:41:Unit 01:Client config unavailable.

13:10:41:Unit 03:Starting GUI Server

13:10:41:Unit 01:Starting GUI Server

13:10:41:Unit 03:Resuming from checkpoint

13:10:41:Unit 01:Resuming from checkpoint

13:10:41:Unit 03:fcCheckPointResume: retreived and current tpr file hash:

13:10:41:Unit 01:fcCheckPointResume: retreived and current tpr file hash:

13:10:41:Unit 03: 0 4001527577 4001527577

13:10:41:Unit 01: 0 886299789 886299789

13:10:41:Unit 03: 1 1601609012 1601609012

13:10:41:Unit 01: 1 2904344194 2904344194

13:10:41:Unit 03: 2 3738066044 3738066044

13:10:41:Unit 01: 2 3281963922 3281963922

13:10:41:Unit 03: 3 3073432570 3073432570

13:10:41:Unit 01: 3 1888798687 1888798687

13:10:41:Unit 03: 4 2790243872 2790243872

13:10:41:Unit 01: 4 227880410 227880410

13:10:41:Unit 03:fcCheckPointResume: file hashes same.

13:10:41:Unit 01:fcCheckPointResume: file hashes same.

13:10:41:Unit 03:fcCheckPointResume: state restored.

13:10:41:Unit 01:fcCheckPointResume: state restored.

13:10:41:Unit 03:Verified 03/wudata_01.log

13:10:41:Unit 01:Verified 01/wudata_01.log

13:10:41:Uploading 41.66MiB13:10:41:Connecting to 171.67.108.25:8080

...done

13:10:41:Unit 03:Verified 03/wudata_01.edr

13:10:41:Unit 01:Verified 01/wudata_01.edr

13:10:42:ERROR: Exception: 0: Upload failed

13:10:42:Unit 03:Verified 03/wudata_01.xtc

13:10:42:Unit 01:Verified 01/wudata_01.xtc

13:10:42:Unit 03:Completed 73%

13:10:42:Unit 01:Completed 98%

13:10:42:Sending unit results: ID:02 State:SEND Project:6701 Run:80 Clone:16 Gen:69 Core:0xa3 Unit:0x0118c6c04cfabc890045001000501a2d

13:10:42:Uploading 41.65MiB13:10:42:Connecting to 171.64.65.56:8080

................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

jcoffland commented 13 years ago

Comment by @jcoffland A transfer will be aborted by the client if it is unable to send anything for 30 seconds. I did recently fix a problem with the socket timeout handling, so your problem might be solved. Please retest after v7.1.5 is released.
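For readers unfamiliar with this kind of rule, here is a minimal sketch of a "no progress for 30 seconds" abort in Python. It is not FAHClient's code. The names `StallTimer`, `upload`, and the `send` callable are hypothetical, and `send` is assumed to return the number of bytes the socket accepted (0 when the link is too congested to take any).

```python
import time


class StallTimer:
    """Tracks how long a transfer has gone without sending any bytes."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout          # assumed 30 s, per the comment above
        self.clock = clock              # injectable so the logic can be tested
        self.last_progress = clock()

    def record(self, bytes_sent):
        # Only real progress resets the timer; a zero-byte write does not.
        if bytes_sent > 0:
            self.last_progress = self.clock()

    def stalled(self):
        return self.clock() - self.last_progress >= self.timeout


def upload(chunks, send, timer):
    """Send chunks in order; raise TimeoutError once the timer reports a stall.

    `send(data)` is a hypothetical callable returning how many bytes were
    accepted. A real client would wait on the socket (e.g. with select)
    between attempts instead of retrying immediately as this sketch does.
    """
    total = 0
    for chunk in chunks:
        offset = 0
        while offset < len(chunk):
            sent = send(chunk[offset:])
            timer.record(sent)
            offset += sent
            total += sent
            if timer.stalled():
                raise TimeoutError(
                    "Upload failed: no data sent for %g seconds" % timer.timeout
                )
    return total
```

Under a rule like this, the deadline is measured from the last successful write, not from the start of the transfer. An upload that keeps trickling data can run for a long time, while one that is fully starved fails after the timeout.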

jcoffland commented 13 years ago

Comment by @jcoffland Resolved in v7.1.5.