jchodera opened 9 years ago
Is there an action I should be taking here? The slack comments seem to suggest we've had similar issues with other projects; is it possible that our benchmark calculation simply needs to be recalibrated to be consistent with what other users expect to see on their own GPUs?
The problem I am worried about is that the points they're seeing in the distributed WUs are wrong.
I'm going to do the following to purge all the jobs:

```
server2/config.xml
rm -rf data/SVR2359493873/PROJ*
server2/config.xml
```
I'm seeing this in the logs, but I'm not sure how important this is:
```
0:17:52:47:I3:Connecting to 140.163.4.241:8084
0:17:52:47:E :Exception: SSL connect failed: ERROR_SSL, error:14094410:SSL routines:SSL3_READ_BYTES:sslv3 alert handshake failure
```
Jobs are being regenerated, but this SSL issue might be problematic. I have an email out to Joseph Coffland (who wrote the work server code) and Rick Knospler (our NJDC hardware guy who handles basic OS operations and SSL certs).
We're back up and running again. The SSL issue was a red herring.
People are still complaining about low PPD.
Can you guys try benchmarking on a machine containing a GTX-980, editing the `benchmark.py` script to say that a GTX-980 should get on average 300,000 PPD?
There's a GTX-980 in the machine in my office, or on my desk, if you need to steal it. Not sure if we have another dev box with a 980.
I'm wondering if benchmarking with the 780 is just really wonky and atypical.
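As a sanity check on what base points a rebenchmark would need to produce, here's a rough sketch of how base points, time-per-frame, and PPD relate under a quick-return-style bonus. The constant `k`, the timeout, and the frame count below are illustrative placeholders, not this project's actual settings:

```python
import math

def estimated_ppd(base_points, tpf_seconds, frames=100,
                  timeout_days=2.0, k=0.75):
    """Rough PPD estimate with a quick-return-style bonus.

    final_points = base_points * max(1, sqrt(k * timeout / elapsed)),
    PPD = final_points / elapsed_days.
    frames, timeout_days, and k are illustrative placeholders.
    """
    elapsed_days = tpf_seconds * frames / 86400.0
    bonus = max(1.0, math.sqrt(k * timeout_days / elapsed_days))
    return base_points * bonus / elapsed_days

def base_points_for_target_ppd(target_ppd, tpf_seconds, **kwargs):
    """PPD is linear in base points, so invert by scaling."""
    return target_ppd / estimated_ppd(1.0, tpf_seconds, **kwargs)

# e.g. what base points would put a card at ~300,000 PPD
# if it completes a frame every 90 seconds?
needed = base_points_for_target_ppd(300_000, tpf_seconds=90)
```

The useful property is that PPD is linear in base points, so once a 980's time-per-frame is measured, hitting a 300,000 PPD target is a single division.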
We're currently working on relocating EGFR (with the 980 in it) because it isn't working where it is.
Was that the one in my office? If so, sorry! I reinstalled Linux but am not sure I managed to get CUDA7 installed or accounts set up.
There is also an AMD card in it, which could be problematic. Might need to take that out.
Rebenchmarking can obviously wait until Monday!
OK, there's no CUDA there. I can install it and fiddle with the GPUs on Monday.
Might be easier just to put the GTX-980 into `csk` and take out the GTX-Titan if needed.
Ok, that works too!
Actually, we may be able to use the `gtx980` node in the cluster to benchmark. You can get an interactive session with

```
qsub -I -l walltime=04:00:00,nodes=1:ppn=1:gpus=1:shared:gtx980 -l mem=4G -q active
```

and copy the `fah-*` directory from `csk` we've been using to benchmark. I think that version (not sure if it was `centos` or `ubuntu`) works on the cluster without running it in `docker`.
Also, I've moved the GPU, but `nvidia-smi` doesn't show it. Will troubleshoot on Monday.
I can't get either `FAHClient` to run on `hal`.
Shoot, my bad. Try this one from the cluster: `/cbio/jclab/home/chodera/fah-client`.
I just confirmed that this does indeed work for me on a GPU node.
I can get it to run on a 680 but not a 980.
Whoops, it looks like you ran it in my directory instead of copying the directory to your own home directory. Can you do

```
chmod g+w -R /cbio/jclab/home/chodera/fah-client
```
Thanks!
Hm. It isn't recognizing the GPU. I wonder if this is because there's a thread-exclusive process locking gpu device 0.
The failure to assign to the GTX-980 may be an assignment server issue, though that is puzzling...
OK, the issue is with not detecting the GPU correctly. I am guessing it is getting confused by the exclusive lock on GPU 0, but even setting `CUDA_VISIBLE_DEVICES` and playing with the GPU device id in `config.xml` doesn't help:
```
01:10:19:ERROR:Exception: GPU 2 not found
01:10:19:ERROR:  At: src/fah/client/slot/Slot.cpp:87:init()
01:10:19:ERROR:  #1 0x006d3edc in ?? at chodera/fah-client-2/FAHClient:0
01:10:19:ERROR:  #2 0x006ddddf in ?? at chodera/fah-client-2/FAHClient:0
01:10:19:ERROR:  #3 0x006846cb in ?? at chodera/fah-client-2/FAHClient:0
01:10:19:ERROR:  #4 0x00677e84 in ?? at chodera/fah-client-2/FAHClient:0
01:10:19:ERROR:  #5 0x006782a7 in ?? at chodera/fah-client-2/FAHClient:0
01:10:19:ERROR:  #6 0x3d5e81ed5d in ??
01:10:19:ERROR:  #7 0x00677d29 in ?? at chodera/fah-client-2/FAHClient:0
01:10:19:ERROR:Caught at: src/fah/client/slot/SlotManager.cpp:302:init()
01:10:19:ERROR:No valid folding configuration
01:10:19:ERROR:No compute devices matched GPU #2 UNSUPPORTED: 0x0000:0x0000. You may need to update your graphics drivers.
01:10:19:FS02:Set client configured
```
Not sure if this is easy to fix.
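For what it's worth, the renumbering that `CUDA_VISIBLE_DEVICES` applies may explain why fiddling with the device id doesn't help: CUDA exposes only the listed physical devices, renumbered from 0 in the order given, so a slot configured for "GPU 2" has nothing to match if fewer than three devices are visible. A toy model of that renumbering (not actual driver code):

```python
def visible_device_map(cuda_visible_devices):
    """Toy model of CUDA's device renumbering.

    With CUDA_VISIBLE_DEVICES set, only the listed physical devices
    are exposed, renumbered from 0 in the order given; any other
    logical index simply does not exist inside the process.
    Returns {logical_index: physical_index}, or None when the
    variable is unset (all devices visible, indices unchanged).
    """
    if cuda_visible_devices is None:
        return None
    physical = [int(tok) for tok in cuda_visible_devices.split(",")
                if tok.strip()]
    return dict(enumerate(physical))

# CUDA_VISIBLE_DEVICES=1 exposes one device as logical GPU 0,
# so asking for GPU 2 inside that process can never succeed.
```

So if the exclusive lock on GPU 0 forces us to hide it via `CUDA_VISIBLE_DEVICES`, the client's `config.xml` would need the *logical* index (0), not the physical one.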
Bug report filed here: https://github.com/FoldingAtHome/fah-client-pub/issues/1149
Project 11410 WUs are being given out with the wrong base points. Not sure why.
I asked slack donors to stop using project keys 11410 and 11411 so we can figure out what is going on.
We'll wait until the morning to purge the existing jobs and restart to give the donors time to upload WUs in progress and get credit.