jchodera opened 9 years ago
Is there an action I should be taking here? The slack comments seem to suggest we've had similar issues with other projects; is it possible that our benchmark calculation simply needs to be recalibrated to be consistent with what other users expect to see on their own GPUs?
The problem I am worried about is that the points they're seeing in the distributed WUs are wrong.
I'm going to do the following to purge all the jobs:

```
server2/config.xml
rm -rf data/SVR2359493873/PROJ*
server2/config.xml
```
I'm seeing this in the logs, but I'm not sure how important this is:
```
0:17:52:47:I3:Connecting to 140.163.4.241:8084
0:17:52:47:E :Exception: SSL connect failed: ERROR_SSL, error:14094410:SSL routines:SSL3_READ_BYTES:sslv3 alert handshake failure
```
Jobs are being regenerated, but this SSL issue might be problematic. I have an email out to Joseph Coffland (who wrote the work server code) and Rick Knospler (our NJDC hardware guy who handles basic OS operations and SSL certs).
We're back up and running again. The SSL issue was a red herring.
People are still complaining about low PPD.
Can you guys try benchmarking on a machine containing a GTX-980, editing the `benchmark.py` script to say that a GTX-980 should get on average 300,000 PPD?
There's a GTX-980 in the machine in my office, or on my desk, if you need to steal it. Not sure if we have another dev box with a 980.
I'm wondering if benchmarking with the 780 is just really wonky and atypical.
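As a sanity check on what base points a rebenchmark would need to produce, here's a rough sketch of how base points, time-per-frame, and PPD relate under a quick-return-style bonus. The constant `k`, the timeout, and the frame count below are illustrative placeholders, not this project's actual settings:

```python
import math

def estimated_ppd(base_points, tpf_seconds, frames=100,
                  timeout_days=2.0, k=0.75):
    """Rough PPD estimate with a quick-return-style bonus.

    final_points = base_points * max(1, sqrt(k * timeout / elapsed)),
    PPD = final_points / elapsed_days.
    frames, timeout_days, and k are illustrative placeholders.
    """
    elapsed_days = tpf_seconds * frames / 86400.0
    bonus = max(1.0, math.sqrt(k * timeout_days / elapsed_days))
    return base_points * bonus / elapsed_days

def base_points_for_target_ppd(target_ppd, tpf_seconds, **kwargs):
    """PPD is linear in base points, so invert by scaling."""
    return target_ppd / estimated_ppd(1.0, tpf_seconds, **kwargs)

# e.g. what base points would put a card at ~300,000 PPD
# if it completes a frame every 90 seconds?
needed = base_points_for_target_ppd(300_000, tpf_seconds=90)
```

The useful property is that PPD is linear in base points, so once a 980's time-per-frame is measured, hitting a 300,000 PPD target is a single division.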
We're currently working on relocating EGFR (with the 980 in it) because it isn't working where it is.
Was that the one in my office? If so, sorry! I reinstalled Linux but am not sure I managed to get CUDA7 installed or accounts set up.
There is also an AMD card in it, which could be problematic. Might need to take that out.
Rebenchmarking can obviously wait until Monday!
OK, there's no CUDA there. I can install it and fiddle with the GPUs on Monday.
Might be easier just to put the GTX-980 into `csk` and take out the GTX-Titan if needed.
Ok, that works too!
Actually, we may be able to use the `gtx980` node in the cluster to benchmark. You can get an interactive session with

```
qsub -I -l walltime=04:00:00,nodes=1:ppn=1:gpus=1:shared:gtx980 -l mem=4G -q active
```

and copy the `fah-*` directory from `csk` we've been using to benchmark. I think that version (not sure if it was `centos` or `ubuntu`) works on the cluster without running it in `docker`.
Also, I've moved the GPU, but `nvidia-smi` doesn't show it. Will troubleshoot on Monday.
I can't get either `FAHClient` to run on `hal`.
Shoot, my bad. Try this one from the cluster: `/cbio/jclab/home/chodera/fah-client`.
I just confirmed that this does indeed work for me on a GPU node.
I can get it to run on a 680 but not a 980.
Whoops, it looks like you ran it in my directory instead of copying the directory to your own home directory. Can you do

```
chmod g+w -R /cbio/jclab/home/chodera/fah-client
```
Thanks!
Hm. It isn't recognizing the GPU. I wonder if this is because there's a thread-exclusive process locking gpu device 0.
The failure to assign to the GTX-980 may be an assignment server issue, though that is puzzling...
OK, the issue is with not detecting the GPU correctly. I am guessing it is getting confused by the exclusive lock on GPU 0, but even setting `CUDA_VISIBLE_DEVICES` and playing with the GPU device id in `config.xml` doesn't help:
```
01:10:19:ERROR:Exception: GPU 2 not found
01:10:19:ERROR:  At: src/fah/client/slot/Slot.cpp:87:init()
01:10:19:ERROR:  #1 0x006d3edc in ?? at chodera/fah-client-2/FAHClient:0
01:10:19:ERROR:  #2 0x006ddddf in ?? at chodera/fah-client-2/FAHClient:0
01:10:19:ERROR:  #3 0x006846cb in ?? at chodera/fah-client-2/FAHClient:0
01:10:19:ERROR:  #4 0x00677e84 in ?? at chodera/fah-client-2/FAHClient:0
01:10:19:ERROR:  #5 0x006782a7 in ?? at chodera/fah-client-2/FAHClient:0
01:10:19:ERROR:  #6 0x3d5e81ed5d in ??
01:10:19:ERROR:  #7 0x00677d29 in ?? at chodera/fah-client-2/FAHClient:0
01:10:19:ERROR:Caught at: src/fah/client/slot/SlotManager.cpp:302:init()
01:10:19:ERROR:No valid folding configuration
01:10:19:ERROR:No compute devices matched GPU #2 UNSUPPORTED: 0x0000:0x0000. You may need to update your graphics drivers.
01:10:19:FS02:Set client configured
```
Not sure if this is easy to fix.
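For what it's worth, the renumbering that `CUDA_VISIBLE_DEVICES` applies may explain why fiddling with the device id doesn't help: CUDA exposes only the listed physical devices, renumbered from 0 in the order given, so a slot configured for "GPU 2" has nothing to match if fewer than three devices are visible. A toy model of that renumbering (not actual driver code):

```python
def visible_device_map(cuda_visible_devices):
    """Toy model of CUDA's device renumbering.

    With CUDA_VISIBLE_DEVICES set, only the listed physical devices
    are exposed, renumbered from 0 in the order given; any other
    logical index simply does not exist inside the process.
    Returns {logical_index: physical_index}, or None when the
    variable is unset (all devices visible, indices unchanged).
    """
    if cuda_visible_devices is None:
        return None
    physical = [int(tok) for tok in cuda_visible_devices.split(",")
                if tok.strip()]
    return dict(enumerate(physical))

# CUDA_VISIBLE_DEVICES=1 exposes one device as logical GPU 0,
# so asking for GPU 2 inside that process can never succeed.
```

So if the exclusive lock on GPU 0 forces us to hide it via `CUDA_VISIBLE_DEVICES`, the client's `config.xml` would need the *logical* index (0), not the physical one.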
Bug report filed here: https://github.com/FoldingAtHome/fah-client-pub/issues/1149
Project 11410 WUs are being given out with the wrong base points. Not sure why.
I asked slack donors to stop using project keys 11410 and 11411 so we can figure out what is going on.
We'll wait until the morning to purge the existing jobs and restart to give the donors time to upload WUs in progress and get credit.