facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

ESMFold crashing with specific FASTA files #698

Open npcooley opened 2 months ago

npcooley commented 2 months ago

Bug description I am deploying ESMFold on the Open Science Pool, and some sets of FASTA files seem to always crash; they also crash when run locally in a Docker container. This doesn't appear to be a resource issue, but that can be difficult to tell from the Condor logs.

Reproduction steps Running ESMFold from a Docker container, with a command that generically looks like:

```
conda run -n py39-esmfold esm-fold -i <seqs.fa> -o <you/can/send/this/wherever> -m <some/mounted/volume> --cpu-only > result.txt
```

Expected behavior This command completes cleanly for some input files but not others. When run interactively, it seems to fail uniformly with the error pasted below. I've attached two FASTA files: one for which the command runs cleanly, and one for which it fails.
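One quick way to compare the passing and failing inputs is to check the sequence lengths in each file, since ESMFold's memory use grows steeply with sequence length; an unusually long record in the failing file would point at memory pressure. A minimal sketch, assuming plain single- or multi-line FASTA records (`fasta_lengths` is a hypothetical helper, not part of esm):

```shell
# Print "<header> <length>" for each record in a FASTA file.
fasta_lengths() {
  awk '/^>/ {if (name) print name, len; name=substr($0, 2); len=0; next}
       {len += length($0)}
       END {if (name) print name, len}' "$1"
}

# Example with a tiny inline FASTA:
printf '>seq1\nMKV\nLLT\n>seq2\nGG\n' > /tmp/demo.fa
fasta_lengths /tmp/demo.fa
# prints:
# seq1 6
# seq2 2
```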

Logs Failure output, for an interactive job:

```
(base) root@6843f707dd31:/# conda run -n py39-esmfold esm-fold -i UserData/id0001partners00002.fa -o . -m ESModels --cpu-only
24/07/26 14:46:28 | INFO | root | Reading sequences from UserData/id0001partners00002.fa
24/07/26 14:46:28 | INFO | root | Loaded 2 sequences from UserData/id0001partners00002.fa
24/07/26 14:46:28 | INFO | root | Loading model
24/07/26 14:48:03 | INFO | root | Starting Predictions
/tmp/tmplazwrhgo: line 3: 39 Killed esm-fold -i UserData/id0001partners00002.fa -o . -m ESModels --cpu-only

ERROR conda.cli.main_run:execute(125): conda run esm-fold -i UserData/id0001partners00002.fa -o . -m ESModels --cpu-only failed. (See above for error)
```
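For context on reading that failure: a bare `Killed` with no Python traceback means the process received SIGKILL (signal 9) from outside, most commonly from the kernel OOM killer or the batch system, and the shell reports such deaths as exit status 128 + signal number. A quick sanity check of those conventions (the `dmesg` line is only a hint; kernel log access varies by site):

```shell
# Name of signal 9, as reported by the shell.
kill -l 9
# The exit status a SIGKILLed command produces (128 + 9).
echo $((128 + 9))
# On a host with kernel log access, OOM kills are usually visible via:
#   dmesg | grep -iE 'out of memory|oom-killer'
```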

Success output, for an interactive job:

```
(base) root@6843f707dd31:/# conda run -n py39-esmfold esm-fold -i UserData/id0001partners00011.fa -o . -m ESModels --cpu-only
24/07/26 15:14:04 | INFO | root | Reading sequences from UserData/id0001partners00011.fa
24/07/26 15:14:04 | INFO | root | Loaded 2 sequences from UserData/id0001partners00011.fa
24/07/26 15:14:04 | INFO | root | Loading model
24/07/26 15:16:31 | INFO | root | Starting Predictions
24/07/26 15:30:19 | INFO | root | Predicted structure for 1_1_3688 with length 335, pLDDT 91.6, pTM 0.726 in 414.3s (amortized, batch size 2). 1 / 2 completed.
24/07/26 15:30:19 | INFO | root | Predicted structure for 2_1_23 with length 335, pLDDT 91.8, pTM 0.719 in 414.3s (amortized, batch size 2). 2 / 2 completed.
```


Additional context Technically, when these jobs run on the OSPool they run in Singularity containers rather than Docker containers, though I don't know how much that matters. I also get different kill signals on the OSPool, though that could be site specific; e.g., when I interrogate my Condor logs for jobs that Condor believes did not go over memory, I get:

```
$ cat LogFilesCB/out.2.err
/srv/tmpgo8p_04o: line 3: 34 Killed esm-fold -i id0001partners00002.fa -o structs -m ESModels --cpu-only
ERROR conda.cli.main_run:execute(125): conda run esm-fold -i id0001partners00002.fa -o structs -m ESModels --cpu-only failed. (See above for error)

$ cat LogFilesCB/out.137.err
/srv/tmpdklsx5w4: line 3: 34 Bus error (core dumped) esm-fold -i id0001partners00137.fa -o structs -m ESModels --cpu-only
ERROR conda.cli.main_run:execute(125): conda run esm-fold -i id0001partners00137.fa -o structs -m ESModels --cpu-only failed. (See above for error)
/srv//Run.sh: line 35: 24 Bus error (core dumped) conda run -n py39-esmfold esm-fold -i "$TARGET" -o structs -m ESModels --cpu-only > "$INFILE1"

$ cat LogFilesCB/out.173.err
/srv/tmp0pg3lc07: line 3: 36 Killed esm-fold -i id0001partners00173.fa -o structs -m ESModels --cpu-only
ERROR conda.cli.main_run:execute(125): conda run esm-fold -i id0001partners00173.fa -o structs -m ESModels --cpu-only failed. (See above for error)
```
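Note that `Bus error (core dumped)` is a different failure mode from the SIGKILL-style `Killed`: SIGBUS often means a memory-mapped file (model weights are a plausible candidate) could not be paged in, which can happen when the filesystem backing the scratch directory or container overlay runs out of space mid-run. A couple of quick checks to run inside the container (a sketch; the mount points are guesses and vary by site):

```shell
# Free space on filesystems likely to back mmapped or temporary files.
df -h /tmp /dev/shm . 2>/dev/null
# Resource limits imposed on the job (address space, file size, etc.).
ulimit -a
```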

The '00002' and '00011' FASTA files are attached as '.txt' files because of file-extension restrictions. id0001partners00002.txt id0001partners00011.txt

EDIT One additional piece of context: when these jobs complete successfully, CPU usage is near the maximum possible for the requested resource. When they fail like this, CPU usage is minimal.
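Given the minimal CPU usage on failures, it may be worth sampling the process's resident memory while it runs, to see whether it balloons during model loading and gets reaped before predictions start. A hedged sketch using Linux `/proc` (the `sleep 6` is a stand-in; substitute the actual `conda run ... esm-fold ...` command):

```shell
# Stand-in workload; replace with, e.g.:
#   conda run -n py39-esmfold esm-fold -i seqs.fa -o out -m ESModels --cpu-only
sleep 6 &
pid=$!

# Sample resident set size every 2 s until the process exits.
while kill -0 "$pid" 2>/dev/null; do
  grep VmRSS "/proc/$pid/status" 2>/dev/null
  sleep 2
done
wait "$pid" 2>/dev/null
```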