Rostlab / SeqVec

Modelling the Language of Life - Deep Learning Protein Sequences
http://embed.protein.properties

Out of memory error for >50k sequences #10

Closed mortonjt closed 4 years ago

mortonjt commented 4 years ago

It seems that there are some memory issues when trying to process a large dataset; below is an example of such an error.

 92%|█████████▏| 63995/69882 [27:32<08:01, 12.22it/s]/cm/local/apps/slurm/var/spool/job530579/slurm_script: line 26: 1295955 Killed                  seqvec -i $in_file -o $results_dir/embeddings.npz --protein True --id -1
slurmstepd: error: Detected 1 oom-kill event(s) in step 530579.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

Adjusting the batch size appears to make it worse (--batch-size=10 makes this fail at 40%). Splitting the files into 10k chunks helps, but doesn't fully resolve the issue (the error above was generated after doing that).
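
For reference, the chunking can be scripted along these lines (a minimal sketch; the split_fasta helper, paths, and chunk size are illustrative and not part of seqvec):

```python
# Split a large FASTA file into chunks of at most `chunk_size` sequences,
# so that each chunk can be passed to seqvec separately.
from pathlib import Path

def split_fasta(in_path, out_dir, chunk_size=10_000):
    """Write chunk_000.fasta, chunk_001.fasta, ... with at most `chunk_size` sequences each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunk_lines, n_in_chunk, chunk_idx = [], 0, 0

    def flush():
        nonlocal chunk_lines, n_in_chunk, chunk_idx
        if chunk_lines:
            (out_dir / f"chunk_{chunk_idx:03d}.fasta").write_text("".join(chunk_lines))
            chunk_lines, n_in_chunk, chunk_idx = [], 0, chunk_idx + 1

    with open(in_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if n_in_chunk == chunk_size:
                    flush()  # previous chunk is full; start a new one
                n_in_chunk += 1
            chunk_lines.append(line)
    flush()  # write the final, possibly smaller chunk

split_fasta("sequences.fasta", "chunks", chunk_size=10_000)
```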

sacdallago commented 4 years ago

This issue is twofold:

  1. There might be too many sequences in the FASTA file, and the node on which this is being computed runs out of main RAM (not GPU RAM). Solution: chop up the FASTA file into smaller chunks.
  2. There might be sequences in your FASTA file which are too long to be processed in GPU memory. For these, falling back to the CPU might be a solution (see the sketch below this list), with the inherent limitation that it will be much slower (it can take up to days for a single, very long sequence). Alternatively, you can chop up long sequences into smaller parts, but this might introduce other unwanted effects.
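
For the CPU fallback, a minimal sketch following the ElmoEmbedder usage pattern from the SeqVec README, assuming the uniref50_v2 model files (options.json, weights.hdf5) have been downloaded; passing cuda_device=-1 keeps the computation on the CPU:

```python
# Sketch: embed a single very long sequence on the CPU. Model file names
# follow the uniref50_v2 download used by SeqVec; adjust paths to your setup.
from pathlib import Path
from allennlp.commands.elmo import ElmoEmbedder

model_dir = Path("uniref50_v2")
embedder = ElmoEmbedder(
    options_file=str(model_dir / "options.json"),
    weight_file=str(model_dir / "weights.hdf5"),
    cuda_device=-1,  # -1 = run on CPU instead of GPU
)

long_seq = "MKTAYIAKQR"  # placeholder; in practice a very long (>15k residue) sequence
embedding = embedder.embed_sentence(list(long_seq))  # numpy array, shape (3, L, 1024)
per_protein = embedding.sum(axis=0).mean(axis=0)     # one common 1024-d per-protein reduction
```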

To "solve" both issues, we currently plan to implement a data-preparation step which will re-sort the input FASTA from short to long sequences, then chop the computation into chunks of 5k sequences and outsource the computation of sequences above 15k AA to the CPU.
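
A rough sketch of what such a preparation step could look like (the 5k and 15k thresholds are taken from the plan above; the prepare/read_fasta helpers and output file names are hypothetical, not part of seqvec):

```python
# Hypothetical preparation step: sort sequences from short to long, divert
# sequences above `cpu_threshold` residues into a separate file for CPU
# processing, and split the rest into chunks of `chunk_size` sequences.
from pathlib import Path

def read_fasta(path):
    """Return a list of (header, sequence) tuples."""
    records, header, parts = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(parts)))
                header, parts = line, []
            elif line:
                parts.append(line)
    if header is not None:
        records.append((header, "".join(parts)))
    return records

def prepare(in_path, out_dir, chunk_size=5_000, cpu_threshold=15_000):
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    records = sorted(read_fasta(in_path), key=lambda rec: len(rec[1]))
    gpu = [rec for rec in records if len(rec[1]) <= cpu_threshold]
    cpu = [rec for rec in records if len(rec[1]) > cpu_threshold]

    for i in range(0, len(gpu), chunk_size):
        chunk = gpu[i:i + chunk_size]
        out_file = out_dir / f"gpu_chunk_{i // chunk_size:03d}.fasta"
        out_file.write_text("".join(f"{h}\n{s}\n" for h, s in chunk))

    if cpu:
        (out_dir / "cpu_only.fasta").write_text("".join(f"{h}\n{s}\n" for h, s in cpu))

prepare("sequences.fasta", "prepared")
```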

This will most likely be implemented in the "pipeline" https://github.com/sacdallago/bio_embeddings, rather than in this codebase, which is kept a bit more flexible.

mheinzinger commented 4 years ago

I think there might also be a misunderstanding about the --batch-size parameter: it gives the number of residues that are accumulated in a single batch before being embedded. As we sort sequences by length, this means that we create larger batches at the beginning (shortest sequences) and smaller batches towards the end of your dataset. That being said, setting --batch-size=10 should lead to single-sequence processing, as your proteins should be longer than 10 residues. If you still run out of memory with this setting, you can proceed as Chris pointed out: remove long sequences (e.g. >15k residues) from your set for the moment and embed them separately, and/or create even smaller chunks of e.g. 5k proteins.
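
To illustrate the semantics, here is a toy model of the batching logic described above (not the actual seqvec implementation): sequences sorted by length are accumulated into a batch until the residue budget given by --batch-size is reached.

```python
# Toy illustration of batching by residue count: short sequences yield large
# batches, long sequences end up (nearly) alone in their batch.
def make_batches(sequences, batch_size):
    """sequences: list of amino-acid strings; batch_size: residue budget per batch."""
    batches, current, residues = [], [], 0
    for seq in sorted(sequences, key=len):
        if current and residues + len(seq) > batch_size:
            batches.append(current)
            current, residues = [], 0
        current.append(seq)
        residues += len(seq)
    if current:
        batches.append(current)
    return batches

seqs = ["MKV", "MKVLAT", "MKVLATGKK", "M" * 200]
for batch in make_batches(seqs, batch_size=10):
    print([len(s) for s in batch])
# Prints [3, 6], [9], [200]: with a budget of 10 residues, anything longer than
# 10 residues is processed on its own, mirroring why --batch-size=10 amounts to
# single-sequence processing for typical proteins.
```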