Closed: mortonjt closed this issue 4 years ago.
This issue is twofold.
To "solve" both issues, we currently plan to implement a data-preparation step that re-sorts the input FASTA from short to long sequences, splits the computation into chunks of 5k sequences, and offloads sequences longer than 15k amino acids to the CPU.
This will most likely be implemented in the "pipeline" https://github.com/sacdallago/bio_embeddings, rather than in this codebase, which is kept a bit more flexible.
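As a rough illustration of what such a preparation step could look like, here is a minimal sketch using Biopython. The chunk size, length cutoff, and file names are only illustrative assumptions, not the actual bio_embeddings implementation:

```python
from Bio import SeqIO

CHUNK_SIZE = 5000    # sequences per chunk, as proposed above (assumed value)
CPU_CUTOFF = 15000   # residues; longer sequences go to a CPU-only set (assumed value)

# Sort all input sequences from short to long
records = sorted(SeqIO.parse("input.fasta", "fasta"), key=lambda r: len(r.seq))

# Separate the very long sequences that should be embedded on CPU
gpu_records = [r for r in records if len(r.seq) <= CPU_CUTOFF]
cpu_records = [r for r in records if len(r.seq) > CPU_CUTOFF]
SeqIO.write(cpu_records, "cpu_only.fasta", "fasta")

# Write the remaining, length-sorted sequences in chunks of 5k
for i in range(0, len(gpu_records), CHUNK_SIZE):
    chunk = gpu_records[i:i + CHUNK_SIZE]
    SeqIO.write(chunk, f"chunk_{i // CHUNK_SIZE:04d}.fasta", "fasta")
```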
I think there might also be a misunderstanding about the --batch-size parameter: it gives the number of residues that are accumulated in a single batch before getting embedded. As we sort sequences by length, this means that we create larger batches at the beginning of your dataset (shortest sequences) and smaller batches towards the end. That being said, setting --batch-size=10 should lead to single-sequence processing, as your proteins should be longer than 10 residues. If you still run out of memory with this setting, you can proceed as Chris pointed out: remove long sequences (e.g. >15k residues) from your set for the moment and embed them separately, and/or create even smaller chunks of e.g. 5k proteins.
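For intuition, the residue-based batching described above behaves roughly like the following simplified sketch (this is an illustration of the idea, not the exact code in this repository):

```python
def batch_by_residues(sequences, max_residues):
    """Greedily accumulate length-sorted sequences until the residue budget is hit.

    With max_residues=10 and proteins longer than 10 residues, every batch
    degenerates to a single sequence, i.e. single-sequence processing.
    """
    batch, used = [], 0
    for seq in sorted(sequences, key=len):
        if batch and used + len(seq) > max_residues:
            yield batch
            batch, used = [], 0
        batch.append(seq)
        used += len(seq)
    if batch:
        yield batch

# Example: short sequences get grouped, a long one ends up alone
for b in batch_by_residues(["MKV", "MKVL", "MKVLAAGGLLK" * 3], max_residues=12):
    print([len(s) for s in b])   # -> [3, 4] then [33]
```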
It seems that there are some memory issues when trying to process large datasets; below is an example of such an error.
Adjusting the batch size appears to make it worse (--batch-size=10 makes this fail at 40%). Splitting up the files into 10k chunks helps, but doesn't quite resolve this (the error above was generated doing that).