bonsai-team / Porechop_ABI

Adapter trimmer for Oxford Nanopore reads using ab initio method
GNU General Public License v3.0

Does it load all sequences into RAM like Porechop? #8

Closed · alexyfyf closed this issue 9 months ago

alexyfyf commented 1 year ago

Hi,

Very interesting tool. I'm still trying to run it, but I want to check: does it load all sequences into RAM like Porechop does? And how is the performance on, say, 50M ONT reads?

qbonenfant commented 1 year ago

Hi, I apologise for the delayed response, and for what may be an unsatisfactory answer.

In short: it does, but performance will mainly depend on your read lengths.

Complete answer:

Sequence storage and memory usage:

The core module of Porechop_ABI is based on SeqAn 2.4, and uses a SeqFileIn class to
infer the file format and load the reads into the appropriate SeqAn StringSet type using readRecords().
The "records" are then referenced for further processing.

As SeqAn uses templates extensively, and each of them can be specialised in various ways,
it can be really hard to track down exactly what it does and how it is done.

From the documentation, it seems the String specialisation we use in our
StringSet stores each base contiguously in RAM.
This means the whole dataset has to be held in memory at some point.

Note that this will be a lot smaller than the uncompressed file size,
as SeqAn only uses the amount of memory it needs to store the data.
The Dna5 specialisation we use requires only 3 bits per base
(2 bits for 'ATCG', plus an additional bit for the 'N'),
instead of the 8 bits of a standard char.
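
To illustrate that packing (assuming SeqAn's Packed string specialisation here; the exact specialisation in the Porechop_ABI source may differ):

```cpp
#include <seqan/sequence.h>
#include <iostream>

using namespace seqan;

int main()
{
    // Plain Dna5 string: one value per element.
    Dna5String plain = "ACGTN";

    // Packed specialisation: values are bit-packed into machine words,
    // so each Dna5 value occupies only BitsPerValue<Dna5>::VALUE == 3 bits.
    String<Dna5, Packed<> > packed = "ACGTN";

    // Packed strings still behave like ordinary SeqAn strings.
    appendValue(packed, Dna5('A'));
    std::cout << length(packed) << std::endl;  // prints 6

    return 0;
}
```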

That's why performance on a 50M read dataset depends on
the total number of bases rather than the read count alone.
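
As a rough back-of-the-envelope illustration (the 1 kb average read length here is purely an assumption): 50M reads × 1,000 bases ≈ 5 × 10^10 bases, and at 3 bits per base that is about 5 × 10^10 × 3 / 8 bytes ≈ 19 GB for the packed sequences alone, before any other overhead.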

I feel the core module should be able to cope with such a dataset if given enough RAM (32 GB would be a good start). I would be glad to try it for you, but I will need more details on your dataset to perform such a test.

alexyfyf commented 1 year ago

Thanks for your reply. I was trying it on some pooled ONT data from SG-NEx (https://github.com/GoekeLab/sg-nex-data). I want to test on a big dataset for some downstream analysis, but the first step is to run Porechop or something similar to remove the adapters and split chimeric reads. Also, I'm running on an HPC cluster with Slurm management, so in theory I can access a relatively large amount of RAM and a large number of cores. But 50M reads with Porechop failed with an OOM error using 48 cores and 4 GB per core. I'll test your tool later.

qbonenfant commented 1 year ago

I see. As explained previously, our core module stores the dataset in a relatively compact manner, but that is absolutely not the case for the "legacy" part of Porechop: all sequences are stored, full length, in RAM as Python strings, which may not play well even with a highly capable system. I do know from experience that 5M reads with an N50 around 2k bases will work on a good "home computer", but I never tried anything bigger than the datasets presented in our article.

We designed our program so it can process anything the original Porechop implementation can, as the two are highly interdependent. Testing its limits outside of this frame is uncharted territory, but I may have some suggestions.

qbonenfant commented 9 months ago

Since this issue does not seem to require further comments from either side, I will now close it. Thanks for your interest in our tool.