Hi, I apologise for the delayed response, and for what may be an unsatisfactory answer.
In short: it does, but performance mainly depends on your read size.
Complete answer:
Sequence storage and memory usage:
The core module of Porechop_ABI is based on SeqAn 2.4. It uses the SeqFileIn class to infer the file format and load the reads into the appropriate SeqAn StringSet via readRecords().
The "records" are then referenced for further processing.
As SeqAn uses templates extensively, and each of them can be specialised in various ways, it can be hard to track down exactly what it does and how it is done.
From the documentation, the String specialisation we use in our StringSet stores the bases contiguously in RAM, which means the whole dataset has to be held in memory at some point.
Note that this will be a lot smaller than the uncompressed file size, as SeqAn only uses the amount of memory it needs to store the data: the Dna5 specialisation we use requires only 3 bits per base (2 bits for 'ATCG', plus an additional bit for the 'N'), instead of the 8 bits of a standard char.
That is why performance on a 50M-read dataset depends on the total number of bases rather than on the read count alone.
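For illustration, here is a minimal sketch of that loading pattern with SeqAn 2.x. It is not the actual Porechop_ABI source; the file name and the exact string specialisation are placeholder assumptions.

```cpp
// Minimal sketch (not the Porechop_ABI code itself): load a FASTQ file with
// SeqAn 2.x into a StringSet of Dna5 strings, as described above.
#include <iostream>
#include <seqan/seq_io.h>

using namespace seqan;

int main()
{
    // SeqFileIn infers the file format (FASTA/FASTQ, possibly compressed).
    SeqFileIn seqFileIn("reads.fastq");   // placeholder path

    StringSet<CharString> ids;
    StringSet<Dna5String> seqs;           // Dna5 alphabet: A, C, G, T, N
    StringSet<CharString> quals;

    // readRecords() loads the whole dataset into memory at once.
    readRecords(ids, seqs, quals, seqFileIn);

    // Memory usage is driven by the total number of bases, not the read count.
    uint64_t totalBases = 0;
    for (unsigned i = 0; i < length(seqs); ++i)
        totalBases += length(seqs[i]);

    std::cout << length(seqs) << " reads, " << totalBases << " bases\n";
    return 0;
}
```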
I feel like the core module should be able to cope with such a dataset if given enough RAM (32 GB would be a good start). I would be glad to try it for you, but I will need more details on your dataset to perform such a test.
Thanks for your reply. I was trying on some pooled ONT data from SG-NEx https://github.com/GoekeLab/sg-nex-data. I want to test on a big dataset for some downstream analysis, but the first step is to run Porechop or something similar to remove the adapters and cut chimeric reads. Also, I'm running on an HPC with Slurm management, so in theory I can access relatively large amounts of RAM and many cores. Still, 50M reads with Porechop failed with an OOM error using 48 cores and 4 GB per core. I'll test your tool later.
I see. As explained previously, our core module stores the dataset in a relatively compact manner, but that is absolutely not the case for the "legacy" part of Porechop: all sequences are stored, full length, in RAM as Python strings, which may not play well even with a highly capable system. I do know from experience that 5M reads with an N50 around 2 kb will work on a good "home computer", but I never tried anything bigger than the datasets presented in our article. We designed our program so it can process anything the original Porechop implementation can, as they are highly interdependent. Testing its limits outside of this frame is uncharted territory, but I may have some suggestions:
If you need to clean your file using alignment-based trimming tools, you may want/need to split your dataset into smaller files and re-concatenate them afterwards (a rough sketch of the splitting step is given after this list).
If your dataset is a pool of several datasets, you may want to split it back for preliminary analysis anyway, because adapter sequences may not be the same between files. This may be an issue for our tool, especially if the file is partially trimmed (likely by the basecaller). Weak and sparse signal is not ideal for our adapter inference strategy, which is why we do not recommend running it on barcoded datasets.
By running only our core module (porechop_abi -abi -go ...) on each sub-file and comparing the inferred adapters, you can manually build a custom adapter file that can be applied to all your sub-files. The result should be comparable to running it on the whole file, even if the adapter is not consistent.
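If it helps, here is a rough sketch of the splitting step using SeqAn, reading one record at a time so the whole file never sits in memory. The chunk size and file names are arbitrary choices for illustration; any FASTQ-splitting utility would do the same job.

```cpp
// Rough sketch: split a large FASTQ into chunks of CHUNK_SIZE reads so each
// chunk can be trimmed separately and the results concatenated afterwards.
// Paths and chunk size below are placeholders, not part of Porechop_ABI.
#include <sstream>
#include <seqan/seq_io.h>

using namespace seqan;

int main()
{
    const unsigned CHUNK_SIZE = 1000000;     // reads per output file

    SeqFileIn in("big_dataset.fastq");       // placeholder input path

    CharString id, qual;
    Dna5String seq;
    unsigned readCount = 0, chunkIndex = 0;

    SeqFileOut out;                          // opened once per chunk
    while (!atEnd(in))
    {
        if (readCount % CHUNK_SIZE == 0)
        {
            if (readCount > 0)
                close(out);                  // finish the previous chunk
            std::ostringstream name;
            name << "chunk_" << chunkIndex++ << ".fastq";
            open(out, name.str().c_str());
        }
        readRecord(id, seq, qual, in);       // stream one read at a time
        writeRecord(out, id, seq, qual);
        ++readCount;
    }
    if (readCount > 0)
        close(out);
    return 0;
}
```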
Since this issue does not seem to require more comment from either side, I will now close it. Thanks for your interest in our tool.
Hi,
Very interesting tool. I'm still trying to run it, but I want to check: does it load all sequences into RAM like Porechop? How is the performance on, say, 50M ONT reads?