GoekeLab / m6anet

Detection of m6A from direct RNA-Seq data
https://m6anet.readthedocs.io/
MIT License

Eventalign.txt file is too large #117

Closed by SEVEN-XYCHEN 2 months ago

SEVEN-XYCHEN commented 1 year ago

Hello, I encountered a memory issue. My fastq file is 5.9G, the bam file is 9.3G, the transcript.fa file is 1.8G, and there are 453G of fast5 files in the fast5 folder. I first ran nanopolish index, which completed successfully. I then ran eventalign, but the generated eventalign.txt file is too large: it is already 835G and the run has not finished. Is this reasonable? Is the generated intermediate file too large? Is there any way to reduce it? Thank you very much for taking the time to answer my question! Looking forward to your reply.

chrishendra93 commented 1 year ago

hi @SEVEN-XYCHEN, apologies for the delay in my reply. We do not have any control over the development of nanopolish. That said, m6Anet indexes the nanopolish eventalign.txt file before preprocessing it and accesses only the relevant part of the file at any one time, to prevent memory overflow. So far I've succeeded in running m6anet dataprep on 1 TB of eventalign.txt, but let me know if you encounter any difficulty with it. I usually don't keep the eventalign.txt file for long after preprocessing.
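To illustrate the "index first, then read only the relevant part" idea described above, here is a minimal Python sketch. It is not m6Anet's actual implementation. It assumes eventalign.txt is tab-separated with a header row containing `contig` and `read_index` columns, and that rows for the same read on the same transcript are written consecutively. The function names `build_offset_index` and `read_chunk` are hypothetical.

```python
def build_offset_index(path):
    """Return a list of ((contig, read_index), start_byte, end_byte) entries.

    Each entry covers one consecutive run of rows that share the same
    contig and read_index. Only one line is held in memory at a time.
    """
    index = []
    with open(path, "rb") as handle:
        # Assumption: the first line is a tab-separated header.
        header = handle.readline().rstrip(b"\r\n").split(b"\t")
        contig_col = header.index(b"contig")
        read_col = header.index(b"read_index")

        current_key = None
        run_start = handle.tell()
        while True:
            line_start = handle.tell()
            line = handle.readline()
            if not line:
                break
            fields = line.rstrip(b"\r\n").split(b"\t")
            key = (fields[contig_col].decode(), fields[read_col].decode())
            if key != current_key:
                # Close the previous run before starting a new one.
                if current_key is not None:
                    index.append((current_key, run_start, line_start))
                current_key, run_start = key, line_start
        if current_key is not None:
            index.append((current_key, run_start, handle.tell()))
    return index


def read_chunk(path, start, end):
    """Read only the bytes in [start, end) instead of the whole file."""
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(end - start).decode()
```

With an index like this, memory use depends on the size of the largest single chunk rather than on the total size of eventalign.txt, which is consistent with the point that a 1 TB file can still be preprocessed.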

SEVEN-XYCHEN commented 1 year ago

Hi, @chrishendra93 Thank you very much for your reply, it has been very helpful to me. I will continue to use m6Anet for analysis, which is a great software. Best, Chen

VikArz02 commented 1 year ago

Hello, @chrishendra93!

I have the same issue, but I want to understand how much memory I need for this file. Do you have this information, or a "formula" for calculating it? My eventalign.txt file is already 2TB and the run has not finished.

Characteristics of my files: reads-ref.sorted.bam 3.5 GB, reads.fastq 5.7 GB, fast5 files 169 GB

chrishendra93 commented 1 year ago

hi @VikArz02, I cannot really tell, as it depends on nanopolish eventalign's ability to segment the raw files. If you have high-quality fast5 files, it will be able to resolve most of the segments and your eventalign.txt file might take up a bit more storage space. This will not affect m6Anet's memory requirements, but it will affect its running time, since you'll have more sites to process. This was also raised in #128, which asks for m6Anet to be able to run from a compressed nanopolish eventalign file; we might explore that in a future release.

Thanks!
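As a rough illustration of what reading a compressed eventalign file could look like, here is a minimal Python sketch. It is not an m6Anet feature and it does not use any m6Anet API. It assumes the file was compressed with gzip and keeps the tab-separated header with a `contig` column. The function name `count_events_per_contig` is hypothetical.

```python
import csv
import gzip
from collections import Counter


def count_events_per_contig(path):
    """Stream a gzip-compressed eventalign file and count rows per contig.

    The file is decompressed on the fly, so neither the compressed nor the
    uncompressed content is ever loaded into memory as a whole.
    """
    counts = Counter()
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            counts[row["contig"]] += 1
    return counts
```

One design caveat: a plain gzip stream can only be read from the beginning, so this approach saves disk space but does not allow seeking straight to the rows of a given read.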

lmulroney commented 10 months ago

Hi @chrishendra93, m6Anet has been a nice tool for us to use lately, but the datasets are getting bigger and bigger, and file management is starting to become a problem at the eventalign step. Is there any chance m6anet dataprep could take the output from eventalign through a pipe? That way we would never have to write the eventalign data to disk. This is something that both yanocomp and nanocompore do, and it really helps with space management because the processed eventalign files tend to be one tenth the size of the raw eventalign file. Thanks for continuing active development of m6Anet, and I'm looking forward to seeing what else is done with it!
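The pipe-based workflow requested above can be sketched in Python as follows. This is not how m6anet dataprep, yanocomp, or nanocompore work internally. It only shows the general pattern of collapsing eventalign rows as they arrive on a stream. It assumes tab-separated input with a header containing `contig`, `read_index`, `position`, `event_level_mean`, and `event_length` columns, and that rows for the same read and position are consecutive. The names `summarize_events` and `main` are hypothetical.

```python
import csv
import sys
from itertools import groupby


def summarize_events(stream):
    """Yield (contig, read_index, position, mean_level) from a row stream.

    Consecutive rows that share contig, read_index, and position are
    collapsed into one record whose signal level is the mean of
    event_level_mean weighted by event_length.
    """
    reader = csv.DictReader(stream, delimiter="\t")

    def row_key(row):
        return (row["contig"], row["read_index"], row["position"])

    for (contig, read_index, position), rows in groupby(reader, key=row_key):
        total_length = 0.0
        weighted_sum = 0.0
        for row in rows:
            length = float(row["event_length"])
            total_length += length
            weighted_sum += length * float(row["event_level_mean"])
        # Assumption: event lengths are positive, so total_length > 0.
        yield contig, read_index, int(position), weighted_sum / total_length


def main():
    """Read eventalign rows from stdin and write collapsed rows to stdout."""
    writer = csv.writer(sys.stdout, delimiter="\t", lineterminator="\n")
    for record in summarize_events(sys.stdin):
        writer.writerow(record)
```

If this were saved as a hypothetical script `summarize_events.py` that calls `main()` at the end, it could be used as `nanopolish eventalign ... | python summarize_events.py > summary.tsv`, with the eventalign arguments unchanged from a normal run. Only the collapsed output would then reach the disk.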

jonathangoeke commented 10 months ago

Hi @lmulroney, thanks for your comment! We hope to improve the file handling in a future version, but we don't have a release timeline yet. We will post an update here.