GoekeLab / xpore

Identification of differential RNA modifications from nanopore direct RNA sequencing
https://xpore.readthedocs.io/
MIT License

Typical run time and file size #43

Closed by callumparr 3 years ago

callumparr commented 3 years ago

Do you have any rough estimate for the run time and the expected file sizes?

My hdf5 has been generated and is around 100 GB, but the data.json step has been running for 2 days and the file is creeping up to 600 GB. The eventalign file from nanopolish is close to 300 GB. I am a little worried about what the sysadmin will say. Ideally I would like to keep the files for later in case I need to run diffmod again. Do you normally reduce the number of reads considered per transcript?

The library has around 2M total reads, of which around 1.5M passed basecalling.

Are these file sizes expected?

ploy-np commented 3 years ago

We have improved the xpore-dataprep script so that it runs faster and requires less storage space, but the new version is in the "fastdataprep" branch. In my experiment with the new script, processing 890,267 reads took 4 hours using 1 CPU and 5 GB of memory. For the outputs, it first reduced the 196 GB nanopolish_eventalign.txt to a 26 GB nanopolish.combine file and then created a 3.5 GB data.json.
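For a rough idea of what these figures could mean for a larger library, here is a back-of-the-envelope sketch in Python. It only rescales the numbers quoted above and assumes that run time and output size grow linearly with the number of reads. That linearity is an assumption for illustration, not something measured in this thread, and the function name `estimate` is hypothetical rather than part of xpore.

```python
# Benchmark figures quoted in the comment above (fastdataprep branch).
BENCH_READS = 890_267
BENCH_HOURS = 4.0
BENCH_EVENTALIGN_GB = 196.0
BENCH_COMBINE_GB = 26.0
BENCH_DATA_JSON_GB = 3.5


def estimate(reads):
    """Scale the benchmark to `reads` reads.

    ASSUMPTION: run time and file sizes scale linearly with read count.
    Real behaviour depends on read length, coverage, and the read-count
    filters used, so treat the result as an order-of-magnitude guide only.
    """
    factor = reads / BENCH_READS
    return {
        "hours": BENCH_HOURS * factor,
        "eventalign_gb": BENCH_EVENTALIGN_GB * factor,
        "combine_gb": BENCH_COMBINE_GB * factor,
        "data_json_gb": BENCH_DATA_JSON_GB * factor,
    }


if __name__ == "__main__":
    # Size reduction observed in the benchmark itself (no assumption needed).
    print(f"eventalign -> combine:   {BENCH_EVENTALIGN_GB / BENCH_COMBINE_GB:.1f}x smaller")
    print(f"eventalign -> data.json: {BENCH_EVENTALIGN_GB / BENCH_DATA_JSON_GB:.1f}x smaller")

    # Extrapolation to the roughly 1.5M passed reads mentioned in the question.
    for name, value in estimate(1_500_000).items():
        print(f"{name}: {value:.1f}")
```

Under that linear assumption, a library of about 1.5M reads would need on the order of several hours and produce a data.json of a few GB, far below the 600 GB reported with the older script.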

callumparr commented 3 years ago

Oh cool, that's an impressive reduction. Is that with the --genome flag added, or is it still considering all transcripts?

ploy-np commented 3 years ago

I tried with the --genome flag, but the performance should be the same without this flag. This was done with the parameters --readcount_min 15 --readcount_max 1000, which is sufficient for xpore-diffmod.

ploy-np commented 3 years ago

Note that even with the --genome flag, the software still considers all transcripts.