Closed by callumparr 3 years ago
We have improved the xpore-dataprep script to run faster and require less storage space. It is currently in the "fastdataprep" branch. In my experiment, the new script took 4 hours using 1 CPU and 5 GB of memory for 890,267 reads. For the outputs, it first processed the 196 GB nanopolish_eventalign.txt into a 26 GB nanopolish.combine file, then created a 3.5 GB data.json.
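To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted in this comment; the function names are made up for illustration, and the linear scaling with read count is an assumption rather than something measured, so treat the output as a ballpark only.

```python
# Figures reported in the comment above for the "fastdataprep" branch.
BENCHMARK_READS = 890_267
EVENTALIGN_GB = 196.0   # nanopolish_eventalign.txt
COMBINE_GB = 26.0       # intermediate nanopolish.combine
DATA_JSON_GB = 3.5      # final data.json
RUNTIME_HOURS = 4.0     # 1 CPU, 5 GB memory


def reduction_factor(before_gb: float, after_gb: float) -> float:
    """How many times smaller the output is than the input."""
    return before_gb / after_gb


def scale_linearly(value: float, target_reads: int) -> float:
    """ASSUMPTION: size and run time grow in proportion to read count."""
    return value * target_reads / BENCHMARK_READS


def estimate(target_reads: int) -> dict:
    """Ballpark sizes (GB) and run time (hours) for a library of target_reads."""
    return {
        "eventalign_gb": scale_linearly(EVENTALIGN_GB, target_reads),
        "combine_gb": scale_linearly(COMBINE_GB, target_reads),
        "data_json_gb": scale_linearly(DATA_JSON_GB, target_reads),
        "runtime_hours": scale_linearly(RUNTIME_HOURS, target_reads),
    }


if __name__ == "__main__":
    print(f"eventalign -> data.json: {reduction_factor(EVENTALIGN_GB, DATA_JSON_GB):.0f}x smaller")
    # 1.5M is a hypothetical library size chosen for illustration.
    for name, value in estimate(1_500_000).items():
        print(f"{name}: {value:.1f}")
```

Real numbers will depend on transcript coverage and on the --readcount_min / --readcount_max settings, so this is only a sanity check for disk planning.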
Oh cool, that's an impressive reduction. Is that with the --genome flag added, or is it still considering all transcripts?
I tried with the --genome flag, but the performance should be the same without this flag. This was done with the parameters --readcount_min 15 --readcount_max 1000, which is sufficient for xpore-diffmod.
Note that even with the --genome flag, the software still considers all transcripts.
Do you have any rough estimate for the run time and the expected file sizes?
My hdf5 is generated and is around 100 GB, but the data.json file has been in progress for 2 days and is creeping up to 600 GB. The eventalign file from nanopolish is close to 300 GB. I am a little worried about what the sysadmin will say. Ideally I would like to keep the files for later in case I need to run diffmod again. Do you normally reduce the number of reads to consider per transcript?
Library size is around 2M total reads and around 1.5M passed basecalled reads.
Are these file sizes expected?