BioinformaticsLabAtMUN / sRNARFTarget

A random forest-based model to predict transcriptome-wide targets of sRNAs.
GNU General Public License v3.0

Large fileset #3

Closed: genericdata closed this issue 1 year ago

genericdata commented 1 year ago

Are there any optimizations you can suggest for large files, such as Dask, Spark, or chunking? Our two source files produce a 4.6 TB file. mRNA.fa: 123754 lines; sRNA.fa: 159990 lines.

I see that all of the data is read into memory at once, and we don't have the resources to accommodate that.
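
For illustration, here is a minimal sketch of the chunking idea: stream the sRNA x mRNA cross product and flush it to disk every `chunk_size` pairs instead of building the full pair table in memory. This is not part of sRNARFTarget itself; Biopython and pandas are assumed, and file names, column names, and the chunk size are placeholders.

```python
from itertools import islice

import pandas as pd
from Bio import SeqIO  # Biopython is assumed to be installed


def iter_pairs(srna_fasta, mrna_fasta):
    """Yield (sRNA id, sRNA seq, mRNA id, mRNA seq) tuples lazily."""
    srnas = list(SeqIO.parse(srna_fasta, "fasta"))   # small enough to hold in memory
    for mrna in SeqIO.parse(mrna_fasta, "fasta"):    # stream the larger file
        for srna in srnas:
            yield srna.id, str(srna.seq), mrna.id, str(mrna.seq)


def write_in_chunks(srna_fasta, mrna_fasta, out_prefix, chunk_size=100_000):
    """Write the cross product to one small file per chunk instead of one huge table."""
    pairs = iter_pairs(srna_fasta, mrna_fasta)
    for i, chunk in enumerate(iter(lambda: list(islice(pairs, chunk_size)), [])):
        df = pd.DataFrame(chunk, columns=["sRNA_ID", "sRNA_seq", "mRNA_ID", "mRNA_seq"])
        df.to_pickle(f"{out_prefix}_chunk{i:05d}.pkl")  # one file per chunk

# write_in_chunks("sRNA.fa", "mRNA.fa", "pairs")
```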

BioinformaticsLabAtMUN commented 1 year ago

What bacterial genome are you using? Most bacteria wouldn't have this many mRNAs or sRNAs. Nevertheless, you can run each sRNA separately (i.e., provide one FASTA file per sRNA and use these single-sequence sRNA files as input). A splitting sketch is given below.
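
As a rough sketch of that suggestion (assuming Biopython; the output directory and file naming are illustrative), the multi-sequence sRNA FASTA can be split into one file per sRNA, and the pipeline run once per file against the full mRNA file:

```python
from pathlib import Path

from Bio import SeqIO  # Biopython is assumed to be installed


def split_fasta(srna_fasta="sRNA.fa", out_dir="srna_split"):
    """Write each record of a multi-FASTA to its own single-sequence FASTA file."""
    Path(out_dir).mkdir(exist_ok=True)
    paths = []
    for record in SeqIO.parse(srna_fasta, "fasta"):
        # Sanitize the record id so it is safe to use as a file name.
        safe_id = "".join(c if c.isalnum() or c in "._-" else "_" for c in record.id)
        path = Path(out_dir) / f"{safe_id}.fa"
        SeqIO.write(record, str(path), "fasta")
        paths.append(path)
    return paths

# Each returned file can then be passed to sRNARFTarget as the sRNA input
# (for example from a shell loop), so only one sRNA's pairs exist at a time.
```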

genericdata commented 1 year ago

Thank you for responding. Indeed, the data are not bacterial, and the researcher has moved on to another analysis pipeline.

genericdata commented 1 year ago

In case performance becomes an issue for others: in my fork, the largest performance increase came from writing intermediate results to pickle files rather than CSVs between each step.
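
A minimal sketch of that swap, assuming the intermediates are pandas DataFrames; the function and file names are illustrative, not the pipeline's actual ones:

```python
import pandas as pd


def save_intermediate(df: pd.DataFrame, path: str) -> None:
    # Binary pickle preserves dtypes and skips text formatting/parsing,
    # which is typically much faster than round-tripping through CSV.
    df.to_pickle(path)          # e.g. "features.pkl" instead of "features.csv"


def load_intermediate(path: str) -> pd.DataFrame:
    return pd.read_pickle(path)
```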