Closed: genericdata closed this issue 1 year ago
What bacterial genome are you using? Most bacteria wouldn't have this number of mRNAs or sRNAs. Nevertheless, you can run each sRNA separately (i.e., have a FASTA file per sRNA and use these single-sequence sRNA files as input).
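For anyone who wants to follow this suggestion, here is a minimal sketch of splitting a multi-sequence sRNA FASTA into one file per sRNA. It assumes Biopython is installed; the `sRNA.fa` input name matches the file mentioned below, and the output directory name is a placeholder.

```python
# Split a multi-sequence FASTA into single-sequence files (sketch,
# assuming Biopython is available; output directory name is arbitrary).
from pathlib import Path
from Bio import SeqIO

out_dir = Path("srna_split")
out_dir.mkdir(exist_ok=True)

for record in SeqIO.parse("sRNA.fa", "fasta"):
    # Write each sRNA to its own single-sequence FASTA file.
    # Note: record IDs containing characters invalid in filenames
    # would need to be sanitized first.
    SeqIO.write(record, out_dir / f"{record.id}.fa", "fasta")
```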
Thank you for responding. Indeed, the data is not bacterial, and the researcher has moved on to another analysis pipeline.
In case performance becomes an issue for others: in my fork, the largest performance gain came from writing intermediate results to pickle rather than CSV between each processing step.
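To illustrate the pickle-vs-CSV point: assuming the intermediates are pandas DataFrames (the file names below are placeholders, not the pipeline's actual outputs), the two round-trips look like this; pickle is binary, preserves dtypes, and is typically much faster for large frames handed between steps of the same pipeline.

```python
# Hedged sketch: CSV vs pickle for a pandas DataFrame intermediate.
import pandas as pd

df = pd.DataFrame({"sRNA": ["s1"], "mRNA": ["m1"], "score": [0.9]})

# CSV round-trip: text serialization, dtypes re-inferred on load.
df.to_csv("intermediate.csv", index=False)
df_csv = pd.read_csv("intermediate.csv")

# Pickle round-trip: binary, keeps dtypes, usually much faster to
# write and read for large intermediate tables.
df.to_pickle("intermediate.pkl")
df_pkl = pd.read_pickle("intermediate.pkl")
```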
Are there any optimizations you can suggest for large files, such as Dask, Spark, or chunking? Our two source files create a 4.6 TB file. mRNA.fa: 123,754 lines; sRNA.fa: 159,990 lines.
I see that all of the data is read into memory at once, and we don't have the resources to accommodate that.
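For what it's worth, one generic way around the memory limit is to stream both FASTA files and flush results in chunks instead of materializing the full sRNA × mRNA table. This is only a sketch, not the tool's own code: the file names match those above, but the `score()` placeholder and the output format are assumptions.

```python
# Sketch: stream sRNA x mRNA pairs with Biopython and write in chunks,
# so the full 4.6 TB result is never held in memory at once.
from Bio import SeqIO

CHUNK = 100_000  # number of pair rows buffered before flushing to disk
buffer = []

def score(srna, mrna):
    # Placeholder for whatever per-pair computation the pipeline does.
    return len(srna.seq) + len(mrna.seq)

with open("pairs.tsv", "w") as out:
    for srna in SeqIO.parse("sRNA.fa", "fasta"):
        # Re-parse mRNA.fa per sRNA so only one record from each file
        # is in memory at any time (trades I/O for memory).
        for mrna in SeqIO.parse("mRNA.fa", "fasta"):
            buffer.append(f"{srna.id}\t{mrna.id}\t{score(srna, mrna)}\n")
            if len(buffer) >= CHUNK:
                out.writelines(buffer)
                buffer.clear()
    out.writelines(buffer)  # flush any remaining rows
```

The same chunked pattern maps onto Dask or Spark if distributed execution is needed, but plain streaming with incremental writes is often enough to keep memory flat.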