HadrienG / InSilicoSeq

:rocket: A sequencing simulator
https://insilicoseq.readthedocs.io
MIT License

Write reads to disk immediately instead of caching them in RAM #210

Open sebschmi opened 2 years ago

sebschmi commented 2 years ago

First of all, thank you for making ISS. I find it very fast and easy to use, especially because it ships with error models.

When trying it out, I noticed that it uses a lot of RAM, which seemed odd for a read simulator, especially since the memory usage keeps growing slowly over the course of a run. However, I think I found the reason and a fix for it.

When generating reads, ISS first stores all of them in a Python list in RAM. It writes them to disk only after every read has been generated.

It would be much more memory efficient to write each read to disk immediately after it is generated, so that is what I did: I moved the read generation code into a generator function, reads_generator, which I pass to to_fastq.

As a result, the memory usage is now small and stays constant during generation.
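
To illustrate the idea, here is a minimal, self-contained sketch of the generator pattern. It is not the actual ISS code: the names reads_generator and to_fastq are taken from the description above, but their signatures, the random sequences, and the placeholder quality string are assumptions made only for this example.

```python
import io
import random


def reads_generator(n_reads, read_length=10, seed=0):
    """Yield (name, sequence, quality) tuples one at a time.

    Hypothetical signature for illustration; only one read exists
    in memory at any moment because nothing is appended to a list.
    """
    rng = random.Random(seed)
    for i in range(n_reads):
        sequence = "".join(rng.choice("ACGT") for _ in range(read_length))
        quality = "I" * read_length  # placeholder, not a real error model
        yield f"read_{i}", sequence, quality


def to_fastq(reads, handle):
    """Write each read as soon as it is produced; return the number written."""
    count = 0
    for name, sequence, quality in reads:
        handle.write(f"@{name}\n{sequence}\n+\n{quality}\n")
        count += 1
    return count


# Stream into an in-memory buffer here; a real run would pass an open file.
buffer = io.StringIO()
written = to_fastq(reads_generator(1000), buffer)
```

Because to_fastq pulls reads from the generator one by one, peak memory depends on the size of a single read rather than on the total number of reads requested.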