ZKI-PH-ImageAnalysis / seq2squiggle

End-to-end simulation of nanopore sequencing signals with feed-forward transformers
MIT License

Multiprocessing with IterableDataset #1

Closed by denisbeslic 2 months ago

denisbeslic commented 3 months ago

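To use multiprocessing for prediction, we would need to restructure our IterableDataset class. With num_workers > 0, each DataLoader worker receives its own copy of the dataset, so the class has to split the input between workers to avoid duplicated predictions.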

https://github.com/Lightning-AI/pytorch-lightning/issues/15734
https://colab.research.google.com/drive/1OFLZnX9y5QUFNONuvFsxOizq4M-tFvk-?usp=sharing#scrollTo=dEOL7Qh9C0vM
https://assets.ctfassets.net/yze1aysi0225/6j1vzFot8yll1FG6J4Ryis/7a3cbb50869da28faaedd39bdd0d58b8/Speechmatics_Dataloader_Pytorch_Ebook_2019__1_.pdf
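
A minimal sketch of how an IterableDataset could split its work between DataLoader workers, using `torch.utils.data.get_worker_info()`. The class `ChunkedFastaDataset`, the helper `shard`, and the `(name, start, chunk)` item layout are hypothetical and only illustrate the idea; they are not the classes used in seq2squiggle.

```python
import itertools

from torch.utils.data import IterableDataset, get_worker_info


def shard(iterable, worker_id, num_workers):
    """Keep every num_workers-th item, starting at position worker_id."""
    return itertools.islice(iterable, worker_id, None, num_workers)


class ChunkedFastaDataset(IterableDataset):
    """Hypothetical stand-in: yields fixed-size chunks of already parsed sequences."""

    def __init__(self, sequences, chunk_len):
        self.sequences = sequences  # list of (name, sequence) pairs (assumed layout)
        self.chunk_len = chunk_len

    def iter_chunks(self):
        for name, seq in self.sequences:
            for start in range(0, len(seq), self.chunk_len):
                # name and start are kept so predictions can be reassembled later
                yield name, start, seq[start:start + self.chunk_len]

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: this process handles every chunk.
            return self.iter_chunks()
        # Multi-process loading: each worker keeps only its own share.
        return shard(self.iter_chunks(), info.id, info.num_workers)
```

Such a dataset could then be passed to `DataLoader(dataset, batch_size=..., num_workers=...)`. Two caveats under these assumptions: every worker still iterates over the full input and merely skips the items of other workers, and chunks no longer arrive in their original order, which is why each item carries its read name and start position.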

denisbeslic commented 2 months ago

For future runtime improvements (multi-GPU or multi-process inference), we would need to restructure the inference pipeline and the data loading. One option is to use a map-style dataset (MapDataset) instead of an IterableDataset. This would require saving the processed chunks of the FASTA file to a temporary file first, since a map-style dataset needs indexed access to every chunk. Alternatively, we could export the trained model as a .pt file and run the inference part in a faster language such as C++ or Rust.
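
The map-style option could look roughly like the sketch below: the chunks are written to a temporary file once, and a `torch.utils.data.Dataset` reads single chunks back by byte offset. The names `write_chunks` and `ChunkFileDataset`, as well as the tab-separated line format, are assumptions made for illustration. The sketch also assumes that read names contain no tabs or newlines.

```python
from torch.utils.data import Dataset


def write_chunks(sequences, chunk_len, path):
    """Write one chunk per line and return the byte offset of each line."""
    offsets = []
    with open(path, "wb") as fh:
        for name, seq in sequences:
            for start in range(0, len(seq), chunk_len):
                offsets.append(fh.tell())
                line = f"{name}\t{start}\t{seq[start:start + chunk_len]}\n"
                fh.write(line.encode("ascii"))
    return offsets


class ChunkFileDataset(Dataset):
    """Hypothetical map-style dataset over the chunk file written above."""

    def __init__(self, path, offsets):
        self.path = path
        self.offsets = offsets

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Open the file per access so no file handle is shared between workers.
        with open(self.path, "rb") as fh:
            fh.seek(self.offsets[idx])
            fields = fh.readline().decode("ascii").rstrip("\n").split("\t")
        name, start, chunk = fields
        return name, int(start), chunk
```

Because the dataset has a length and supports indexing, the DataLoader (or a distributed sampler in a multi-GPU setup) can assign indices to workers itself, so no manual sharding is needed. The cost is the extra preprocessing pass and the disk space for the temporary file.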