AnthyG / Z-DNABERT


Fix processing large input files #5

Open AnthyG opened 4 days ago

AnthyG commented 4 days ago

Currently, files are uploaded through the browser to JupyterLab, where they are kept in memory. This approach doesn't seem to work for large files (e.g. 20 MB). However, JupyterLab has full access to the file system anyway, so the way to go will be either a file path input or simply saying "place your files in this directory".
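A rough sketch of what the file-based approach could look like, assuming plain-text FASTA input. The names `iter_fasta` and `list_input_files` and the `.fa`/`.fasta` extensions are hypothetical and not taken from the repository; the point is only that records are read from disk one at a time instead of holding the whole upload in memory.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def iter_fasta(path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file, one record at a time."""
    header = None
    chunks = []
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None:
                chunks.append(line)
    # Emit the last record once the file is exhausted.
    if header is not None:
        yield header, "".join(chunks)


def list_input_files(input_dir: Path) -> List[Path]:
    """Return the FASTA files in a (hypothetical) input directory, sorted by name."""
    return sorted(p for p in input_dir.iterdir() if p.suffix in {".fa", ".fasta"})
```

With something like this, the notebook would only need a directory or path string as input, e.g. `for header, seq in iter_fasta(Path("input/example.fasta")): ...`, where the path is a placeholder.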

AnthyG commented 3 days ago

I've added some more time displays with tqdm, so it's now clearer which steps take long.
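For illustration, per-step timing with tqdm could be sketched roughly as below. This is not the code from the repository: `run_steps` and the step names are hypothetical, and the fallback only exists so the snippet still runs where tqdm is not installed.

```python
import time

try:
    from tqdm import tqdm
except ImportError:
    # Fallback (assumption for this sketch): no progress bar, just iterate.
    def tqdm(iterable, **kwargs):
        return iterable


def run_steps(steps):
    """Run (name, callable) pairs in order and return elapsed seconds per step."""
    timings = {}
    for name, func in tqdm(steps, desc="pipeline"):
        start = time.perf_counter()
        func()
        timings[name] = time.perf_counter() - start
    return timings
```

Usage would be along the lines of `run_steps([("predict", predict), ("stitch", stitch)])`, with `predict` and `stitch` standing in for whatever the notebook actually calls.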

Plus, there's now a way (albeit very rudimentary for now) to save and restore data from the prediction step and from the stitching step. For a 30 MB FASTA input file, checking only one sequence direction, this results in a 120 MB file for the prediction step and a 230 MB file for the stitching step. The generated BED file for the example used was only about 33 KB.
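A minimal sketch of the save-and-restore idea, assuming the intermediate results can be pickled. The comment above doesn't say which format is actually used, so `pickle` and the helper names `save_checkpoint` and `load_or_compute` are assumptions here.

```python
import pickle
from pathlib import Path


def save_checkpoint(data, path):
    """Write intermediate results to disk."""
    with open(path, "wb") as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_or_compute(path, compute):
    """Restore results from `path` if it exists; otherwise compute and save them."""
    path = Path(path)
    if path.exists():
        with open(path, "rb") as handle:
            return pickle.load(handle)
    data = compute()
    save_checkpoint(data, path)
    return data
```

The prediction and stitching steps could then each be wrapped in a `load_or_compute` call, so rerunning the notebook skips a step whose output file is already there. Note that pickle files should only be loaded from trusted sources.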