fedarko / strainFlye

Pipeline for analyzing (rare) mutations in metagenome-assembled genomes

Compress TSV outputs? #53

Open fedarko opened 2 years ago

fedarko commented 2 years ago

We could compress the TSV outputs using gzip or something similar. For the SheepGut dataset, running fdr estimate using Everything (which produces one TSV file per possible decoy context) yields an output folder weighing ~1.7 GB; each TSV file is about 120 MB (including the number-of-mutations-per-Mb TSV file).

It looks like pd.read_csv() supports loading gzipped files (https://stackoverflow.com/a/39264156), so this shouldn't complicate things too much, although it might make testing a bit more difficult.
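For reference, here is a minimal sketch of what a gzipped TSV round trip could look like, using only the Python standard library so it runs without extra dependencies. The helper names (`write_tsv_gz`, `read_tsv_gz`), the file name, and the column names are hypothetical and are not taken from strainFlye's code. On the pandas side, `read_csv()` and `to_csv()` default to `compression="infer"`, so a path ending in `.gz` should be handled without further changes.

```python
import csv
import gzip
import os
import tempfile


def write_tsv_gz(path, header, rows):
    """Write a header and rows to a gzip-compressed TSV file (hypothetical helper)."""
    # "wt" opens the gzip stream in text mode; newline="" lets csv control line endings.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def read_tsv_gz(path):
    """Read a gzip-compressed TSV file back into a list of rows (hypothetical helper)."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))


# Made-up example data; the real column names and values will differ.
header = ["Contig", "Threshold", "Value"]
rows = [["edge_1", "0.5", "12.3"], ["edge_2", "0.5", "4.56"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.tsv.gz")
    write_tsv_gz(path, header, rows)
    loaded = read_tsv_gz(path)
```

Regarding testing: the gzip header stores a modification time by default, so comparing the raw bytes of two `.gz` files is not reliable. Tests would probably be simpler if they decompress the output and compare the text (or the loaded rows) instead.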