galaxyproject / tools-iuc

Tool Shed repositories maintained by the Intergalactic Utilities Commission
https://galaxyproject.org/iuc
MIT License

make sure most used tools support compressed input if possible #2312

Open martenson opened 5 years ago

martenson commented 5 years ago

The first column is the number of times the given tool triggered an implicit conversion (decompression) of a compressed fastq dataset in the last ~6 months on Main.

If we could ensure the tools can work with compressed fastq files we would save a lot of quota for the users of these tools.

 4056 | toolshed.g2.bx.psu.edu/repos/nate/trinity_psc/trinity_psc/0.0.1
 3850 | toolshed.g2.bx.psu.edu/repos/lparsons/fastq_join/fastq_join/1.1.2-806.1
 3600 | toolshed.g2.bx.psu.edu/repos/devteam/tophat2/tophat2/2.1.1
 3329 | toolshed.g2.bx.psu.edu/repos/iuc/pear/iuc_pear/0.9.6.1
 3079 | toolshed.g2.bx.psu.edu/repos/devteam/fastq_groomer/fastq_groomer/1.0.4
 3074 | toolshed.g2.bx.psu.edu/repos/devteam/bwa_wrappers/bwa_wrapper/1.2.3
 2827 | toolshed.g2.bx.psu.edu/repos/devteam/bowtie_wrappers/bowtie_wrapper/1.1.2
 1884 | toolshed.g2.bx.psu.edu/repos/devteam/fastx_barcode_splitter/cshl_fastx_barcode_splitter/1.0.1
 1536 | toolshed.g2.bx.psu.edu/repos/iuc/mothur_make_contigs/mothur_make_contigs/1.39.5.0
 1228 | toolshed.g2.bx.psu.edu/repos/devteam/tophat2/tophat2/0.6
 1226 | toolshed.g2.bx.psu.edu/repos/iuc/seqtk/seqtk_sample/1.2.0
  952 | toolshed.g2.bx.psu.edu/repos/devteam/fastq_trimmer_by_quality/fastq_quality_trimmer/1.0.0
  772 | toolshed.g2.bx.psu.edu/repos/iuc/flash/flash/1.2.11.3
  588 | toolshed.g2.bx.psu.edu/repos/devteam/tophat2/tophat2/2.1.0
  490 | toolshed.g2.bx.psu.edu/repos/devteam/kraken/kraken/1.2.3
  436 | toolshed.g2.bx.psu.edu/repos/devteam/fastq_trimmer/fastq_trimmer/1.0.0
  419 | toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_FastqToSam/2.18.2.1
  373 | toolshed.g2.bx.psu.edu/repos/pjbriggs/trimmomatic/trimmomatic/0.36.5
  361 | toolshed.g2.bx.psu.edu/repos/devteam/tophat2/tophat2/0.9
  297 | toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.1
  243 | toolshed.g2.bx.psu.edu/repos/devteam/fastx_collapser/cshl_fastx_collapser/1.0.0
  241 | toolshed.g2.bx.psu.edu/repos/iuc/kallisto_pseudo/kallisto_pseudo/0.43.1.1
  230 | toolshed.g2.bx.psu.edu/repos/devteam/fastx_trimmer/cshl_fastx_trimmer/1.0.0
  220 | toolshed.g2.bx.psu.edu/repos/devteam/fastx_artifacts_filter/cshl_fastx_artifacts_filter/1.0.0

edit: added the full list of top converting causes
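
For tools whose wrapper script is under our control, accepting both plain and gzip-compressed fastq can be as simple as checking the gzip magic bytes before opening the file. The following is only a minimal sketch of that idea; the function names `open_maybe_gzip` and `count_fastq_records` are hypothetical and are not part of Galaxy or of any tool listed above.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Open a text file for reading, decompressing on the fly if it is gzipped."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_fastq_records(path):
    """Count records, assuming the common four-lines-per-record fastq layout."""
    with open_maybe_gzip(path) as handle:
        return sum(1 for _ in handle) // 4
```

With something like this in place the tool never needs an uncompressed copy of the dataset in the history, which is where the quota saving would come from.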

martenson commented 5 years ago

I went through all of the above; the latest versions of the following tools support gz input:

martenson commented 5 years ago

this leaves us with

4056 | toolshed.g2.bx.psu.edu/repos/nate/trinity_psc/trinity_psc/0.0.1
3850 | toolshed.g2.bx.psu.edu/repos/lparsons/fastq_join/fastq_join/1.1.2-806.1
3600 | toolshed.g2.bx.psu.edu/repos/devteam/tophat2/tophat2/2.1.1
3329 | toolshed.g2.bx.psu.edu/repos/iuc/pear/iuc_pear/0.9.6.1
3074 | toolshed.g2.bx.psu.edu/repos/devteam/bwa_wrappers/bwa_wrapper/1.2.3
2827 | toolshed.g2.bx.psu.edu/repos/devteam/bowtie_wrappers/bowtie_wrapper/1.1.2
1536 | toolshed.g2.bx.psu.edu/repos/iuc/mothur_make_contigs/mothur_make_contigs/1.39.5.0
1228 | toolshed.g2.bx.psu.edu/repos/devteam/tophat2/tophat2/0.6
772 | toolshed.g2.bx.psu.edu/repos/iuc/flash/flash/1.2.11.3
588 | toolshed.g2.bx.psu.edu/repos/devteam/tophat2/tophat2/2.1.0
490 | toolshed.g2.bx.psu.edu/repos/devteam/kraken/kraken/1.2.3
436 | toolshed.g2.bx.psu.edu/repos/devteam/fastq_trimmer/fastq_trimmer/1.0.0
419 | toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_FastqToSam/2.18.2.1
361 | toolshed.g2.bx.psu.edu/repos/devteam/tophat2/tophat2/0.9
243 | toolshed.g2.bx.psu.edu/repos/devteam/fastx_collapser/cshl_fastx_collapser/1.0.0
241 | toolshed.g2.bx.psu.edu/repos/iuc/kallisto_pseudo/kallisto_pseudo/0.43.1.1

nsoranzo commented 5 years ago

martenson commented 5 years ago

I have added 'deprecated' tags to bwa_wrapper and tophat2, and pear is now hidden. This leaves us with:

4056 | toolshed.g2.bx.psu.edu/repos/nate/trinity_psc/trinity_psc/0.0.1
3850 | toolshed.g2.bx.psu.edu/repos/lparsons/fastq_join/fastq_join/1.1.2-806.1
2827 | toolshed.g2.bx.psu.edu/repos/devteam/bowtie_wrappers/bowtie_wrapper/1.1.2
1536 | toolshed.g2.bx.psu.edu/repos/iuc/mothur_make_contigs/mothur_make_contigs/1.39.5.0
772 | toolshed.g2.bx.psu.edu/repos/iuc/flash/flash/1.2.11.3
490 | toolshed.g2.bx.psu.edu/repos/devteam/kraken/kraken/1.2.3
436 | toolshed.g2.bx.psu.edu/repos/devteam/fastq_trimmer/fastq_trimmer/1.0.0
419 | toolshed.g2.bx.psu.edu/repos/devteam/picard/picard_FastqToSam/2.18.2.1
243 | toolshed.g2.bx.psu.edu/repos/devteam/fastx_collapser/cshl_fastx_collapser/1.0.0
241 | toolshed.g2.bx.psu.edu/repos/iuc/kallisto_pseudo/kallisto_pseudo/0.43.1.1

bernt-matthias commented 3 years ago

I think this is quite important since the implicitly uncompressed files unnecessarily take space in user histories.

I started

Then there are 83 tools left in IUC:

egrep 'type="data"' tools -r --include "*xml" | grep fastq | grep -v ".gz" | sort -r  | cut -d":" -f1 | uniq | wc -l
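
The same check can be sketched in Python by parsing the tool XML instead of grepping lines. This is only an approximation of the pipeline above, so the count may differ: it looks at the `format` attribute of each `<param type="data">` element and skips files that are not well-formed XML. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def accepts_only_uncompressed_fastq(param):
    """True if a <param type="data"> lists a fastq format but no *.gz format."""
    if param.get("type") != "data":
        return False
    formats = [f.strip() for f in param.get("format", "").split(",") if f.strip()]
    has_fastq = any(f.startswith("fastq") for f in formats)
    has_gz = any(f.endswith(".gz") for f in formats)
    return has_fastq and not has_gz


def tools_without_gz_support(root_dir):
    """Return sorted paths of XML files with at least one such param."""
    hits = set()
    for xml_path in Path(root_dir).rglob("*.xml"):
        try:
            tree = ET.parse(xml_path)
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        if any(accepts_only_uncompressed_fastq(p) for p in tree.iter("param")):
            hits.add(str(xml_path))
    return sorted(hits)
```

Calling `tools_without_gz_support("tools")` from the repository root would list the candidate files, and `len(...)` of the result gives the count.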

Feel free to reorder the tools by importance or put your name / PR-link next to it if you are working on it.

bernt-matthias commented 3 years ago

One potential problem that I just observed while working on minimap: "Cannot index files compressed with gzip, please use bgzip"

For the tool test I can simply use bgzip to generate the test data, but how can we ensure this in the wild?

Maybe for minimap we need to unzip the files on the fly (even if the tool supports zipped input natively).
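
One way to tell the two cases apart before deciding whether to decompress on the fly is to inspect the gzip header: as far as I understand the BGZF format described in the SAM specification, every bgzip block carries a gzip "extra" subfield with the identifier bytes `B`, `C` and a payload length of 2, which ordinary gzip output does not have. A minimal sketch under that assumption follows; `is_bgzf` is a hypothetical helper, not an existing Galaxy function.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF (bgzip) block header."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # gzip magic bytes plus the deflate compression method
        if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
            return False
        if not header[3] & 0x04:  # FEXTRA flag must be set
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)
    if len(extra) < xlen:
        return False
    # Walk the extra subfields looking for the BGZF marker 'B', 'C', length 2
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if si1 == 66 and si2 == 67 and slen == 2:
            return True
        pos += 4 + slen
    return False
```

A wrapper could then pass bgzip-compressed input straight through and fall back to decompressing only when the input is plain gzip.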

gregvonkuster commented 3 years ago

@bernt-matthias would you like me to take any of these in any particular order?

bernt-matthias commented 3 years ago

@gregvonkuster help is very welcome .. choose any :) and put your name or a PR link behind it