bxlab / galaxy-hackathon

Data intensive science for everyone.
https://galaxyproject.org/

Make fastq datasets compressible on the fly. #38

Open pvanheus opened 8 years ago

pvanheus commented 8 years ago

Uncompressed fastq is a huge waste of disk space. So, a proposal:

  1. Datasets should have attributes "compressible", "uncompressible", "compress_to", "uncompress_to" and "compressed" that flag whether this type can be compressed and which datatype it becomes once compressed.
  2. The API should be extended with a "compress" and "uncompress" method.
  3. Implicit datatype converters should be leveraged to do on-the-fly decompression for tools that need it.
  4. The tool input form should know that a compressed dataset can be used wherever an uncompressed dataset of the appropriate type would normally be accepted.
  5. A new type, FastqGz, should be created to support gzipped fastq.
  6. (Optional) uncompressed datasets should be cached.

As discussed with @ashvark and @dannon.
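
A minimal sketch of how the attributes in points 1, 4 and 5 could fit together. The class `DatatypeSketch`, the `REGISTRY` dict and the helper `usable_as` are hypothetical names made up for illustration; they are not Galaxy's actual datatype classes or registry API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DatatypeSketch:
    """Hypothetical stand-in for a datatype entry; not Galaxy's real class."""
    ext: str
    compressed: bool = False
    compress_to: Optional[str] = None    # extension this type becomes once compressed
    uncompress_to: Optional[str] = None  # extension this type becomes once uncompressed

    @property
    def compressible(self) -> bool:
        return self.compress_to is not None

    @property
    def uncompressible(self) -> bool:
        return self.uncompress_to is not None


# Hypothetical registry holding the proposed fastq / fastqgz pair.
REGISTRY: Dict[str, DatatypeSketch] = {
    "fastq": DatatypeSketch(ext="fastq", compress_to="fastqgz"),
    "fastqgz": DatatypeSketch(ext="fastqgz", compressed=True, uncompress_to="fastq"),
}


def usable_as(dataset_ext: str, wanted_ext: str) -> bool:
    """True if a dataset fits an input wanting `wanted_ext`,
    either directly or after an implicit decompression step."""
    if dataset_ext == wanted_ext:
        return True
    datatype = REGISTRY.get(dataset_ext)
    return datatype is not None and datatype.uncompress_to == wanted_ext
```

With something like this, a tool input form asking for `fastq` could also offer `fastqgz` datasets, since `usable_as("fastqgz", "fastq")` is true, while the reverse direction would need an explicit compress step.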

pvanheus commented 8 years ago

Code implicated by this:

  1. lib/galaxy/datatypes/sequence.py and lib/galaxy/datatypes/data.py and lib/galaxy/datatypes/registry.py (for parsing the datatypes_conf.xml)
  2. lib/galaxy/datatypes/converters (there will need to be new fastq->fastqgz and fastqgz->fastq ones)
  3. lib/galaxy/webapps/galaxy/buildapp.py and lib/galaxy/webapps/galaxy/api/datasets.py for an API to compress and uncompress datatypes.
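
For the converters in point 2, the core of each direction is probably just a streamed gzip copy. The sketch below uses only the Python standard library; the function names are hypothetical, and wiring them into Galaxy's converter framework (tool XML, datatypes_conf.xml entries) is not shown.

```python
import gzip
import shutil


def fastq_to_fastqgz(src_path: str, dst_path: str) -> None:
    """Stream a plain fastq file into a gzip-compressed copy."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copies in chunks, so large files are fine


def fastqgz_to_fastq(src_path: str, dst_path: str) -> None:
    """Stream a gzip-compressed fastq file back into plain text."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

Both functions stream the data rather than reading the whole file into memory, which matters for fastq files of typical size.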

frederikcoppens commented 8 years ago

When linking .fastq.gz files as an admin, the linked files got the extension .fastq, which caused problems with tools that natively support gzipped files. Adding the .gz back solved it, so admin upload/linking also needs to be checked to verify that it works with the new datatype.
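
One way to guard against this kind of extension mismatch is to check the file content instead of trusting the name. A gzip stream always starts with the two bytes 0x1f 0x8b, so a sketch could look like the following; `looks_gzipped` and `expected_ext` are hypothetical helpers, not existing Galaxy functions.

```python
def looks_gzipped(path: str) -> bool:
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


def expected_ext(path: str) -> str:
    """Hypothetical helper: choose the datatype extension from the content,
    assuming the file is already known to hold fastq data."""
    return "fastqgz" if looks_gzipped(path) else "fastq"
```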

ashvark commented 8 years ago

With the help of @dannon and @pvanheus, I made a little progress on fastq <-> fastq.gz conversion of existing datasets in histories. I would like to hear the pros and cons of these changes. You can find them in the following branch: https://github.com/ashvark/galaxy/tree/fastqCompression

TODO:

References:
https://github.com/galaxyproject/galaxy/pull/2535
https://www.e-biogenouest.org/wiki/ManArchiveGalaxy

yhoogstrate commented 8 years ago

+ref: https://github.com/galaxyproject/tools-iuc/pull/354