Apply compression to common data formats

UPPMAX / irods

Project for implementing an iRODS infrastructure on UPPMAX / SciLifeLab


Apply compression to common data formats #10

Open brainstorm opened 12 years ago

brainstorm commented 12 years ago

.sam files should not remain on the filesystem for more than X months. They can be converted to .bam automatically.

Likewise, .fastq files that have gone unused for a reasonable amount of time should be compressed to .gz, which many bioinformatics tools support natively.
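As a rough illustration of the selection step only, here is a minimal Python sketch that lists candidate files by age. The function name `find_stale`, the `max_age_days` parameter and the use of modification time as a proxy for "unused" are all assumptions made for this example. The actual threshold (the "X months" above) is still undecided, and the sketch works on a plain filesystem rather than through iRODS.

```python
import time
from pathlib import Path


def find_stale(root, suffixes, max_age_days, now=None):
    """Return files under `root` with a matching suffix that were last
    modified more than `max_age_days` days ago.

    Hypothetical helper: modification time is only a stand-in for
    "unused", and the age threshold is left to the caller.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if path.suffix in suffixes and path.stat().st_mtime < cutoff:
            stale.append(path)
    return sorted(stale)
```

A call such as `find_stale("/some/project", {".sam", ".fastq"}, max_age_days=90)` would return the candidates, where both the path and the 90 days are placeholders. What happens to them next (conversion to .bam, gzip) is a separate step.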

samuell commented 12 years ago

Very interesting ideas! I wonder, though, how that fits in with our current plan: iRODS would mostly handle data while it is stored on SweStore, and for running analyses one would check out the data as normal files (basically because handling data directly via iRODS is quite cumbersome).

I think these are the kind of things for which an IRL meeting could help ... to decide on what we should aim for in these regards.

jhagberg commented 12 years ago

Yes, I agree. This is much of what I have thought iRODS can do!

I feel that we really need a design: a concept of how to work, and of how to use iRODS as a help and a tool in the day-to-day work. Perhaps we need help from other experts in that discussion.

A direct access vault can be a good way to get around the iget/iput problem. Then we just iput the results of an analysis and use metadata to associate each result with its input files in iRODS, and so on.

jhagberg commented 12 years ago

Can you help sketch out the outline of the rules?

I can then write a periodic rule to check for files, bundle them and apply compression or just apply compression and then archive.

brainstorm commented 12 years ago

We will irsync the files that match these globs to UPPMAX:

https://github.com/SciLifeLab/bcbb/blob/master/nextgen/scripts/illumina_finished_msg.py#L239

Then, a first approach would be to look for uncompressed fastq files within iRODS (they will be under the fastq/ dir), compress them using gzip, and md5sum the resulting file.
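To make the compress-then-checksum step concrete, here is a minimal Python sketch using only the standard library. It assumes the file is available as an ordinary local file (for example after checking it out of iRODS), and the helper name `gzip_and_md5` is made up for this example. It is not an iRODS rule.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def gzip_and_md5(src):
    """Compress `src` to `<src>.gz` and return (gz_path, md5_hex).

    The checksum is taken over the compressed file, i.e. the
    "resulting file", not over the original fastq.
    Hypothetical helper; the original file is left in place.
    """
    src = Path(src)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)

    digest = hashlib.md5()
    with open(dst, "rb") as fh:
        # Read in 1 MiB chunks so large fastq files do not fill memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return dst, digest.hexdigest()
```

The sketch deliberately does not delete the uncompressed file. Removing it only after the checksum has been stored somewhere (for instance as iRODS metadata) seems the safer order.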

We want to have easy access to metadata, so the fastq folder (the biggest) should be bundled separately from the lightweight metadata files (*.xml, etc.).
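One possible way to picture the split, sketched in Python with plain tar archives standing in for iRODS bundles: everything under fastq/ goes into one archive and everything else into a second, small one. The function name `bundle_run`, the archive names and the assumption that `out_dir` lies outside the run directory are all invented for this example.

```python
import tarfile
from pathlib import Path


def bundle_run(run_dir, out_dir):
    """Write two tar archives for `run_dir`: one with the files under
    fastq/, one with all remaining (metadata) files.

    Hypothetical helper; `out_dir` is assumed to be outside `run_dir`
    so the archives do not end up inside themselves.
    """
    run_dir = Path(run_dir)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fastq_tar = out_dir / f"{run_dir.name}_fastq.tar"
    meta_tar = out_dir / f"{run_dir.name}_metadata.tar"

    with tarfile.open(fastq_tar, "w") as fq, tarfile.open(meta_tar, "w") as meta:
        for path in sorted(run_dir.rglob("*")):
            if not path.is_file():
                continue
            rel = path.relative_to(run_dir)
            # Route by top-level directory: fastq/ is the heavy part.
            target = fq if rel.parts[0] == "fastq" else meta
            target.add(path, arcname=rel.as_posix())
    return fastq_tar, meta_tar
```

With this layout the metadata archive stays small enough to fetch and inspect without touching the fastq bundle.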