Open brainstorm opened 12 years ago
Very interesting ideas! I wonder, though, how that fits in with our current plan: iRODS would mostly handle data while it is stored on SweStore, and when running analyses one would check out the data as normal files (basically because handling data directly via iRODS is quite cumbersome).
I think these are the kind of things an IRL meeting could help with, to decide what we should aim for in these regards.
Yes, I agree. This is much of what I had thought iRODS could do!
I feel that we really need a design: a concept of how to work, and of how to use iRODS as a help and a tool in the day-to-day work. Perhaps we need help from other experts in that discussion.
A direct access vault can be a good way to get around the iget/iput problem. Then we would just iput the results from an analysis and use metadata to associate each result with its input files in iRODS, and so on.
Can you help sketch up the outline of the rules?
I can then write a periodic rule to check for files, and either bundle and compress them or just compress them, and then archive.
We will irsync files that follow those globs to uppmax:
https://github.com/SciLifeLab/bcbb/blob/master/nextgen/scripts/illumina_finished_msg.py#L239
Then, a first approach would be to look for uncompressed fastq files within iRODS (they will be under the fastq/ directory), compress them using gzip, and compute the md5sum of each resulting file.
We want to have easy access to metadata, so the fastq folder (the biggest one) should be bundled independently from the lightweight metadata files (*.xml, etc.).
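To make the compress-and-checksum step concrete, here is a rough sketch in plain Python. It assumes the files are reachable as ordinary files (for example through a direct access vault); it is not an iRODS rule, and the function names `md5sum` and `compress_fastq_files` are made up for illustration.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compress_fastq_files(fastq_dir):
    """Gzip every uncompressed *.fastq file below fastq_dir.

    Returns a dict mapping each new .gz path to its MD5 checksum.
    Assumption: the uncompressed original is removed once the
    compressed copy has been written.
    """
    checksums = {}
    for src in sorted(Path(fastq_dir).rglob("*.fastq")):
        dst = src.with_name(src.name + ".gz")
        with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        checksums[dst] = md5sum(dst)
        src.unlink()
    return checksums
```

The returned checksums could then be attached to the compressed files as metadata, so the integrity of the archived copy can be verified later.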
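As a sketch of what bundling the fastq folder separately could look like, the snippet below tars only the fastq/ directory and leaves the metadata files in place. It uses Python's standard `tarfile` module on a local directory rather than any iRODS bundling facility; `bundle_fastq` and the default bundle name `fastq.tar` are hypothetical.

```python
import tarfile
from pathlib import Path


def bundle_fastq(run_dir, bundle_name="fastq.tar"):
    """Bundle only the fastq/ subdirectory of a run into one tar file.

    Metadata files next to fastq/ (e.g. *.xml) are deliberately left
    untouched so they stay directly accessible.
    """
    run_dir = Path(run_dir)
    bundle_path = run_dir / bundle_name
    with tarfile.open(str(bundle_path), "w") as tar:
        # arcname keeps paths inside the archive relative ("fastq/...").
        tar.add(str(run_dir / "fastq"), arcname="fastq")
    return bundle_path
```

No compression is applied to the tar itself here, on the assumption that the fastq files inside have already been gzipped.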
.sam files should not be present for more than X months on the filesystem. Automatic conversion to .bam can be performed.
Likewise, .fastq files that have been unused for a reasonable amount of time should be compressed to .gz, which many bioinformatics tools support natively.
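A periodic check for both of these age-based policies could start from something like the sketch below. It only selects candidate files; the conversion or compression would be a separate step. The function name `find_stale_files` and its parameters are made up, and the age threshold is left to the caller since X is not decided yet. Note that access times are only meaningful if the filesystem records them.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_stale_files(root, pattern, max_age_days, use_atime=False, now=None):
    """Return files below root matching pattern that are older than max_age_days.

    Age is measured from the modification time by default, or from the
    last access time when use_atime is True (closer to "unused").
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for path in sorted(Path(root).rglob(pattern)):
        info = path.stat()
        stamp = info.st_atime if use_atime else info.st_mtime
        if stamp < cutoff:
            stale.append(path)
    return stale
```

For example, `find_stale_files(root, "*.sam", max_age_days=90)` would list .sam files that are candidates for conversion to .bam, and `find_stale_files(root, "*.fastq", max_age_days=30, use_atime=True)` would list .fastq files that have not been read recently; the 90 and 30 day values are placeholders.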