HenrikBengtsson / aroma.seq

🔬 R package: aroma.seq: High-Throughput Sequence Analysis using the Aroma Framework
https://github.com/HenrikBengtsson/aroma.seq

REPRODUCIBILITY: Call external tools with <md5 checksum>.<ext> to make file headers stable #25

Open HenrikBengtsson opened 8 years ago

HenrikBengtsson commented 8 years ago

Some external HT-Seq tools store the "call" string in the file header of the output file. For instance, when aligning a FASTQ file, the BAM header stores the command call as a string in the CL field of the @PG (program) record, referred to here as @CL. To maximize the chance that the generated BAM file is identical (same md5 checksum) for the same input, the @CL string must be the same as well. To achieve this, the call should be made with input filenames based on the checksum of each input file rather than on the (original) filename. This can be done with symbolic file links.
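
The linking step could look roughly like the sketch below. It is written in Python purely for illustration and is not aroma.seq code; the helper names `md5_of_file()` and `link_by_checksum()` are hypothetical. It assumes a platform where symbolic links are available, and it keeps only the last filename extension (so `reads.fq.gz` would become `<md5>.gz`).

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def link_by_checksum(path, link_dir="."):
    """Create <md5>.<ext> in link_dir as a symlink to path; return the link path.

    Hypothetical helper: the link name depends only on the file content and
    its last extension, not on the original filename.
    """
    ext = os.path.splitext(path)[1]  # last extension only, e.g. ".fq"
    link = os.path.join(link_dir, md5_of_file(path) + ext)
    if not os.path.lexists(link):  # reuse an existing link with the same name
        os.symlink(os.path.abspath(path), link)
    return link
```

The external tool would then be called with the returned link path instead of the original filename, so two input files with identical content yield the same argument in the recorded command line.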

Question: Is this a good idea, or will it make the @CL too hard to interpret?

HenrikBengtsson commented 8 years ago

For the same reason, the binary/executable should be called without an absolute path, e.g. by creating a local link to it so that its path is relative to the current directory.
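
A minimal sketch of that idea, again in Python for illustration only and not taken from aroma.seq: the hypothetical helpers `link_executable()` and `run_relative()` link the executable into the working directory and invoke it as `./<name>`. It assumes a POSIX system, where a relative executable path is resolved after the change to `cwd`. Whether a given tool records the command exactly as invoked is also an assumption that would need checking per tool.

```python
import os
import shutil
import subprocess


def link_executable(name, work_dir="."):
    """Symlink the executable found on PATH into work_dir; return './<name>'."""
    target = shutil.which(name)
    if target is None:
        raise FileNotFoundError(f"{name} not found on PATH")
    link = os.path.join(work_dir, name)
    if not os.path.lexists(link):  # reuse an existing local link
        os.symlink(os.path.abspath(target), link)
    return os.path.join(".", name)


def run_relative(name, args, work_dir="."):
    """Run the tool from work_dir via its local link, i.e. as './<name> ...'."""
    cmd = [link_executable(name, work_dir)] + list(args)
    return subprocess.run(cmd, cwd=work_dir, check=True,
                          capture_output=True, text=True)
```

With this, the command is the same relative string regardless of where the tool is installed on a particular machine.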