dcjones / quip

Compressing next-generation sequencing data with extreme prejudice.
http://www.cs.washington.edu/homes/dcjones/quip/
BSD 3-Clause "New" or "Revised" License

Keep reference file #8

Open westerman opened 12 years ago

westerman commented 12 years ago

One problem I have with the BAM/SAM format is that the reference is not kept with the read file. You allude to this in your paper. Keeping old references around, especially for poorly characterized organisms, is problematic. For example, stating 'HG18' as your reference is fairly safe because that human reference set is likely to be around for the next 10 years or so. However, stating 'genome.fa' (as Illumina does for its references) is too generic. Likewise, stating 'Lycopersicon esculentum v0.1' is likely to specify a reference that will not be around for more than a year.

Since you are developing the quip format, an option (not a requirement) to embed the reference in the quip file would be useful to those of us working with little-known genomes. I suspect that you could compress the reference very nicely.

peterjc commented 12 years ago

Although not 'released' yet, the latest SAM/BAM specification in the repository does support optionally embedding the reference (as a special read). This is expected to be of particular interest for de novo assemblies but, as you point out, it makes sense for any non-model organism reference sequence, and it was discussed on the samtools-devel mailing list. See: https://github.com/samtools/sam-spec/blob/master/SAMv1.tex

dcjones commented 12 years ago

That's not terribly hard to implement, whether it be by using SAM's special read trick or a separate mechanism. I'll try to get this in the next version.
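For illustration only, here is a minimal sketch of what the "separate mechanism" option might look like: a self-describing block that carries a compressed reference plus a checksum, which a container format could store alongside the reads. Everything in it is an assumption made for the example (the `REFBLK1` tag, the header layout, the function names, and the use of LZMA); none of it is quip's actual on-disk format or SAM's special-read encoding.

```python
import hashlib
import lzma
import struct
from typing import Tuple

# Hypothetical 8-byte tag identifying an embedded-reference block.
# This is NOT part of quip or SAM; it exists only for this sketch.
MAGIC = b"REFBLK1\x00"

# Little-endian header: tag, name length, sequence length,
# MD5 of the uncompressed sequence, compressed payload length.
HEADER = struct.Struct("<8sHQ16sQ")


def pack_reference(name: str, sequence: str) -> bytes:
    """Serialize one reference sequence into a compressed block."""
    seq = sequence.upper().encode("ascii")   # normalize case before hashing
    name_bytes = name.encode("utf-8")
    digest = hashlib.md5(seq).digest()       # lets a reader verify the content
    payload = lzma.compress(seq)             # general-purpose compressor as a stand-in
    header = HEADER.pack(MAGIC, len(name_bytes), len(seq), digest, len(payload))
    return header + name_bytes + payload


def unpack_reference(block: bytes) -> Tuple[str, str]:
    """Recover (name, sequence) from a block, verifying length and checksum."""
    magic, name_len, seq_len, digest, payload_len = HEADER.unpack_from(block, 0)
    if magic != MAGIC:
        raise ValueError("not an embedded-reference block")
    offset = HEADER.size
    name = block[offset:offset + name_len].decode("utf-8")
    offset += name_len
    seq = lzma.decompress(block[offset:offset + payload_len])
    if len(seq) != seq_len or hashlib.md5(seq).digest() != digest:
        raise ValueError("embedded reference failed its integrity check")
    return name, seq.decode("ascii")
```

Storing the checksum with the sequence means a reader can confirm that the embedded reference is intact, which addresses the concern above about a label such as 'genome.fa' being too generic to identify a reference. A real implementation would presumably use a compressor tuned for nucleotide data rather than LZMA.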