It is sometimes useful to combine headers (particularly the reference sequence dictionaries) in order to write alignments from multiple input files to a single output file.
Suggestions for validations from @nh13 :
[ ] check that the HD line is the same across all inputs
[ ] check sort order (both SO line, and the output order of alignments)
[ ] merge read groups
[ ] concatenate sequence dictionaries, ensuring uniqueness of names
[ ] should we check for synonyms, e.g. chr1 vs 1?
[ ] report sequences in a sane order, e.g. custom or engineered contigs after reference genome contigs
[ ] handle collisions if both alignment files include the same reference sequence with different metadata
[ ] merge program groups
[ ] merge comments (concatenate; maybe tag with source file?)
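As a starting point for the sequence dictionary items above, here is a minimal sketch in plain Python. It assumes each `@SQ` record is a dict keyed by its SAM tags (`SN`, `LN`, `M5`), and the function name `merge_sequence_dictionaries` is hypothetical. It keeps the first occurrence of each name, raises on a same-name record with conflicting length or MD5, and returns a per-input map from old to new reference index. It does not attempt synonym detection (`chr1` vs `1`) or custom ordering.

```python
from typing import Dict, List, Tuple

# One @SQ line as a dict of SAM tags, e.g. {"SN": "chr1", "LN": 248956422}.
SeqRecord = Dict[str, object]


def merge_sequence_dictionaries(
    dictionaries: List[List[SeqRecord]],
) -> Tuple[List[SeqRecord], List[Dict[int, int]]]:
    """Concatenate @SQ records, keeping the first occurrence of each name.

    Returns the merged dictionary plus, for each input, a map from the
    input's reference index to the index in the merged dictionary.
    """
    merged: List[SeqRecord] = []
    index_by_name: Dict[str, int] = {}
    index_maps: List[Dict[int, int]] = []

    for seq_dict in dictionaries:
        index_map: Dict[int, int] = {}
        for old_index, record in enumerate(seq_dict):
            name = str(record["SN"])
            if name in index_by_name:
                existing = merged[index_by_name[name]]
                # Same name but different length or checksum: refuse to guess.
                for key in ("LN", "M5"):
                    if key in existing and key in record and existing[key] != record[key]:
                        raise ValueError(
                            f"Sequence {name!r} has conflicting {key}: "
                            f"{existing[key]!r} vs {record[key]!r}"
                        )
            else:
                index_by_name[name] = len(merged)
                merged.append(dict(record))
            index_map[old_index] = index_by_name[name]
        index_maps.append(index_map)

    return merged, index_maps
```

Note that appending new contigs in first-seen order can change the relative order of contigs for later inputs, so a coordinate-sorted input is not guaranteed to stay sorted with respect to the merged dictionary.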
It might be useful to see how reference implementations and existing tools (e.g. samtools merge) merge headers and sequence dictionaries:
htsjdk: here (see mergeSequenceDictionaries, mergeSequences, mergeReadGroups, and mergeProgramGroups)
It's also worth noting that, except in very limited cases, you can't just merge the headers without then transforming all the reads too. For example:
I'm fairly sure that if you write an AlignedSegment to an output file without resetting the header on it, it will use its stored contig index rather than the correct index in the merged header.
What if two headers both have a read-group with ID:A?
What if two headers both have a PG line with the same ID that is referenced on records?
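One possible answer to the read-group question, sketched in plain Python: reuse an ID when the two definitions are identical, and otherwise rename the later one and return a per-input ID map. The function name `merge_read_groups` and the `.N` suffix scheme are assumptions for illustration, not taken from htsjdk or samtools. Each `@RG` line is assumed to be a dict of its SAM tags.

```python
from typing import Dict, List, Tuple

# One @RG line as a dict of SAM tags, e.g. {"ID": "A", "SM": "sample1"}.
ReadGroup = Dict[str, str]


def merge_read_groups(
    per_input: List[List[ReadGroup]],
) -> Tuple[List[ReadGroup], List[Dict[str, str]]]:
    """Merge @RG records, renaming IDs that collide with a different definition.

    Returns the merged read groups plus, for each input, a map from the
    input's original read-group ID to the ID used in the merged header.
    """
    merged: List[ReadGroup] = []
    by_id: Dict[str, ReadGroup] = {}
    id_maps: List[Dict[str, str]] = []

    for read_groups in per_input:
        id_map: Dict[str, str] = {}
        for rg in read_groups:
            old_id = rg["ID"]
            if old_id in by_id and by_id[old_id] == rg:
                # Identical definition already present: share it.
                id_map[old_id] = old_id
                continue
            new_id = old_id
            suffix = 1
            # Pick the first unused ID of the form "<old_id>.<n>".
            while new_id in by_id:
                new_id = f"{old_id}.{suffix}"
                suffix += 1
            record = {**rg, "ID": new_id}
            by_id[new_id] = record
            merged.append(record)
            id_map[old_id] = new_id
        id_maps.append(id_map)

    return merged, id_maps
```

Every record from a given input would then need its `RG` tag rewritten through that input's map before being written out. The same approach could apply to `PG` IDs, with the extra step of also rewriting `PP` references between program records.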