Closed corneliusroemer closed 1 year ago
Indeed I hope that a format like this will become popular in genomic epidemiology!
I agree that having the reference and the samples in the same file usually makes things easier - I'll think about it and ask around, but I'll probably do it!
Now both formats are allowed, with the reference in the input file, or in a separate file.
I like the proposed maple format. It compresses better than .gz while remaining human readable. It could serve as an efficient alternative input for tools like github.com/lenaschimmel/sc2rf.
One thing missing to make it a lossless compression format is that the reference is apparently not explicitly included in the Maple file.
I would propose that the reference be included as the first sequence by default. One would need to find a magic name for it that doesn't conflict with any potential sequence names.
I could well imagine using the maple format as output of aligners like Nextalign. But not without inclusion of the reference in the file itself.
Would be fun to write a CLI utility like xz to compress/uncompress to maple. That would help adoption.
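To make the idea concrete, here is a rough sketch of what the core of such a utility could look like in Python. It is not the actual MAPLE implementation: the function names encode and decode are made up, the reserved name "reference" for the first record is only an assumption for illustration, and the entry layout (character, 1-based position, optional run length for N and gap stretches, tab-separated) reflects my understanding of the format rather than a checked specification. It assumes the samples are already aligned to the reference.

```python
# Hypothetical reserved record name for the reference (an assumption).
REF_NAME = "reference"


def encode(reference, samples):
    """Encode aligned samples as differences from the reference.

    samples maps sample name -> aligned sequence of the same length
    as the reference. The reference is written as the first record.
    """
    lines = [f">{REF_NAME}", reference]
    for name, seq in samples.items():
        if len(seq) != len(reference):
            raise ValueError(f"{name} is not aligned to the reference")
        lines.append(f">{name}")
        i = 0
        while i < len(seq):
            ch = seq[i]
            if ch == reference[i]:
                i += 1
                continue
            if ch in "N-":
                # Store a run of N or gap characters as one entry with a length.
                j = i
                while j < len(seq) and seq[j] == ch:
                    j += 1
                lines.append(f"{ch}\t{i + 1}\t{j - i}")
                i = j
            else:
                # Single differing character at a 1-based position.
                lines.append(f"{ch}\t{i + 1}")
                i += 1
    return "\n".join(lines) + "\n"


def decode(text):
    """Rebuild the reference and the full sequences from the encoded text."""
    rows = text.splitlines()
    if len(rows) < 2 or rows[0] != f">{REF_NAME}":
        raise ValueError("expected the reference as the first record")
    reference = rows[1]
    samples = {}
    current = None
    for row in rows[2:]:
        if not row:
            continue
        if row.startswith(">"):
            # Each sample starts as a copy of the reference.
            current = list(reference)
            samples[row[1:]] = current
        else:
            fields = row.split("\t")
            ch, pos = fields[0], int(fields[1])
            length = int(fields[2]) if len(fields) > 2 else 1
            current[pos - 1:pos - 1 + length] = ch * length
    return reference, {name: "".join(s) for name, s in samples.items()}
```

Because the reference travels in the same file, decode(encode(ref, samples)) gives back exactly the input, which is the lossless property discussed above. A real tool would also need to stream large files and settle on a reference name that cannot collide with sample names.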