Consumption of seqcol into existing file formats

andrewyatz commented 3 years ago

At the VRC/VCF meeting, the consensus was that flat-file consumers would work directly with the collection header format rather than with seqcol serialised into their native header format. The thinking was that there is no point spending time encoding into one format only to decode it again; it is much faster to consume the native seqcol header and use that. This will mean a breaking change in the formats.

nsheff commented 3 years ago

Can you specify exactly what you mean by the collection header format? Would this be some JSON blob like:

{"names": ["chrUn_KI270742v1", "chrUn_GL000216v2", "chrUn_GL000218v1"],
 "lengths": ["186739", "176608", "161147"],
 "sequences": ["2f31c013a4a8301deb8ab7ed1ca1cd99", "725009a7e3f5b78752b68afa922c090c", "1d708b54644c26c7e01c2dad5426d38c"]}

and you're suggesting this blob would appear verbatim in the VRC/VCF file?

I suppose an alternative is that the seqcol API could provide an endpoint that provided this same information in an alternative format that fit existing native header formats.
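A minimal sketch of that alternative, assuming the collection arrives as the three parallel arrays in the blob above and that the `sequences` values are MD5 digests (they are 32 hex characters, but that is an assumption). The function names are hypothetical; the `SN`, `LN` and `M5` tags of SAM `@SQ` lines and the `ID`, `length` and `md5` keys of VCF `##contig` lines are the existing native header fields.

```python
def _rows(seqcol):
    """Yield (name, length, digest) triples from the parallel seqcol arrays."""
    names = seqcol["names"]
    lengths = seqcol["lengths"]
    digests = seqcol["sequences"]
    if not (len(names) == len(lengths) == len(digests)):
        raise ValueError("seqcol arrays must all have the same length")
    for name, length, digest in zip(names, lengths, digests):
        # Lengths arrive as strings in the example blob; normalise to int.
        yield name, int(length), digest


def to_sam_sq_lines(seqcol):
    """Render a seqcol dict as tab-delimited SAM @SQ header lines."""
    return [
        f"@SQ\tSN:{name}\tLN:{length}\tM5:{digest}"
        for name, length, digest in _rows(seqcol)
    ]


def to_vcf_contig_lines(seqcol):
    """Render a seqcol dict as VCF ##contig header lines."""
    return [
        f"##contig=<ID={name},length={length},md5={digest}>"
        for name, length, digest in _rows(seqcol)
    ]


example = {
    "names": ["chrUn_KI270742v1", "chrUn_GL000216v2", "chrUn_GL000218v1"],
    "lengths": ["186739", "176608", "161147"],
    "sequences": [
        "2f31c013a4a8301deb8ab7ed1ca1cd99",
        "725009a7e3f5b78752b68afa922c090c",
        "1d708b54644c26c7e01c2dad5426d38c",
    ],
}

if __name__ == "__main__":
    print("\n".join(to_sam_sq_lines(example)))
    print("\n".join(to_vcf_contig_lines(example)))
```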

andrewyatz commented 3 years ago

Sorry, that's not what I meant. What I meant was that libraries would detect the existence of the appropriate header indicating the sequence collection identifier. The library would then request the sequence collection and parse the JSON directly into the appropriate data structures.
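For illustration, the "parse the JSON directly into the appropriate data structures" step might look roughly like this. It is only a sketch: it assumes the payload has the `names`/`lengths`/`sequences` shape shown earlier in the thread, and `SequenceRecord` and `parse_seqcol` are hypothetical names, not part of any existing library.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SequenceRecord:
    """One reference sequence as a library might hold it in memory."""
    name: str
    length: int
    digest: str


def parse_seqcol(payload: str) -> List[SequenceRecord]:
    """Turn a seqcol JSON payload into a list of per-sequence records."""
    data = json.loads(payload)
    names = data["names"]
    lengths = data["lengths"]
    digests = data["sequences"]
    if not (len(names) == len(lengths) == len(digests)):
        raise ValueError("seqcol arrays must all have the same length")
    # Lengths may be serialised as strings, so convert them to int.
    return [
        SequenceRecord(name, int(length), digest)
        for name, length, digest in zip(names, lengths, digests)
    ]
```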

jmarshall commented 3 years ago

I have not caught up with whether the group considers digest hashes, unhashed concatenated strings for digesting, or JSON blobs 🤮 as the canonical unambiguous representation of a sequence collection… but what I have been envisaging as the item that might appear in the header of a SAM or VCF file is the digest hash. For example, along with an optional informal non-normative non-canonical description to give the SAM/VCF file's human reader a clue as to what's intended:

@SD   SH:2d967306d7b589e32aaf3ed6a63c9dde   VN:1   DS:GRCh38-plus-stuff

##collection=<ID=2d967306d7b589e32aaf3ed6a63c9dde,Version=1,Description="GRCh38-plus-stuff">
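A rough sketch of how a reading library could pick the digest out of either header line. The `@SD`/`SH` and `##collection`/`ID` syntax is the proposal in this comment, not part of the current SAM or VCF specifications, and `find_seqcol_digest` is a hypothetical name.

```python
import re
from typing import Iterable, Optional

# Patterns for the *proposed* header lines; both capture a hex digest.
SAM_COLLECTION = re.compile(r"^@SD\s.*?\bSH:([0-9a-fA-F]+)")
VCF_COLLECTION = re.compile(r"^##collection=<.*?\bID=([0-9a-fA-F]+)")


def find_seqcol_digest(header_lines: Iterable[str]) -> Optional[str]:
    """Return the first sequence collection digest found, or None."""
    for line in header_lines:
        for pattern in (SAM_COLLECTION, VCF_COLLECTION):
            match = pattern.match(line)
            if match:
                return match.group(1)
    return None
```

Returning `None` when no such line exists lets a library fall back to the per-sequence header lines it already understands.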

See also https://github.com/ga4gh/seqcol-spec/issues/1#issuecomment-741781036 about the possibility of having the unhashed string to be digested embedded in SAM/VCF/etc. But only this digest hash option would help reduce the size of the header in the millions of sequences case, and using the JSON blob would also cause havoc with delimiters when embedded in another non-JSON text format.

See also this related motivational proposal, in particular slide 6.

andrewyatz commented 3 years ago

And the continuation of this is that 2d967306d7b589e32aaf3ed6a63c9dde would be passed to a seqcol endpoint, which would then return the appropriate payload to be consumed by the library.
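As a sketch of that lookup step, assuming a seqcol service that returns JSON for a digest: the `/collection/<digest>` path and the `resolve_seqcol` name are hypothetical placeholders, since the thread does not define the endpoint. The fetch function is injectable so the logic can be exercised without network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def http_get(url: str) -> str:
    """Fetch a URL and return the response body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def resolve_seqcol(digest: str, base_url: str, fetch=http_get) -> dict:
    """Look up a sequence collection by digest and return the parsed payload.

    The URL layout is an assumption made for illustration only.
    """
    url = f"{base_url.rstrip('/')}/collection/{quote(digest, safe='')}"
    return json.loads(fetch(url))
```

In an offline setting, `fetch` could be replaced with a function that reads from a local cache keyed by digest.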

sveinugu commented 3 years ago

It would be nice to have the possibility of including the seqcol output in track files in raw form. I see several situations where having self-contained files that include the sequence collection data would be useful, e.g. in secure settings where access to the internet is restricted. This would mainly be useful for including the coordinate system in the file itself, but there are probably usage scenarios for the other recursion levels as well. JSON is not the format for that, I agree. Would it be possible to define an alternative but canonical output format (without whitespace) for use in tabular files?
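To make the question concrete, here is one purely hypothetical whitespace-free encoding, sketched under the assumption that the collection has the `names`/`lengths`/`sequences` arrays shown earlier. The `field=v1,v2;field=...` layout and the function names are invented for illustration; nothing here is a proposed or agreed canonical form.

```python
FIELDS = ("names", "lengths", "sequences")
RESERVED = set(",;= \t\r\n")  # delimiters and whitespace are not allowed in values


def encode_compact(seqcol: dict) -> str:
    """Encode a seqcol dict as a single whitespace-free line (hypothetical format)."""
    parts = []
    for field in FIELDS:  # fixed field order keeps the output deterministic
        values = [str(value) for value in seqcol[field]]
        for value in values:
            if not value or RESERVED & set(value):
                raise ValueError(f"value {value!r} cannot be encoded")
        parts.append(f"{field}={','.join(values)}")
    return ";".join(parts)


def decode_compact(text: str) -> dict:
    """Invert encode_compact; all values come back as strings."""
    result = {}
    for part in text.split(";"):
        field, _, joined = part.partition("=")
        result[field] = joined.split(",") if joined else []
    return result
```

Rejecting values that contain a delimiter avoids the need for an escaping scheme, at the cost of not supporting every possible sequence name.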

ga4gh / refget

Consumption of seqcol into existing file formats #13