ga4gh-schemablocks / ga4gh-schemablocks.github.io

Website of the GA4GH SchemaBlocks Project
The Unlicense

Documenting requirements #12

Open pnrobinson opened 5 years ago

pnrobinson commented 5 years ago

The Phenopacket group is planning to use three categories to denote whether a particular field is required, recommended, or optional. Here is an example: https://phenopackets-schema.readthedocs.io/en/latest/variant.html I am wondering if this needs to be coordinated with SchemaBlocks or if there are any recommendations?

mbaudis commented 5 years ago

@pnrobinson I like that idea a lot; however, I don't know whether something like that can even be expressed as a schema element (AFAIK it isn't part of proto). So, again a question for @Relequestual. (One could have a separate object for that, but maybe there is a JSON way?)

pnrobinson commented 5 years ago

There is no way of specifying these three categories within protobuf, but we are supplying a Java validation library to implement it.
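To illustrate the idea of checking the three categories outside the schema language, here is a minimal sketch in Python (not the Java library mentioned above). The field names and their requirement levels are hypothetical and not taken from the Phenopackets schema; the assumption is simply that missing required fields are errors and missing recommended fields are warnings.

```python
from enum import Enum


class Requirement(Enum):
    REQUIRED = "required"
    RECOMMENDED = "recommended"
    OPTIONAL = "optional"


# Hypothetical requirement levels for a variant-like message (illustration only).
VARIANT_REQUIREMENTS = {
    "id": Requirement.REQUIRED,
    "genome_assembly": Requirement.REQUIRED,
    "zygosity": Requirement.RECOMMENDED,
    "info": Requirement.OPTIONAL,
}

# Values treated as "not set", mirroring how unset fields often look after decoding.
EMPTY_VALUES = (None, "", [], {})


def validate(message, requirements):
    """Return (errors, warnings) for fields that are missing from `message`."""
    errors, warnings = [], []
    for field, level in requirements.items():
        if message.get(field) not in EMPTY_VALUES:
            continue  # field is present, nothing to report
        if level is Requirement.REQUIRED:
            errors.append(f"missing required field: {field}")
        elif level is Requirement.RECOMMENDED:
            warnings.append(f"missing recommended field: {field}")
        # OPTIONAL fields are never reported
    return errors, warnings


errors, warnings = validate({"id": "v1"}, VARIANT_REQUIREMENTS)
```

Keeping the levels in a table next to the schema, rather than in the schema itself, is one way to work around the missing support in protobuf; whether SchemaBlocks would want such a table is an open question.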

mbaudis commented 5 years ago

Thought so; I'm definitely pro adopting this systematically.

Re. the allele example: it would be good if we could also use this for establishing / using a "GA4GH allele" type (or a specific PXF, Beacon ...) variant; we have the one lifted over & modified from the GA4GH schema, which has some options for structural variants. It would be good to have this moved to an agreed-upon standard (modifications and all), as an explicit "VCF-inherited & documented variant storage & transfer standard"; we need this e.g. for Beacon (responding with matched variants to wildcard queries) & have to move on it soon.

pnrobinson commented 5 years ago

It seems that the variant class from the GA4GH schema has gone a little overboard, and has too many fields that reflect the bioinformatics processing, e.g., mate_name. I would suggest that if a user needs that much detail, then they probably just want to have the FASTQ files and do everything themselves, rather than start from some summary message. But that is just one opinion and it might be good to start off by defining what we think the typical use cases are and what the requirements are?

mbaudis commented 5 years ago

MateName is related to the MateID of VCF structural variants; it is essential for translocations. It is part of the next Beacon point release, and it makes porting - and querying - of cytogenetic annotation data easy.
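For context, VCF writes a translocation as breakend records that reference each other through the MATEID INFO key. A minimal sketch of how such mates could be paired up is given below; the record layout (plain dicts with `id` and `mate_id` keys) and the example values are hypothetical and are not the GA4GH or Beacon schema.

```python
def pair_breakends(records):
    """Pair breakend records whose mate ids point at each other."""
    by_id = {record["id"]: record for record in records}
    pairs, seen = [], set()
    for record in records:
        mate = by_id.get(record.get("mate_id"))
        if mate is None or mate is record or record["id"] in seen:
            continue  # no mate in this set, self-reference, or already paired
        if mate.get("mate_id") != record["id"]:
            continue  # only accept reciprocal references
        seen.update((record["id"], mate["id"]))
        pairs.append((record["id"], mate["id"]))
    return pairs


# Hypothetical breakends for a t(9;22) translocation plus one orphan record.
records = [
    {"id": "bnd_a", "reference_name": "9", "mate_id": "bnd_b"},
    {"id": "bnd_b", "reference_name": "22", "mate_id": "bnd_a"},
    {"id": "bnd_c", "reference_name": "1", "mate_id": "bnd_x"},
]
pairs = pair_breakends(records)
```

Without a mate reference of some kind, the two ends of a translocation are just two unrelated positions, which is why the field matters for this use case.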

There are more relevant structural changes than SNPs... (not sure about this statement; depending on context... :-) )

pnrobinson commented 5 years ago

Yes, but that is not to say that this is the best way of representing them in these formats. It seems it would be better to abstract away from the VCF format, especially since there is little acceptance of this format in the community for SVs yet (different programs have a range of ways of representing SVs and translocations).

mbaudis commented 5 years ago

Well, I'm a (nearly pure...) SV person; and there is no good format (besides traditional ISCN banding annotations - so my primary method is to abstract from that, obviously accommodating more resolution ...).

I really don't care about some of the VCF "features" (assuming a static dataset w/ callsets in columns????); but somehow they have put a lot of thought into representing all (?) the crazy types of variants. This is inspirational regarding some of the representations (e.g. using a concept of fuzzy start and end for SVs, though this could be done more elegantly; acknowledging the need for fusion mapping etc.); but then VCF is a) limited by the static file structure, and b) overly permissive through headers/options (look e.g. at the 1k genomes SV files - custom mess).
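As a rough illustration of the "fuzzy start, end" concept, the sketch below stores each SV boundary as an interval rather than a single position and matches it against a ranged query. All names (`FuzzyVariant`, `matches`, the field names) and the coordinates are hypothetical; this is not the Beacon or GA4GH schema, only one possible reading of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzyVariant:
    reference_name: str
    start_min: int  # lowest plausible start position
    start_max: int  # highest plausible start position
    end_min: int
    end_max: int
    variant_type: str  # e.g. "DUP", "DEL"


def overlaps(lo1, hi1, lo2, hi2):
    """True if the closed intervals [lo1, hi1] and [lo2, hi2] intersect."""
    return lo1 <= hi2 and lo2 <= hi1


def matches(variant, reference_name, start_range, end_range, variant_type):
    """Match a variant whose fuzzy start and end both overlap the query ranges."""
    return (
        variant.reference_name == reference_name
        and variant.variant_type == variant_type
        and overlaps(variant.start_min, variant.start_max, *start_range)
        and overlaps(variant.end_min, variant.end_max, *end_range)
    )


# Hypothetical duplication with imprecise boundaries.
dup = FuzzyVariant("8", 127_000_000, 127_050_000, 128_900_000, 129_000_000, "DUP")
hit = matches(dup, "8", (127_040_000, 127_100_000), (128_000_000, 128_950_000), "DUP")
```

Treating both the stored boundaries and the query as intervals keeps low-resolution (e.g. cytogenetic) and high-resolution calls comparable in the same query.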

But IMO better as a template than HGVS; we do not want to discuss transcript ID etc. based ways to annotate variants for data exchange. Map them or lose them, reference genome or bus (for cross-resource data exchange).

So this is about a robust, reference-genome-mapping based, SV-supporting schema. Which - beyond this one here & the related Beacon allele request format (also based on VCF & the GA4GH schema) - IMO doesn't exist (well, ISCN 2016 etc., but that is still based on "Human: Deparse that string!").

So w/ respect to having a separate variant format from the limited ones you list in PXF - Yes, definitely; otherwise it wouldn't have been drafted (& used with >100k samples behind Beacons). But PXF can/should obviously offer different ways to represent variants.

But - Well, up for changes, additions, any time!