ga4gh / vrs

Extensible specification for representing and uniquely identifying biological sequence variation
https://vrs.ga4gh.org
Apache License 2.0
79 stars 32 forks source link

_id has some reserved use in database(s) #168

Closed mbaudis closed 5 years ago

mbaudis commented 5 years ago

The _id attribute has some specific use e.g. in MongoDB, where it is the auto-generated housekeeping identifier object (including time stamp, auto-indexed etc). Using an attribute named _id would override/block this object. There may be other implementations w/ similar clashes.

While the VR specification may not have to accommodate for such specific implementation issues (?) this case should be considered/discussed.

Also, in SchemaBlocks, Phenopackets objects the general use is id (see e.g. its use in combination with CURIEs in ExternalReference ({S}[B], PXF).

larrybabb commented 5 years ago

@mbaudis thank you for bringing attention to this. I believe it would be nearly impossible to avoid all name conflicts with various technologies and implementations. In the spirit of a messaging or transport model for which this spec is designed, the impetus to deal with various translations and transformations falls on the implementers.

If there is a specific technical challenge or hurdle that can be described more clearly, then please clarify so that we can consider and discuss the weight and impact of our design decisions on the implementers.

mbaudis commented 5 years ago

@larrybabb As you say, transformation for the different implementations will be needed anyway.

But I wonder if there was a specific reasoning behind the underscore-id, which e.g. isn't used in Phenopackets or the original GA4GH schema. This is coming from my special interest in having a "harmonized" way in schema design across different projects and to have as many cross-cutting standards/elements as possible, documented in SchemaBlocks.

So maybe re-think for the next revision :-)

larrybabb commented 5 years ago

I'll let @reece weigh in. There is very good reasoning for the underscore-id. Please refer to this section of the specification for details. In essence, it is an optional field and indicates that it is not a field that should be used when serializing the object for the purpose of digest generation. This decision involved many hours of discussion, review and revision over several months.

If there are specifications, constraints or restrictions that schemablocks has defined then we certainly would like to understand how this work is related and will be impacted by them.

mbaudis commented 5 years ago

Well, I don't think that underscore use for optional arguments is the best solution (or any "format of attribute name" implication); e.g. in JSON Schema one defines the required properties in a root attribute of the schema. However, dialect discussions can be endless; and I just wanted to put this up for future consideration...

Regarding the SchemaBlocks direction: Formats are right now still evolving, with the core agreement being to use a consistent JSON Schema format. But an interesting example for interaction, with respect to VR, could e.g. the re-use of the CURIE definition - we use them everywhere for id definitions, but currently as "string" with documentation describing the form. We'll probably may pick up your example & generate a primitive based on the VR blueprint, and use this then for referencing ... I'll let you know when we have something to show - anyway, this is a typical example for the envisioned cross-project alignments. Pinging @MKonopko.

MKonopko commented 5 years ago

@mbaudis are you asking me to put you in touch with the appropriate VR folks? It would be @reece and @larrybabb and @ahwagner

reece commented 5 years ago

Narrow question of Mongo

I'm no mongo expert, but I don't see the problem. Mongo seems to be able to insert an allele taken straight from the docs and find that allele based on id. Am I missing something?

It seems to me that it's advantageous that mongo will use a computed identifier as its internal identifier.

> db.allele.insert({   "_id": "ga4gh:VA.n9ax-9x6gOC0OEt73VMYqCBfqfxG1XUH",   "location": {     "interval": {       "end": 32936732,       "start": 32936731,       "type": "SimpleInterval"     },     "sequence_id": "ga4gh:SQ._0wi-qoDrvram155UmcSC-zA5ZK4fpLT",     "type": "SequenceLocation"   },   "state": {     "sequence": "C",     "type": "SequenceState"   },   "type": "Allele" })
WriteResult({ "nInserted" : 1 })

> db.allele.find({"_id" : "ga4gh:VA.n9ax-9x6gOC0OEt73VMYqCBfqfxG1XUH"})
{ "_id" : "ga4gh:VA.n9ax-9x6gOC0OEt73VMYqCBfqfxG1XUH", "location" : { "interval" : { "end" : 32936732, "start" : 32936731, "type" : "SimpleInterval" }, "sequence_id" : "ga4gh:SQ._0wi-qoDrvram155UmcSC-zA5ZK4fpLT", "type" : "SequenceLocation" }, "state" : { "sequence" : "C", "type" : "SequenceState" }, "type" : "Allele" }

Why underscore / why not id?

In a perfect world, all tools for all target languages would support similar functionality that we could rely on when writing a spec like VR. That's fairyland.

Not all JSON Schema parsers expose the required attributes. Without that, it would be impossible for anyone to implement the binary serialization required for computed identifiers. Thus, we chose underscore as a way to denote optional fields that should be excluded from serialization.

We considered id. It would work equally well to modify the serialization spec to exclude the id attribute by name. Instead, we preferred underscore for the reasons you've already read.

In my opinion, we made the better decision given the constraints we faced.

Technical Vision

GA4GH is a collection of (at best) very loosely connected efforts. I applaud efforts, including yours with schema blocks, to create a coherent technical vision. However, such a vision thus far remains elusive for us. It's hard enough to be consistent with external and internal approved standards, let alone nascent and unapproved ones (like schema blocks). When schema blocks matures and is approved, all products should consider whether and how to adapt to it. Until then, we need to be adaptable and recognize that we're all doing our best to converge to an interoperable system.

mbaudis commented 5 years ago

Sorry, @reece, your Mongo example exactly shows what what I mean: You are overriding the internal _id housekeeping object with a -by itself perfectly legit - value. This is probably a design error (feature?) in MongoDB itself; I just use this to highlight a potential problem arising from using "specially named parameters".

Your example shows that the replacement for the standard _id value works, since you can override it (I do this all the time, in a controlled environment). Which is bad. A user of the schema in MongoDB would replace the housekeeping "_id" : ObjectId("5bab576a727983b2e00b8d32") of a type ObjectId with a different value. Which is fine if you do it consciously, but dangerous if it is done by just using a schema blueprint. So there goes your example.

But I've mentioned in 2x posts above that the _id in MongoDB isn't that much of an issue, since each schema should be treated independently of a specific implementation. It is just an example of where an inferred specific use of an attribute in a specification clashes with an inferred specific use of an attribute in an implementation.

Why not using underscore

One should not use special name structures to impose specific meaning/scope, if there are other ways to achieve this goal. Yes, I use underscore to indicate scope-private subroutines; but these aren't supposed to be reused.

No, we should not look at "not all JSON Schema parsers expose the required attributes". Parsers aren't standards.

That being said & re-iterating previous comments: There is no specific problem using _id with your intentions in the limited context of the VR spec. But this is a broader view, considering an emerging GA4GH standards landscape - which you should consider.

Broader vision

Regarding larger harmonization approaches, I cannot interfere with specific projects' design processes. However, I - and others- will, occasionally, make the individual projects aware of general/principle/easy-to-fix issues, if becoming aware of them. And this may become part of the SC review process (I hope).

And, yes, the SchemaBlocks mission statement says

Work streams will continue to create standards proposals and their own coherent project implementations, but will work with the SchemaBlocks group to write the Blocks that will come from their own work and are considered of overarching use. Generally, primary work stream and driver project outputs will live in their own spaces outside of SchemaBlocks, with shareable, mature elements - code, documentation, implementation snapshots being represented in {S}[B].

{S}[B] is not something you "adopt"; it is something you use to document your work and align it with other GA4GH projects. Generally, and forward looking, It may be good to interact a bit more beyond your specific project. There will be help, leveraging your expertise to a broader level.

Pinging @ahwagner @andrewyatz, too.

reece commented 5 years ago

I'll reopen here if people want to comment.

mbaudis commented 5 years ago

@reece +1

andrewyatz commented 5 years ago

I'm still unclear on what the actual impact of the use-case brought up here is. Again I think it's my own lack of experience with Mongo. Does mongo expect it can control this field? If it cannot what's the impact? Also are we confusing internal storage for external transmission format i.e. should we even care because we don't intend the schema to be used in databases such as Mongo

mbaudis commented 5 years ago

@andrewyatz The MongoDB example was just used to highlight that overloading attribute names with implicit special meaning may not be the best option. And that e.g. moving to JSON Schema would a) provide a clean solution and b) align with SchemaBlocks. So this was just intended to provide some hints for possible future re-shaping (the spec will get more complex anyway); not as an immediate call for action.

andrewyatz commented 5 years ago

VR is already represented in JSON schema. Are you referring to the fact that VR hasn't reused existing blocks/schema from SchemaBlocks? If so then future steps could be to align closer as suggested and as you've said it's not an immediate call for action.

As for the use of _ to denote internal/non-serialisable fields We've encountered an unfortunate situation with a popular tech solution. I agree with @reece that this is the best compromise given the constraints.