ga4gh / ga4gh-server

Reference implementation of the APIs defined in ga4gh-schemas. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Allow "loose" referential integrity #1528

Open david4096 opened 7 years ago

david4096 commented 7 years ago

We shouldn't place a foreign key requirement for references when someone does not have them, or doesn't want to manage them. This means that a VCF should be able to be added with no other data (other than a dataset) present in a server.

The catch is that it becomes unclear which reference names can be used to query a variant set. From the perspective of the server, that is a data management problem: adding a reference set gives the more full-fledged offering, but it shouldn't be required.

david4096 commented 7 years ago

To close this issue we should make it possible to add variant sets without a reference set added. This is possible because variants use the referenceName in their search request. However, for Reads search, the explicit reference ID is used, so we can't offer the same feature as easily.

One way to still allow a BAM to be queryable without adding a reference would be to create a synthetic reference based on the BAM index or headers and add it to the registry. VCF headers do not contain enough information to know what references are used, but by reflecting on the tabix index we might be able to do something similar.
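
As a rough illustration of the synthetic reference idea, the sketch below reads the `@SQ` lines of a SAM/BAM header (as text, e.g. what `samtools view -H` prints) and turns each into a minimal reference record. The function name and the record layout are hypothetical and are not part of the server's registry code; only the `@SQ`, `SN` and `LN` fields come from the SAM format itself.

```python
def synthetic_references_from_sam_header(header_text):
    """Build minimal reference records from the @SQ lines of a SAM/BAM header.

    Hypothetical helper: the dict layout is an assumption, not the
    server's actual registry schema.
    """
    references = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # After the record type, fields are TAB-separated TAG:VALUE pairs.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        references.append({
            "name": tags["SN"],          # sequence name, e.g. "chr1"
            "length": int(tags["LN"]),   # sequence length in bases
            "synthetic": True,           # mark as derived, not loaded from FASTA
        })
    return references


header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chrM\tLN:16569\n"
)
refs = synthetic_references_from_sam_header(header)
```

Records built this way carry only a name and a length, so they would support range queries but not base retrieval, which still needs the sequence itself.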

For RNA, making the reference set optional presents no real differences in access pattern. I think there are some features that may only be present in specific gene builds, but that relational information is captured by the FeatureSetIDs.

ejacox commented 7 years ago

Here is my view:

Currently, we require that the server has the references loaded. We then use internally generated ids to refer to those references from other sets (tables). The problem is that it could be unrealistic to expect every server to have all the necessary references stored internally. We should be able to use references that are defined outside of the server.

We would still like to maintain referential integrity. It just doesn't need to be enforced by the database. This can be done during ingest, ensuring that all reference ids are known, either internally or externally, much like ontology terms. Alternatively, referential integrity could be checked later or even as part of compliance.
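
A minimal sketch of what such an ingest-time check might look like, assuming reference ids are plain strings and that the sets of internally loaded and externally declared ids are available up front. `UnknownReferenceError` and `check_reference_ids` are made-up names for illustration only.

```python
class UnknownReferenceError(Exception):
    """Raised when ingested data points at a reference nobody has declared."""


def check_reference_ids(used_ids, internal_ids, external_ids=()):
    """Verify every reference id used by incoming data is known somewhere.

    The check runs in application code at ingest time, so the database
    itself does not need a foreign key constraint.
    """
    known = set(internal_ids) | set(external_ids)
    missing = sorted(set(used_ids) - known)
    if missing:
        raise UnknownReferenceError(
            "unknown reference ids: " + ", ".join(missing))
    return True


# "chr1" is loaded in the server; "chrM" is only declared externally.
ok = check_reference_ids(["chr1", "chrM"],
                         internal_ids=["chr1"],
                         external_ids=["chrM"])
```

The same function could run later as a batch audit or as part of a compliance suite, since it only needs the three sets of ids.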

In the short term, we can turn off foreign key checks. Longer term, this should be addressed in the larger discussion involving external ids, ontology terms, federated queries, etc.
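
To make the short-term option concrete, here is a small sketch using Python's built-in `sqlite3` module and SQLite's `PRAGMA foreign_keys` switch. It assumes a SQLite-backed registry; the two-table schema is invented for the example and is not the server's real one.

```python
import sqlite3

# isolation_level=None gives autocommit, so the PRAGMA is never issued
# inside an open transaction (where SQLite would silently ignore it).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE reference_set (id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE variant_set ("
    " id TEXT PRIMARY KEY,"
    " reference_set_id TEXT REFERENCES reference_set(id))"
)


def try_insert_variant_set(enforce_foreign_keys):
    """Insert a variant set whose reference set is not in the database."""
    conn.execute(
        "PRAGMA foreign_keys = " + ("ON" if enforce_foreign_keys else "OFF"))
    try:
        conn.execute(
            "INSERT INTO variant_set VALUES (?, ?)", ("vs1", "GRCh38"))
        return True
    except sqlite3.IntegrityError:
        return False


strict_ok = try_insert_variant_set(enforce_foreign_keys=True)   # rejected
loose_ok = try_insert_variant_set(enforce_foreign_keys=False)   # accepted
```

With enforcement off, the dangling `reference_set_id` is stored as-is, which is why some other check (at ingest, later, or in compliance) is still needed to keep the data consistent.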

david4096 commented 7 years ago

Thanks @ejacox, that's a good summary of the situation. The feature is: I don't need a FASTA to load data into the server. The aspiration is to maintain easy access patterns when data are distributed.