ga4gh / ga4gh-schemas

Models and APIs for Genomic data. RETIRED 2018-01-24
http://ga4gh.org
Apache License 2.0

Raw (original) data record for all data schemas #71

Closed pgrosu closed 10 years ago

pgrosu commented 10 years ago

Hi Everyone,

This is just a friendly suggestion, but could we have in each of the (non-method) data schemas (e.g. reads.avdl, variants.avdl) a record as follows that stores the URI of the raw file? Too many times I have seen schemas change dramatically after a significant period of time, where portions of the data were later deemed important and access to the original raw files was needed - I won't even mention what it took to update the schema with the additional data. This way each read, variant, readgroup, etc. can reference its associated raw file. Below is a suggested record:

record GAOriginalData {

  /* This is an ID to use in a data record to reference the original raw data */  
  string ID;

  /* This stores a link to the original data - this can be an FTP, Google Cloud, etc. */
  string URI;

}

Thanks, Paul
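
To make the proposal above concrete, here is a minimal Python sketch of how a data record might point at a GAOriginalData entry by ID. Only the ID and URI fields come from the suggested record; the `Read` class, the `original_data_id` field name, the `registry` lookup, and the example URI are all hypothetical and are not part of any GA4GH schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class GAOriginalData:
    """Mirrors the suggested record: an ID plus a link to the raw data."""
    id: str
    uri: str


@dataclass(frozen=True)
class Read:
    """Hypothetical, heavily trimmed read record (not the real schema)."""
    id: str
    original_data_id: Optional[str] = None  # assumed field name


def resolve_original(read: Read, registry: Dict[str, GAOriginalData]) -> Optional[str]:
    """Return the raw-data URI for a read, or None if no source is known."""
    if read.original_data_id is None:
        return None
    source = registry.get(read.original_data_id)
    return source.uri if source else None


# Illustrative data only; the URI is made up.
registry = {
    "orig-1": GAOriginalData(id="orig-1", uri="ftp://example.org/raw/sample1.bam"),
}
read = Read(id="read-42", original_data_id="orig-1")
print(resolve_original(read, registry))
```

With this layout, many reads can share a single GAOriginalData entry, so the URI is stored once and can be corrected in one place.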

adamnovak commented 10 years ago

That looks like a good thing for common.avdl. Instead of GAOriginalData maybe it should be something like GAImportedFrom, to make it clear that we want a file that is semantically the same as all the data presented through the API, if available. Or in the case of an in-the-cloud aligner would we want it to point to the raw FASTQ?

Do we want to have a full provenance-tracking system?

richarddurbin commented 10 years ago

This is related to the sourceURI field I put into ReferenceSequence and ReferenceSequenceSet. If this is adopted I could adjust those to point to one of these records. But why have an intermediate id, rather than just have a string sourceURI field in relevant primary record types?

Also, though I support doing this where possible, I think that we will move to allow writing at least some things in the interface, for which there will not be a URI available.

Richard
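
The alternative raised here, a plain sourceURI string on the primary record with no intermediate ID, could be sketched roughly as follows. This is an illustration under stated assumptions: the `ReferenceSequence` class below is a hypothetical, trimmed stand-in for the real record, and treating a missing URI as `None` is one possible way to represent data written through the interface, not something the schema prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReferenceSequence:
    """Hypothetical, trimmed stand-in for the real record."""
    id: str
    # Assumption: None means there is no source file, for example when
    # the record was written through the interface rather than imported.
    source_uri: Optional[str] = None


def describe_source(seq: ReferenceSequence) -> str:
    """Summarise where a sequence came from, if that is known."""
    if seq.source_uri is None:
        return f"{seq.id}: no source URI available"
    return f"{seq.id}: imported from {seq.source_uri}"


imported = ReferenceSequence(id="seq-1", source_uri="ftp://example.org/ref/chr1.fa")
written = ReferenceSequence(id="seq-2")
print(describe_source(imported))
print(describe_source(written))
```

The trade-off against the ID-based record is that the URI is repeated on every record that shares a source, in exchange for one less lookup.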

pgrosu commented 10 years ago

I guess creating a new issue worked, and just took a little time to show up :)

@adamnovak, so I will expand on each:

@richarddurbin, so I will expand on each:

array<string> URI;

This would provide redundancy for the same dataset. If the same data is available in multiple places, the user could still get it from at least one of them. If one of the links is the publication, then that would give context on how the data was generated, validation of the way it was prepared, and contact information.
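
The redundancy idea can be illustrated with a small Python sketch that walks a list of URIs and returns the first one that can be fetched. The function name `fetch_first_available`, the injected `fetch` callable, and the example URIs are assumptions made for illustration; a real client would plug in its own FTP or cloud-storage download function.

```python
from typing import Callable, Iterable, List, Tuple


def fetch_first_available(
    uris: Iterable[str],
    fetch: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Try each URI in order; return (uri, data) for the first success."""
    errors: List[str] = []
    for uri in uris:
        try:
            return uri, fetch(uri)
        except OSError as exc:  # network and file errors derive from OSError
            errors.append(f"{uri}: {exc}")
    raise OSError("no location reachable: " + "; ".join(errors))


def fake_fetch(uri: str) -> bytes:
    """Stand-in downloader: pretends the first mirror is offline."""
    if uri.startswith("ftp://down."):
        raise OSError("connection refused")
    return b"raw-bytes"


mirrors = [
    "ftp://down.example.org/sample1.bam",
    "gs://example-bucket/sample1.bam",
]
used_uri, data = fetch_first_available(mirrors, fake_fetch)
print(used_uri)
```

Passing the downloader in as a parameter keeps the fallback logic independent of any particular storage backend.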

dglazer commented 10 years ago

Closing this issue, at least for now, as part of tidying up for the imminent 0.5 release. We can re-open if and when it becomes a pressing need.