Raw (original) data record for all data schemas

pgrosu commented 10 years ago

Hi Everyone,

This is just a friendly suggestion, but could we have in each of the (non-method) data schemas (reads.avdl, variants.avdl, etc.) a record as follows that stores the URI to the raw file? Too many times I have seen schemas change dramatically after a significant period of time, where portions of the data were later deemed important and access to the original raw files was needed. I won't even mention what it took to update the schema with the additional data. This way each read, variant, readgroup, etc. can reference its associated raw file. Below is a suggested record:

record GAOriginalData {
  /** This is an ID to use in a data record to reference the original raw data */
  string ID;

  /** This stores a link to the original data - this can be an FTP, Google Cloud, etc. */
  string URI;
}

Thanks, Paul

adamnovak commented 10 years ago

That looks like a good thing for common.avdl. Instead of GAOriginalData maybe it should be something like GAImportedFrom, to make it clear that we want a file that is semantically the same as all the data presented through the API, if available. Or in the case of an in-the-cloud aligner would we want it to point to the raw FASTQ?

Do we want to have a full provenance-tracking system?

richarddurbin commented 10 years ago

This is related to the sourceURI field I put into ReferenceSequence and ReferenceSequenceSet. If this is adopted I could adjust those to point to one of these records. But why have an intermediate id, rather than just have a string sourceURI field in relevant primary record types?

Also, though I support doing this where possible, I think that we will move to allow writing at least some things in the interface, for which there will not be a URI available.
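
A minimal sketch of the direct-field alternative described above, assuming a nullable sourceURI so that records written through the interface can leave it empty. The protocol, namespace, and GAExampleRecord names are hypothetical and used only for illustration; the actual ReferenceSequence definition is not reproduced here.

```avdl
@namespace("org.ga4gh.example")  // hypothetical namespace
protocol SourceUriSketch {

  /** Hypothetical primary record, for illustration only. */
  record GAExampleRecord {
    /** Identifier of this record. */
    string id;

    /**
    The URI the data was imported from, or null when no source
    file exists (for example, data written through the API).
    */
    union { null, string } sourceURI = null;
  }
}
```

With this shape there is no intermediate record to look up, at the cost of repeating the URI string on every record that shares a source.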

Richard

pgrosu commented 10 years ago

I guess creating a new issue worked, and just took a little time to show up :)

@adamnovak, so I will expand on each:

Regarding GAOriginalData vs. GAImportedFrom, I have no preference, as long as it is general enough that people understand what raw data file(s) were used to populate a specific set of records. Sometimes the link might point to a collection of data files, and the associated metadata, since there can be both BAM and their derived FASTQ files. If it was an in-the-cloud aligner, the FASTQ files and, if possible, the settings used would be helpful.
Of course one would want to know how those BAM or BCF files were generated, and having a full provenance-tracking system is ideal, but that may be asking too much for now. It is probably being taken care of by the Metadata team, which would provide confidence in the dataset from which the read and variant records are populated.
Having it in common.avdl makes a lot of sense; superclasses are usually my favorites :)

@richarddurbin, so I will expand on each:

Looking back at #66, I understood sourceURI to refer just to the reference sequence, but as you mentioned, updating it to point to something like this would help.
Regarding having an id vs. just storing a string in each record: an id would allow multiple reads, for instance, to point to the same file or collection of files, and passing an id is cheaper than passing a string or an array of strings. For instance, let's say we have the following:

array<string> URI;

This would provide redundancy for the same dataset. If the same data is available in multiple places, it would ensure that the user can get it from at least one place. If one of the links is the publication, then that would give context on how the data was generated, validation of the way it was prepared, and contact information.
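
Put together, the ID-based variant might look like the following sketch. It assumes the array<string> form of the URI field shown above; the protocol, namespace, GAExampleRead, and originalDataId names are hypothetical and only illustrate how a data record could reference the shared entry.

```avdl
@namespace("org.ga4gh.example")  // hypothetical namespace
protocol OriginalDataSketch {

  record GAOriginalData {
    /** ID used by data records to reference this entry. */
    string ID;

    /** One or more locations (FTP, cloud storage, publication) of the same raw data. */
    array<string> URI = [];
  }

  /** Hypothetical data record, for illustration only. */
  record GAExampleRead {
    /** Identifier of this read. */
    string id;

    /** ID of the GAOriginalData entry this read came from, or null if unknown. */
    union { null, string } originalDataId = null;
  }
}
```

Many reads can then share one GAOriginalData entry, so the list of URIs is stored once rather than on every read.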

This is just an addition to all the nice work that is going on right now, motivated by having dealt with "I am about to submit for publication, where is the raw data?" or "I forgot to look for reads that were mapped to multiple reference sequences." I agree that the interface is most important for enabling the next generation of large-scale sequence analysis, but this would give us the ability to properly update/migrate extra information, if later deemed important, into existing records, which would already have been populated at a cloud-storage location such as Google.

dglazer commented 10 years ago

Closing this issue, at least for now, as part of tidying up for the imminent 0.5 release. We can re-open if and when it becomes a pressing need.

ga4gh / ga4gh-schemas

Raw (original) data record for all data schemas #71