cidgoh / DataHarmonizer

A standardized browser-based spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 and Monkeypox sampling data. This project, created by the Centre for Infectious Disease Genomics and One Health (CIDGOH), at Simon Fraser University, is now an open-source collaboration with contributions from the National Microbiome Data Collaborative (NMDC), the LinkML development team, and others.
MIT License
94 stars 27 forks source link

CanCOGeN template export to VirusSeq Data Portal format #199

Closed griffie closed 3 years ago

griffie commented 3 years ago

In the VirusSeq Data Portal export, numerical values (decimals, dates, integers etc) must be separated from categorial null values so that submissions can be validated. We need to create a few new fields for some additional null value fields and upon export, the DH will need to shunt numerical values into the appropriate field and null values into the null value field.

Can we have these 3 new fields to incorporate the null values in the VirusSeq export?

  1. "diagnostic pcr ct value null reason"

    • the PCR value (can be decimal) will go to "diagnostic pcr ct value" and the null value will go to "diagnostic pcr ct value null reason"
  2. "breadth of coverage null reason"

    • the breadth coverage value (can be decimal) will go to "breadth of coverage" and the null value will go to "breadth of coverage null reason"
  3. "depth of coverage null reason"

    • the depth coverage value (can be decimal) will go to "depth of coverage" and the null value will go to "depth of coverage null reason"

Can we also add a new field called "fasta header name"? It should be a string and should come from the "isolate" field in the CanCOGeN template.

Can we also have a new field be added in the export called "study_id"? This field should be left blank upon export and will be filled by the data submitter.

The export should deliver the fields in the following order: study_id specimen collector sample ID gisaid accession sample collected by sequence submitted by sample collection date sample collection date null reason geo_loc_name (country) geo_loc_name (state/province/territory) organism isolate fasta header name purpose of sampling purpose of sampling details anatomical material anatomical part body product environmental material environmental site collection device collection method host (scientific name) host disease host age host age null reason host age bin host age unit host gender purpose of sequencing purpose of sequencing details sequencing instrument sequencing protocol consensus sequence software name consensus sequence software version raw sequence data processing method dehosting method gene name diagnostic pcr Ct value diagnostic pcr ct value null reason breadth of coverage value breadth of coverage null reason depth of coverage value depth of coverage null reason reference genome accession bioinformatics protocol

Thanks!

ddooley commented 3 years ago

Awaiting final client feedback on this one before implementing change.

griffie commented 3 years ago

UPDATE: We need 2 new fields to be added which are not in the CanCOGeN template:

  1. study_id The study_id will be filled in by the provider, the field just needs to be provided in the export.

  2. fasta header name The value in the "isolate" field should copy over to the "fasta header name".

We need to break up the numerical/date fields for a few fields. New fields required are: sample collection date null reason host age null reason diagnostic pcr Ct value null reason

How data should be transformed from the CanCOGeN template to the VirusSeq Data Portal export:

  1. sample collection date --> divides into "sample collection date" which contains the actual ISO format dates and "sample collection date null reason" which contains the null values.
  2. host age --> age values go in "host age", null values should go in "host age null reason".
  3. diagnostic pcr Ct value --> CT values go in "diagnostic pcr Ct value" and null values go in "diagnostic pcr Ct value null reason".

The export should deliver the fields in the following order: study_id specimen collector sample ID gisaid accession sample collected by sequence submitted by sample collection date sample collection date null reason geo_loc_name (country) geo_loc_name (state/province/territory) organism isolate fasta header name purpose of sampling purpose of sampling details anatomical material anatomical part body product environmental material environmental site collection device collection method host (scientific name) host disease host age host age null reason host age bin host age unit host gender purpose of sequencing purpose of sequencing details sequencing instrument sequencing protocol consensus sequence software name consensus sequence software version raw sequence data processing method dehosting method breadth of coverage depth of coverage reference genome accession bioinformatics protocol gene name diagnostic pcr Ct value diagnostic pcr ct value null reason

Thanks!

ddooley commented 3 years ago

It looks like this is the current list of fields from their API. I presume we align with their field names & case, but do ordering as you have done above?

study_id specimen collector sample ID sample collected by sequence submitted by sample collection date sample collection date null reason geo_loc_name (country) geo_loc_name (state/province/territory) organism isolate fasta header name purpose of sampling purpose of sampling details anatomical material anatomical part body product environmental material environmental site collection device collection method host (scientific name) host disease host age host age null reason host age unit host age bin host gender purpose of sequencing purpose of sequencing details sequencing instrument sequencing protocol raw sequence data processing method dehosting method consensus sequence software name consensus sequence software version breadth of coverage value depth of coverage value reference genome accession bioinformatics protocol gene name diagnostic pcr Ct value GISAID accession diagnostic pcr Ct value null reason

And I can see their end is case sensitive to the above.

ddooley commented 3 years ago

The comparison: image

ddooley commented 3 years ago

P.s. ordering on Viruseq side doesn't matter in terms of uploading.

griffie commented 3 years ago

Hmmm, their values should have mirrored our fields so it shouldn't be breadth of coverage value it should be breadth of coverage, but it's not worth getting them to change it and causing more delay at this point.

You're right, they are very particular about capitalization and everything has to be exactly as they prescribe. I've taken a look and all the fields are there in their API list so let's rock with they way they have it set out.

ddooley commented 3 years ago

So this is mostly implemented on my dev version. One thing: VirusSeq "diagnostic pcr Ct value" is supposed receives data from CanCoGen "diagnostic pcr Ct value 1". VirusSeq has a field "diagnostic pcr Ct value null reason", but CanCoGen "diagnostic pcr Ct value 1" has no null value list. Should we still pick out values like Missing, etc. that might occur in "diagnostic pcr Ct value 1" and place them in "diagnostic pcr Ct value null reason"?

griffie commented 3 years ago

Right.

Ok, so we will assume that if they are not providing a CT value that they will not provide a gene name, so any null value found in "gene name" should be mapped to "diagnostic pcr Ct value null reason".

That work?

ddooley commented 3 years ago

"diagnostic pcr Ct value null reason" was set up to strip text from "diagnostic pcr Ct value 1" in order to prevent database errors, right? So primary function is to remove words like Missing from it, if user entered them by hand I guess. I wouldn't want "gene name" dataStatus stuff showing in it. The other possibility is that VirusSeq "diagnostic pcr Ct value" isn't needed if no dataStatus values are present there?

Maybe a "gene name null reason" is needed instead?

wwhsiao commented 3 years ago

I believe what we decided is that if a field is not compatible with string values used to specify reason for missing values, then the portal will create a separate field to accommodate that. So in the case of null value for “diagnostic pcr Ct value”, the field will have the reason for null in the DH and then the field will be split into 0 for the “diagnostic pcr Ct value” and the acceptable missing values will be put in the "diagnostic pcr Ct value null reason". For fields that can accommodate missing values such as gene name, the field will either have the proper gene name or the acceptable missing values.

From: Damion Dooley @.> Reply-To: Public-Health-Bioinformatics/DataHarmonizer @.> Date: Monday, August 30, 2021 at 6:39 PM To: Public-Health-Bioinformatics/DataHarmonizer @.> Cc: Subscribed @.> Subject: Re: [Public-Health-Bioinformatics/DataHarmonizer] CanCOGeN template export to VirusSeq Data Portal format (#199)

"diagnostic pcr Ct value null reason" was set up to strip text from "diagnostic pcr Ct value 1" in order to prevent database errors, right? So primary function is to remove words like Missing from it, if user entered them by hand I guess. I wouldn't want "gene name" dataStatus stuff showing in it. The other possibility is that VirusSeq "diagnostic pcr Ct value" isn't needed if no dataStatus values are present there?

Maybe a "gene name null reason" is needed instead?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/Public-Health-Bioinformatics/DataHarmonizer/issues/199#issuecomment-908827228, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAKZJCUZ7E24LIDLECPWYU3T7QXEPANCNFSM5BDI66IQ. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

griffie commented 3 years ago

OK, so I just road-tested what you've got on the Master branch and everything looks great so far.

For the Ct value issue, can we make any numerical value stay in the "diagnostic pcr Ct value" field (decimal most likely but possible a whole number), and any textual values go into "diagnostic pcr Ct value null reason"? (We don't provide a picklist of standard null values in the DH for this field because the field is optional and curators can leave this blank for the national database, but people may type null values in.)

If there's nothing entered "diagnostic pcr Ct value 1", then "diagnostic pcr Ct value" and "diagnostic pcr Ct value" should remain empty.

I was trying to be fancy before and I think I was overcomplicating things. The Portal doesn't want empty fields, but that's why we want curate the data (and add any missing null values) before it goes live, so let's not worry about it.

That sound good, Dames?

ddooley commented 3 years ago

I have adjusted Data Harmonizer so that any "null reason" field, if it is about a numeric field, will take on any textual values in the numeric field (i.e. anything that can't be interpreted as a number). That is now in master branch. This shifting of data only occurs on export from CanCoGen to the VirusSeq portal.

griffie commented 3 years ago

I just tested it and it worked like a charm! Can we do a new release?

ddooley commented 3 years ago

Will do!