Closed griffie closed 3 years ago
Awaiting final client feedback on this one before implementing change.
UPDATE: We need 2 new fields to be added which are not in the CanCOGeN template:
study_id The study_id will be filled in by the provider, the field just needs to be provided in the export.
fasta header name The value in the "isolate" field should copy over to the "fasta header name".
We need to break up the numerical/date fields for a few fields. New fields required are: sample collection date null reason host age null reason diagnostic pcr Ct value null reason
How data should be transformed from the CanCOGeN template to the VirusSeq Data Portal export:
The export should deliver the fields in the following order: study_id specimen collector sample ID gisaid accession sample collected by sequence submitted by sample collection date sample collection date null reason geo_loc_name (country) geo_loc_name (state/province/territory) organism isolate fasta header name purpose of sampling purpose of sampling details anatomical material anatomical part body product environmental material environmental site collection device collection method host (scientific name) host disease host age host age null reason host age bin host age unit host gender purpose of sequencing purpose of sequencing details sequencing instrument sequencing protocol consensus sequence software name consensus sequence software version raw sequence data processing method dehosting method breadth of coverage depth of coverage reference genome accession bioinformatics protocol gene name diagnostic pcr Ct value diagnostic pcr ct value null reason
Thanks!
It looks like this is the current list of fields from their API. I presume we align with their field names & case, but do ordering as you have done above?
study_id specimen collector sample ID sample collected by sequence submitted by sample collection date sample collection date null reason geo_loc_name (country) geo_loc_name (state/province/territory) organism isolate fasta header name purpose of sampling purpose of sampling details anatomical material anatomical part body product environmental material environmental site collection device collection method host (scientific name) host disease host age host age null reason host age unit host age bin host gender purpose of sequencing purpose of sequencing details sequencing instrument sequencing protocol raw sequence data processing method dehosting method consensus sequence software name consensus sequence software version breadth of coverage value depth of coverage value reference genome accession bioinformatics protocol gene name diagnostic pcr Ct value GISAID accession diagnostic pcr Ct value null reason
And I can see their end is case sensitive to the above.
The comparison:
P.s. ordering on Viruseq side doesn't matter in terms of uploading.
Hmmm, their values should have mirrored our fields so it shouldn't be breadth of coverage value it should be breadth of coverage, but it's not worth getting them to change it and causing more delay at this point.
You're right, they are very particular about capitalization and everything has to be exactly as they prescribe. I've taken a look and all the fields are there in their API list so let's rock with they way they have it set out.
So this is mostly implemented on my dev version. One thing: VirusSeq "diagnostic pcr Ct value" is supposed receives data from CanCoGen "diagnostic pcr Ct value 1". VirusSeq has a field "diagnostic pcr Ct value null reason", but CanCoGen "diagnostic pcr Ct value 1" has no null value list. Should we still pick out values like Missing, etc. that might occur in "diagnostic pcr Ct value 1" and place them in "diagnostic pcr Ct value null reason"?
Right.
Ok, so we will assume that if they are not providing a CT value that they will not provide a gene name, so any null value found in "gene name" should be mapped to "diagnostic pcr Ct value null reason".
That work?
"diagnostic pcr Ct value null reason" was set up to strip text from "diagnostic pcr Ct value 1" in order to prevent database errors, right? So primary function is to remove words like Missing from it, if user entered them by hand I guess. I wouldn't want "gene name" dataStatus stuff showing in it. The other possibility is that VirusSeq "diagnostic pcr Ct value" isn't needed if no dataStatus values are present there?
Maybe a "gene name null reason" is needed instead?
I believe what we decided is that if a field is not compatible with string values used to specify reason for missing values, then the portal will create a separate field to accommodate that. So in the case of null value for “diagnostic pcr Ct value”, the field will have the reason for null in the DH and then the field will be split into 0 for the “diagnostic pcr Ct value” and the acceptable missing values will be put in the "diagnostic pcr Ct value null reason". For fields that can accommodate missing values such as gene name, the field will either have the proper gene name or the acceptable missing values.
From: Damion Dooley @.> Reply-To: Public-Health-Bioinformatics/DataHarmonizer @.> Date: Monday, August 30, 2021 at 6:39 PM To: Public-Health-Bioinformatics/DataHarmonizer @.> Cc: Subscribed @.> Subject: Re: [Public-Health-Bioinformatics/DataHarmonizer] CanCOGeN template export to VirusSeq Data Portal format (#199)
"diagnostic pcr Ct value null reason" was set up to strip text from "diagnostic pcr Ct value 1" in order to prevent database errors, right? So primary function is to remove words like Missing from it, if user entered them by hand I guess. I wouldn't want "gene name" dataStatus stuff showing in it. The other possibility is that VirusSeq "diagnostic pcr Ct value" isn't needed if no dataStatus values are present there?
Maybe a "gene name null reason" is needed instead?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/Public-Health-Bioinformatics/DataHarmonizer/issues/199#issuecomment-908827228, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAKZJCUZ7E24LIDLECPWYU3T7QXEPANCNFSM5BDI66IQ. Triage notifications on the go with GitHub Mobile for iOShttps://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Androidhttps://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
OK, so I just road-tested what you've got on the Master branch and everything looks great so far.
For the Ct value issue, can we make any numerical value stay in the "diagnostic pcr Ct value" field (decimal most likely but possible a whole number), and any textual values go into "diagnostic pcr Ct value null reason"? (We don't provide a picklist of standard null values in the DH for this field because the field is optional and curators can leave this blank for the national database, but people may type null values in.)
If there's nothing entered "diagnostic pcr Ct value 1", then "diagnostic pcr Ct value" and "diagnostic pcr Ct value" should remain empty.
I was trying to be fancy before and I think I was overcomplicating things. The Portal doesn't want empty fields, but that's why we want curate the data (and add any missing null values) before it goes live, so let's not worry about it.
That sound good, Dames?
I have adjusted Data Harmonizer so that any "null reason" field, if it is about a numeric field, will take on any textual values in the numeric field (i.e. anything that can't be interpreted as a number). That is now in master branch. This shifting of data only occurs on export from CanCoGen to the VirusSeq portal.
I just tested it and it worked like a charm! Can we do a new release?
Will do!
In the VirusSeq Data Portal export, numerical values (decimals, dates, integers etc) must be separated from categorial null values so that submissions can be validated. We need to create a few new fields for some additional null value fields and upon export, the DH will need to shunt numerical values into the appropriate field and null values into the null value field.
Can we have these 3 new fields to incorporate the null values in the VirusSeq export?
"diagnostic pcr ct value null reason"
"breadth of coverage null reason"
"depth of coverage null reason"
Can we also add a new field called "fasta header name"? It should be a string and should come from the "isolate" field in the CanCOGeN template.
Can we also have a new field be added in the export called "study_id"? This field should be left blank upon export and will be filled by the data submitter.
The export should deliver the fields in the following order: study_id specimen collector sample ID gisaid accession sample collected by sequence submitted by sample collection date sample collection date null reason geo_loc_name (country) geo_loc_name (state/province/territory) organism isolate fasta header name purpose of sampling purpose of sampling details anatomical material anatomical part body product environmental material environmental site collection device collection method host (scientific name) host disease host age host age null reason host age bin host age unit host gender purpose of sequencing purpose of sequencing details sequencing instrument sequencing protocol consensus sequence software name consensus sequence software version raw sequence data processing method dehosting method gene name diagnostic pcr Ct value diagnostic pcr ct value null reason breadth of coverage value breadth of coverage null reason depth of coverage value depth of coverage null reason reference genome accession bioinformatics protocol
Thanks!