COG-UK / dipi-group

Data integrity and pipeline integration working group

"null" country in some INSDC submissions #212

Closed AngieHinrichs closed 9 months ago

AngieHinrichs commented 1 year ago

Not sure whether this is from your pipeline, but even if not, I bet you can find the right people's contact info quicker than I can.

I've come across several INSDC (ENA/GenBank/DDBJ) records with typical COG-UK comments like "COG_ACCESSION:COG-UK/NORT-YNBHPM9/NORT:2022-02-15_NB552678_NORTNXT102_HHFTT AFX3; COG_BASIC_QC:PASS; COG_HIGH_QC:PASS; COG_NOTE: ..." but "null" as the country within the UK:

FEATURES             Location/Qualifiers
     source          1..29850
...
                     /country="United Kingdom:null"

OW424753 OW469194 OW470260 OW472364 OW477906 OW504773 OW506810 OW513382 OW518352

Unfortunately, a grep of the data downloaded from NCBI finds more than 45k of these records. Breakdown by center:

  27254 NORT
   6077 NORW
   4769 ARCH
   3508 SCOT
   2409 LOND
   1146 CVR
     19 BHRT

If it would help I can make & attach a file of INSDC accessions and COG-UK IDs.
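
For anyone wanting to reproduce the grep and the per-center breakdown above, here is a minimal sketch in Python. It is not the command actually used in this thread. It assumes GenBank-style flatfile text with records terminated by `//`, and that the center code is the segment after the second slash in the `COG_ACCESSION` comment (as in the `NORT` example above). The function names `iter_records` and `tally_null_countries` are hypothetical, and a comment wrapped across lines before the center code would be reported as `unknown`.

```python
import re
from collections import Counter

# Pattern for the bad qualifier quoted in this issue.
NULL_COUNTRY = re.compile(r'/country="United Kingdom:\s*null"')
# Assumption: the center code follows the sample ID, e.g. COG-UK/NORT-.../NORT:
COG_CENTER = re.compile(r"COG_ACCESSION:COG-UK/[^/\s]+/([A-Z]+):")
ACCESSION = re.compile(r"^ACCESSION\s+(\S+)", re.MULTILINE)


def iter_records(text):
    """Yield each flatfile record; records end with a line containing '//'."""
    for chunk in text.split("\n//"):
        if chunk.strip():
            yield chunk


def tally_null_countries(text):
    """Return (Counter of center codes, list of accessions) for null-country records."""
    counts = Counter()
    accessions = []
    for record in iter_records(text):
        if not NULL_COUNTRY.search(record):
            continue
        acc = ACCESSION.search(record)
        center = COG_CENTER.search(record)
        accessions.append(acc.group(1) if acc else "unknown")
        counts[center.group(1) if center else "unknown"] += 1
    return counts, accessions
```

Calling `tally_null_countries(open(path).read())` on a downloaded flatfile would give the counts per center plus the list of affected accessions; for a very large download, reading record by record rather than loading the whole file would be the safer choice.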

nickloman commented 1 year ago

Thanks Angie, we will take a look at this and see what the cause is and whether it can be easily corrected.

AngieHinrichs commented 1 year ago

Thank you Nick!

BioWilko commented 1 year ago

Hi Angie, this is a case where we updated the original BioSamples (e.g. the first one you linked), but the changes have not been reflected in the flatfile you were looking at...

I've messaged a contact at ENA to discuss looking at these again, fingers crossed they will be fixed quickly!

AngieHinrichs commented 11 months ago

Spot-checking the ones I pasted above, those have been fixed. 🎉

No idea whether you've got time to look at these, but I still get 6419 "United Kingdom:null" countries in the latest data downloaded from NCBI, which is a lot better than 45k! And at least some of them also have "United Kingdom:null" in the ENA, for example:

https://www.ebi.ac.uk/ena/browser/view/OW402334
https://www.ebi.ac.uk/ena/browser/view/OW402658
https://www.ebi.ac.uk/ena/browser/view/OW402685
https://www.ebi.ac.uk/ena/browser/view/OW529587

BioWilko commented 11 months ago

Hi Angie, I had a quick look at these and, as far as I can tell, the sample records on ENA don't have null in any of those cases:

https://www.ebi.ac.uk/ena/browser/view/SAMEA13965645
https://www.ebi.ac.uk/ena/browser/view/SAMEA13965756
https://www.ebi.ac.uk/ena/browser/view/SAMEA13966428
https://www.ebi.ac.uk/ena/browser/view/SAMEA13965503

It does seem like some of the flatfiles still need updating though! Would you be able to share the list of IDs so I can pass them onto a contact at ENA?

AngieHinrichs commented 11 months ago

OK, good that the BioSample records don't have the nulls. Here's a TSV with the nucleotide accessions (OW*) and their corresponding BioSample accessions: nullInNucleotide.tsv.gz

Thanks!
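
A rough sketch of how a file like nullInNucleotide.tsv.gz could be loaded and grouped by BioSample before being passed on. The layout is an assumption: two tab-separated columns (nucleotide accession, then BioSample accession) with no header row. The actual file may differ, and `parse_pairs` and `read_pairs` are hypothetical helper names.

```python
import csv
import gzip
from collections import defaultdict


def parse_pairs(lines):
    """Group nucleotide accessions by BioSample accession.

    Assumes each line is '<nucleotide accession><TAB><BioSample accession>'.
    Blank lines and lines starting with '#' are skipped.
    """
    by_sample = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue
        nucleotide, biosample = row[0].strip(), row[1].strip()
        by_sample[biosample].append(nucleotide)
    return dict(by_sample)


def read_pairs(path):
    """Read a gzip-compressed TSV from disk and group it with parse_pairs."""
    with gzip.open(path, "rt", newline="") as handle:
        return parse_pairs(handle)
```

Grouping by BioSample makes it easy to spot samples with more than one affected nucleotide record, which may be useful when re-checking the list after ENA regenerates the flatfiles.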

BioWilko commented 11 months ago

Perfect, cheers Angie. I've sent this over to ENA, so hopefully this will be sorted soon!

BioWilko commented 10 months ago

I've just had a message from ENA saying that these should all be resolved. How does it look to you, @AngieHinrichs?

AngieHinrichs commented 9 months ago

Looks like they're all fixed now! 🎉 Thanks @BioWilko!

AngieHinrichs commented 9 months ago

Oh dear, once NCBI Virus and NCBI Datasets had propagated the ENA -> GenBank updates through to where I download files, my grep came up almost clean but for two stragglers:

https://www.ebi.ac.uk/ena/browser/view/OW416958 (SAMEA13964883)
https://www.ebi.ac.uk/ena/browser/view/OW462206 (SAMEA14003514)

OW416958 also has an incorrect sample collection date (2020 instead of BioSample's 2022-03-09).

Not at all urgent!