COG-UK / dipi-group

Data integrity and pipeline integration working group

Sequences contain question marks #145

Closed Jeltje closed 2 years ago

Jeltje commented 2 years ago

Hi all, please let me know who else to notify if this is not the right channel.

We (at UCSC) download https://cog-uk.s3.climb.ac.uk/phylogenetics/latest/cog_all.fasta on a regular basis and I just noticed that 7,298 of the sequences contain '?' characters. Sometimes there are just a few, but in other cases it's pretty severe, such as

>England/OXON-FA6F00/2020
????????????????????????????????????????????????????????????????????????????????
????????????????????????????????????????????????????????????????????????????????
????????????????????????????????????????????????????????????????????????????????
?????????????????????????????????????????????????aacgagaaaacacacgtccaactcagtttgc
ctgttttacaggttcgcgacgtgctcgtacgtggctttggagactccgtggaggaggtcttatcagaggcacgtcaacat
cttaawgatggcacttgtggcttagtagaagttgaaa???????????????????????????????????????????
acgttcggatgctcgaactgcacctcatggtcatgttatggTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTC
GTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAG
AACGGTAATAAAGGAGCTGGTGGCCATagttacggcgccgatctaaagtcatttgacttaggcgacgagcttggcactga
tccttatgaagattttcaagaaaactggaacactaaacatagcagtggtgttacccgtgaactcatgcgtgagcttaacg
gaggggcatacACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGtgcattaaagaccttcta
gcacgtgctggtaaaGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCG
TGAACATGAGCATGAAATTGCTTGGTACACGGAACGTTCTGAAAAGAGCTATGAATTGCAGACACCTTTTGAAATTAAAT
????????????????????????????????????????????????????????????????aatcaagactattcaa

About 4,000 of these sequences are present in GISAID, and my spot check indicates that the question marks are simply omitted there; see for instance England/ALDP-9B5D55/2020 (EPI_ISL_572893). None of them are present in GenBank, likely because they fail GenBank's fastaedit validator (that's how I found them). Is ignoring the question marks the correct thing to do? Or should they be replaced with Ns? It matters a lot for our purposes.

It looks like all sequences come from these five codes:

ALDP
CAMC
MILK
OXON
QEUH

I'm attaching a list with all IDs.

cogQuestions.txt
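
A minimal sketch of how a tally like the one above could be reproduced: it scans a FASTA file, flags records whose sequence contains '?', and groups them by the code in the header. The function names (`read_fasta`, `site_code`, `tally_question_marks`) are hypothetical, and the header layout `country/CODE-ID/year` is an assumption based on the example record shown in this issue.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def site_code(header: str) -> str:
    """Return e.g. 'OXON' for 'England/OXON-FA6F00/2020'.

    Assumes the country/CODE-ID/year naming pattern; anything else
    is reported as 'UNKNOWN'.
    """
    parts = header.split("/")
    return parts[1].split("-")[0] if len(parts) > 1 else "UNKNOWN"


def tally_question_marks(lines: Iterable[str]) -> Counter:
    """Count records containing at least one '?', grouped by site code."""
    counts = Counter()
    for header, seq in read_fasta(lines):
        if "?" in seq:
            counts[site_code(header)] += 1
    return counts
```

With a local copy of the file, this could be run as `with open("cog_all.fasta") as handle: print(tally_question_marks(handle).most_common())`.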

rmcolq commented 2 years ago

I have been dealing with file corruption issues in CSV files over the last week, which have been making the pipeline fail. My first reaction is that these may be silently "handled" file corruptions showing up in the FASTA, but I will investigate.


rmcolq commented 2 years ago

No, I can confirm that the question marks are in the input sequence. I think these should perhaps be kicked out by ELAN as garbage? Datapipe replaces them with Ns during the alignment step, but there are so many ? positions that they then get filtered as low quality sequences anyway.

Short term I can replace the ? with N in datapipe so it doesn't filter through to the cog_all.fasta.
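
As a rough illustration of that short-term fix (not datapipe's actual code), the replacement can be a plain character substitution. The helper names below are hypothetical. Replacing '?' with 'N', rather than dropping it, keeps the sequence length and the coordinates of every downstream base unchanged, and the second helper shows how the share of missing positions could be measured before any quality filter is applied.

```python
def mask_question_marks(sequence: str) -> str:
    """Replace every '?' with 'N', leaving all other characters as-is."""
    return sequence.replace("?", "N")


def missing_fraction(sequence: str) -> float:
    """Fraction of positions that are 'N', 'n' or '?' (0.0 for empty input)."""
    if not sequence:
        return 0.0
    missing = sum(sequence.upper().count(ch) for ch in "N?")
    return missing / len(sequence)
```

For example, `mask_question_marks("??acgtW?")` gives a string of the same length with the three '?' characters turned into 'N'. No threshold is hard-coded here because the thread does not state the cut-off the low-quality filter uses.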

rambaut commented 2 years ago

We should reject anything coming in to datapipe with characters other than IUPAC nucleotide codes. It might be indicative of other issues upstream.
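
A sketch of what such a check might look like, assuming the accepted alphabet is the standard IUPAC nucleotide set (A, C, G, T, U plus the ambiguity codes R, Y, S, W, K, M, B, D, H, V, N) in either case. The function names and the `extra_allowed` parameter (for example to permit the gap character '-' in aligned input) are hypothetical and are not taken from ELAN or datapipe.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, upper case.
IUPAC_NUCLEOTIDES = frozenset("ACGTURYSWKMBDHVN")


def non_iupac_characters(sequence: str, extra_allowed: str = "") -> Counter:
    """Count characters that are neither IUPAC codes nor explicitly allowed."""
    allowed = IUPAC_NUCLEOTIDES | set(extra_allowed.upper())
    return Counter(ch for ch in sequence.upper() if ch not in allowed)


def is_acceptable(sequence: str, extra_allowed: str = "") -> bool:
    """True when the sequence contains only permitted characters."""
    return not non_iupac_characters(sequence, extra_allowed)
```

Returning the offending characters and their counts, rather than only a pass/fail flag, makes it easier to report back to the submitter what was wrong with a rejected sequence.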

rambaut commented 2 years ago

Just to confirm - the majority of these are from a consortium member (OXON) which is using a consensus calling pipeline that emits ? instead of N to denote missing bases.

SamStudio8 commented 2 years ago

> We should reject anything coming in to datapipe with characters other than IUPAC nucleotide codes. It might be indicative of other issues upstream.

Agreed, we did come up against this data previously (#38 highlighted it, but I think we were only made aware of the ? characters afterwards). This was scheduled to be discussed at the DWG (particularly as the supplier of these genomes was in that WG) but we got sidetracked. Since March @rmcolq and I have been handling non-IUPAC characters downstream in Datapipe/Asklepian.

SamStudio8 commented 2 years ago

Closed by https://github.com/SamStudio8/elan-nextflow/commit/4828e44e50d4859d8612d06e03e69ca30c1b2dc9