Closed Jeltje closed 2 years ago
Over the last week I have been dealing with file corruption in CSV files, which has been making the pipeline fail. My first reaction is that what we are seeing in the FASTA may be file corruption that was silently "handled", but I will investigate.
-------- Original message --------
From: Jeltje
Date: 24/10/2021 02:01 (GMT+00:00)
To: COG-UK/dipi-group
Cc: Subscribed
Subject: [COG-UK/dipi-group] Sequences contain question marks (Issue #145)
Hi all, please let me know who else to notify if this is not the right channel.
We (at UCSC) download https://cog-uk.s3.climb.ac.uk/phylogenetics/latest/cog_all.fasta on a regular basis, and I just noticed that 7,298 of the sequences contain '?' characters. Sometimes there are just a few, but in other cases it's pretty severe, such as
England/OXON-FA6F00/2020
????????????????????????????????????????????????????????????????????????????????
????????????????????????????????????????????????????????????????????????????????
????????????????????????????????????????????????????????????????????????????????
?????????????????????????????????????????????????aacgagaaaacacacgtccaactcagtttgc
ctgttttacaggttcgcgacgtgctcgtacgtggctttggagactccgtggaggaggtcttatcagaggcacgtcaacat
cttaawgatggcacttgtggcttagtagaagttgaaa???????????????????????????????????????????
acgttcggatgctcgaactgcacctcatggtcatgttatggTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTC
GTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAG
AACGGTAATAAAGGAGCTGGTGGCCATagttacggcgccgatctaaagtcatttgacttaggcgacgagcttggcactga
tccttatgaagattttcaagaaaactggaacactaaacatagcagtggtgttacccgtgaactcatgcgtgagcttaacg
gaggggcatacACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGtgcattaaagaccttcta
gcacgtgctggtaaaGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCG
TGAACATGAGCATGAAATTGCTTGGTACACGGAACGTTCTGAAAAGAGCTATGAATTGCAGACACCTTTTGAAATTAAAT
????????????????????????????????????????????????????????????????aatcaagactattcaa
About 4,000 of these sequences are present in GISAID, and my spot check indicates the question marks are simply omitted there; see, for instance, England/ALDP-9B5D55/2020 (EPI_ISL_572893). None of them are present in GenBank, likely because they fail GenBank's fastaedit validator (that's how I found them). Is ignoring the question marks the correct thing to do? Or should they be replaced with Ns? It matters a lot for our purposes.
It looks like all sequences come from these five codes:
ALDP CAMC MILK OXON QEUH
I'm attaching a list with all IDs.
cogQuestions.txt (https://github.com/COG-UK/dipi-group/files/7403900/cogQuestions.txt)
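A count like the one reported above could be reproduced with something along these lines. This is only a rough sketch, not the script used for the attached list: the function names are hypothetical, and it assumes headers follow the Country/CODE-ID/Year layout seen in England/OXON-FA6F00/2020.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def site_code(header: str) -> str:
    """Extract the site code, assuming a Country/CODE-ID/Year header."""
    parts = header.split("/")
    return parts[1].split("-")[0] if len(parts) > 1 else "UNKNOWN"


def tally_question_marks(lines: Iterable[str]) -> Counter:
    """Count records whose sequence contains '?', grouped by site code."""
    counts = Counter()
    for header, seq in read_fasta(lines):
        if "?" in seq:
            counts[site_code(header)] += 1
    return counts
```

To run it over a downloaded file, pass an open file handle, e.g. `tally_question_marks(open("cog_all.fasta"))`; the file is read line by line, so it does not need to fit in memory.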
No, I can confirm that the question marks are in the input sequence. I think these should perhaps be kicked out by ELAN as garbage? Datapipe replaces them with Ns during the alignment step, but there are so many ? positions that the sequences then get filtered out as low quality anyway.
Short term I can replace the ? with N in datapipe so it doesn't filter through to the cog_all.fasta.
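To illustrate why replacing ? with N is not the same as omitting it, here is a minimal sketch (hypothetical helper names, not datapipe's actual code). Masking keeps every base at its original coordinate, whereas dropping the characters shortens the sequence and shifts everything downstream.

```python
def mask_question_marks(seq: str) -> str:
    """Replace each '?' with 'N' so the sequence length is unchanged."""
    return seq.replace("?", "N")


def drop_question_marks(seq: str) -> str:
    """Remove each '?' entirely, which shifts all downstream positions."""
    return seq.replace("?", "")


seq = "??acg?t"
masked = mask_question_marks(seq)
dropped = drop_question_marks(seq)

# Masking preserves coordinates; dropping does not.
print(masked, len(masked), masked.index("acg"))
print(dropped, len(dropped), dropped.index("acg"))
```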
We should reject anything coming in to datapipe with characters other than IUPAC nucleotide codes. It might be indicative of other issues upstream.
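A check of this kind might look like the sketch below. It is an illustration under stated assumptions rather than anything datapipe currently does: the function names are hypothetical, matching is case-insensitive, and gap characters such as - are treated as invalid, which may or may not be the desired policy.

```python
from collections import Counter

# IUPAC nucleotide codes: the four bases plus U, the ambiguity codes, and N.
IUPAC_NUCLEOTIDES = frozenset("ACGTURYSWKMBDHVN")


def invalid_characters(seq: str) -> Counter:
    """Return a count of every character that is not an IUPAC nucleotide code."""
    return Counter(ch for ch in seq if ch.upper() not in IUPAC_NUCLEOTIDES)


def check_sequence(name: str, seq: str) -> None:
    """Raise ValueError if the sequence contains any non-IUPAC character."""
    bad = invalid_characters(seq)
    if bad:
        summary = ", ".join(f"{ch!r} x{n}" for ch, n in sorted(bad.items()))
        raise ValueError(f"{name}: non-IUPAC characters found: {summary}")
```

Reporting the offending characters and their counts, rather than only rejecting, would make it easier to trace the problem back to the upstream pipeline that produced them.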
Just to confirm - the majority of these are from a consortium member (OXON) which is using a consensus calling pipeline that emits ? instead of N to denote missing bases.
We should reject anything coming in to datapipe with characters other than IUPAC nucleotide codes. It might be indicative of other issues upstream.
Agreed, we did come up against this data previously (#38 highlighted - but I think we were made aware of ? afterwards). This was scheduled to be discussed at the DWG (particularly as the supplier of these genomes was in that WG) but we got sidetracked. Since March @rmcolq and I have been handling non-IUPAC chars downstream in Datapipe/Asklepian.