W-L / ProblematicSites_SARS-CoV2


New masks to consider due to amplicon 64 issues #15

Open theosanderson opened 2 years ago

theosanderson commented 2 years ago

I belatedly saw this message from @BioWilko: https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473/17

Hi, in regards to our post about erroneous mutations in ARTIC V4/4.1, we have now discovered a significantly higher number of affected genomes when ambiguous bases are considered (31,567 in COG-UK dataset). Assuming sequencing centres use the updated versions of the scheme BED file, new sequences should not be affected, but I think you should consider adding the following positions to the problematic sites mask: 19209 G/K, 19210 G/R, 19212 G/R, 19214 G/R, 19217 A/M.

I agree that it makes sense to add these to the mask - I can see some issues on the UShER tree that result from these (not hundreds, but tens) [@angiehinrichs for info]
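For anyone who wants to apply these five positions by hand before the mask file is updated, here is a minimal sketch in Python. It assumes the sequence is already aligned to the reference with no insertions, so that string index + 1 equals the genome coordinate. The function name `mask_sites` and the toy all-`A` sequence are made up for illustration and are not part of this repository.

```python
# Positions proposed for masking in the quoted message (1-based genome coordinates).
AMPLICON_64_SITES = [19209, 19210, 19212, 19214, 19217]


def mask_sites(aligned_seq, sites, mask_char="N"):
    """Return a copy of aligned_seq with each 1-based site replaced by mask_char.

    Assumption: aligned_seq is in reference coordinates (no insertions
    relative to the reference), so position p lives at index p - 1.
    """
    bases = list(aligned_seq)
    for pos in sites:
        # Silently skip positions that fall outside the sequence.
        if 1 <= pos <= len(bases):
            bases[pos - 1] = mask_char
    return "".join(bases)


# Toy input: a made-up string of 'A' with the reference length (29,903), not a real genome.
toy = "A" * 29903
masked = mask_sites(toy, AMPLICON_64_SITES)

# Show positions 19209 through 19217 after masking.
print(masked[19208:19217])  # NNANANAAN
```

This masks the whole site regardless of which base is present; it does not try to detect the specific G/K, G/R or A/M patterns listed above.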

AngieHinrichs commented 2 years ago

+1

Thanks @theosanderson for the heads-up. Anecdotally, I've seen a few other sets of adjacent (or at least close) mutations that cause trouble in the Omicron branches of the tree, although I haven't got a nice analysis with evidence like @BioWilko's to explain them! I can provide lists of sequences in case anyone would like to take a look.

AngieHinrichs commented 1 year ago

Hi @LiXingguangBrandonStark -- I haven't used mask_alignment_using_vcf.py, nor did I write it (from the GitHub history it looks like @conorwalker is the main author), but if you cd to the ProblematicSites_SARS-CoV2/src/ directory and then run

python3 mask_alignment_using_vcf.py

it outputs brief usage instructions:

usage: mask_alignment_using_vcf.py [-h] [-m] [-c] [-b] [-d]
                                   [-n MASK_CHARACTER] [-r REFERENCE_ID] -v
                                   VCF -i INPUT_FASTA -o OUTPUT_FASTA
mask_alignment_using_vcf.py: error: the following arguments are required: -v/--vcf, -i/--input_fasta, -o/--output_fasta

(I use different tools that mask a VCF rather than a FASTA file, using the file problematic_sites_sarsCov2.vcf.)
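Going only by the usage message above, a call with the three required arguments might be assembled as in the sketch below. The alignment file names are hypothetical, the helper `build_mask_command` is not part of the repository, and the optional flags are left out because the usage message does not say what they do.

```python
def build_mask_command(vcf, input_fasta, output_fasta):
    """Build the argument list for the three required options (-v, -i, -o)."""
    return [
        "python3", "mask_alignment_using_vcf.py",
        "-v", vcf,
        "-i", input_fasta,
        "-o", output_fasta,
    ]


# Paths are relative to ProblematicSites_SARS-CoV2/src/.
# The two FASTA names are placeholders for your own files.
cmd = build_mask_command(
    "../problematic_sites_sarsCov2.vcf",
    "alignment.fasta",
    "alignment.masked.fasta",
)
print(" ".join(cmd))

# To actually run it from the src/ directory:
#   import subprocess
#   subprocess.run(cmd, check=True)
```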

W-L commented 1 year ago

Hi @LiXingguangBrandonStark! Did you clone this repository? (git clone https://github.com/W-L/ProblematicSites_SARS-CoV2.git) You can then find the vcf for masking sites at ./ProblematicSites_SARS-CoV2/problematic_sites_sarsCov2.vcf and the script to mask alignments in FASTA format at ./ProblematicSites_SARS-CoV2/src/mask_alignment_using_vcf.py, with usage instructions as posted by @AngieHinrichs (thank you!). If you encounter issues using the files, please feel free to open a new issue.

W-L commented 1 year ago

The vcf has a column named FILTER that gives a recommendation for each site: either mask it before performing downstream analyses, or keep it but interpret results cautiously because the site may have misleading effects. You can find more info about this in the original post on virological.org. The files in subset_vcf split the sites from the main vcf into these two categories.
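As a rough illustration of reading that column, the sketch below groups site positions by their FILTER value. It assumes the file is a standard tab-separated VCF (FILTER is the seventh column) and that the two recommendations are written literally as `mask` and `caution`; check the actual file before relying on that. The three records in `SAMPLE_VCF` are invented and do not come from problematic_sites_sarsCov2.vcf.

```python
from collections import defaultdict

# Invented records for illustration only; not taken from the real vcf.
SAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
REF\t100\t.\tA\t.\t.\tmask\t.
REF\t200\t.\tG\t.\t.\tcaution\t.
REF\t300\t.\tT\t.\t.\tmask\t.
"""


def sites_by_filter(vcf_text):
    """Group 1-based positions by the value in the FILTER column."""
    groups = defaultdict(list)
    for line in vcf_text.splitlines():
        # Skip blank lines, meta lines (##) and the header line (#CHROM).
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        pos, filter_value = int(fields[1]), fields[6]
        groups[filter_value].append(pos)
    return dict(groups)


groups = sites_by_filter(SAMPLE_VCF)
print(groups)  # {'mask': [100, 300], 'caution': [200]}
```

To use it on the real file, read the file's text and pass it in, e.g. `sites_by_filter(open("problematic_sites_sarsCov2.vcf").read())`.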