Closed bfjia closed 4 years ago
Love this idea! I clustered this dataset and aligned it to our current mapping reference and got no hits except ye olde DL231478.1 (Recombinant raccoon pox viruses), so I don't think it will cause problems to add it. @ababaian -- try a test run?
Awesome. What was the hit to DL231478.1? It's quite interesting to see an AMR gene aligning with a virus.
DL231478.1 is from a patent, it causes many false positives which are not Coronavirus so we mask it out. It's surely not a true virus sequence.
Good idea @imasianxd!
Well, if we're going to indulge in "while you're at it..." feature creep, I would amend @imasianxd's suggestion with putting in the Virulence Factor Database to scan as well:
http://www.mgc.ac.cn/VFs/Down/VFDB_setA_pro.fas.gz
Along with AMR genes, it is relevant to improving human health to understand the distribution of VFs, and understand HGT among host-associated or environmental prokaryotes.
My $0.02
Robert, you're so damn quick! Can you please check the virulence factors? We can add those too, but please double and then triple check that it won't interfere with any CoV finding.
In addition, map human/pig/fish/chicken to these genes and check that there are no sticky sequences. We'll treat them like FLOM2.
cov3ma
I'm finishing off our vertebrates and am going to do 9000 "virome" samples next. I'll process these sequences starting the next run if they're available and checked by 2pm tomorrow.
I'm starting to feel like Serratus is a big sail-ship and we're really catching wind.
Yeah! VFDB would be an amazing thing to include too. VF falls into the same boat as AMR genes in term. Though I feel like we wouldn't necessarily include the entire database. It includes secretion systems that might get a lot of hits in SRA from different species and be difficult to analyse afterwards. What do you think, @taltman?
I can make this easy: we only change one variable at a time, so let's add AMR because it should be clean. If you can mess around a bit with VFDB and think it through so it's not rushed, we'll add it.
Yeah, that sounds good. I can open a separate issue for virulence factors and curate a set of VFs.
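For anyone picking up the VF curation, here is a minimal sketch of one way to drop records by header keyword (for example, secretion systems) before anything goes into the mapping reference. The header layout, the example records, and the function names are all hypothetical; the real VFDB headers would need to be inspected before choosing keywords.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_by_keyword(
    records: Iterable[Tuple[str, str]], keywords: List[str]
) -> List[Tuple[str, str]]:
    """Keep only records whose header contains none of the keywords (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [(h, s) for h, s in records if not any(k in h.lower() for k in lowered)]


# Hypothetical records, for illustration only (not real VFDB entries).
example = """>VF0001 (hypothetical) adhesin
MKKLA
>VF0002 (hypothetical) type III secretion system protein
MSTQ
LLV
""".splitlines()

kept = drop_by_keyword(read_fasta(example), ["secretion system"])
for header, seq in kept:
    print(f">{header}\n{seq}")
```

Keyword matching is crude and will miss records with unusual annotations, so a curated set would still want a manual pass.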
We should also add SSU for bacteria and fungi. Adding 16S (prokaryotes) and ITS (fungi) will add only ~40Mb to the mapping reference and will enable discovery of novel species. It will be proof of concept of doing SSU metagenomics this way. We are running out of superlatives -- pan = one virus family, mega = all virus families, now what? I think we have to switch from Greek to Latin roots and go with super, ultra etc. Or we could go with the larger SI units because they're funnier: my exa says we shoulda do a yotta.
Don't worry about some fraction of reference sequences being hard to interpret because it's trivial to filter out those alignments in post-processing if needed. Better get this into production fast.
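To make the "filter in post-processing" point concrete, here is a minimal sketch that drops alignments to unwanted reference sequences from SAM-format text. It relies only on the SAM convention that header lines start with "@" and that the third tab-separated column is the reference name; the reference names and the function name are made up for illustration, and this is not the pipeline's actual post-processing.

```python
from typing import Iterable, Iterator, Set


def filter_sam(lines: Iterable[str], excluded_refs: Set[str]) -> Iterator[str]:
    """Yield SAM lines, skipping alignments whose reference name is excluded."""
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 (index 2) of a SAM alignment line is RNAME.
        if len(fields) > 2 and fields[2] in excluded_refs:
            continue
        yield line


# Hypothetical SAM records, for illustration only.
sam = [
    "@SQ\tSN:ref_keep\tLN:100",
    "@SQ\tSN:ref_noisy\tLN:100",
    "read1\t0\tref_keep\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tref_noisy\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
]

filtered = list(filter_sam(sam, {"ref_noisy"}))
for line in filtered:
    print(line)
```

The sketch leaves the @SQ header for the excluded reference in place, which keeps the header consistent with the original reference at the cost of listing a sequence with no remaining alignments.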
The things we do need to worry about are:
Large increase in number of alignments. Checking this requires biological knowledge and/or test runs on representative datasets.
Sequence similarity between a high-priority target (Cov) and a bonus target (AMR), because this could dilute the high-priority signal. This can be checked by aligning AMR to Cov; I own this step.
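As a rough illustration of the second check, here is a sketch that flags bonus sequences sharing exact k-mers with the high-priority targets. To be clear, this is a crude exact-match screen swapped in for illustration, not the alignment described above; a real check would use an aligner. The sequences, names, and the choice of k are all assumptions.

```python
from typing import Dict, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence (non-ACGT characters are left as-is)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_hits(
    targets: Dict[str, str], bonus: Dict[str, str], k: int = 21
) -> Dict[str, int]:
    """Map each bonus sequence name to its count of distinct k-mers also found
    in any target (either strand). Names with zero shared k-mers are omitted."""
    target_kmers: Set[str] = set()
    for seq in targets.values():
        target_kmers |= kmers(seq, k)
        target_kmers |= kmers(revcomp(seq), k)
    hits: Dict[str, int] = {}
    for name, seq in bonus.items():
        shared = len(kmers(seq, k) & target_kmers)
        if shared:
            hits[name] = shared
    return hits


# Toy sequences, for illustration only; k is small because the sequences are short.
cov = {"cov_toy": "ACGTTGCAAGGCT"}
amr = {"amr_clean": "TTTTTTTTTT", "amr_sticky": "GGGTTGCAAGGG"}

hits = shared_kmer_hits(cov, amr, k=5)
print(hits)
```

An exact k-mer screen only catches near-identical stretches, so an empty result here would not rule out the weaker similarity an aligner could detect.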
I believe this has been implemented. Can we close the issue?
Rationale: currently characterization of environmental antimicrobial resistance genes is lacking as the potential bacterial pool is too large to sample. Using existing SRA data, if we find a homolog of human AMR genes in the wild (aka an environmental bacterial species), that would allow some insight into where to begin to characterize environmental AMR and might allow for potential delineation of lateral gene transfer of these AMR genes from the environment into the clinics.
Attached here is a collection of ~3000 AMR genes mostly found in human pathogens. card3.09_nucleotide_homolog.modified.fasta.zip
How the FASTA file was generated:
script.zip