ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Using Serratus to find environmental bacteria harbouring antimicrobial resistance genes. #135

Closed bfjia closed 4 years ago

bfjia commented 4 years ago

Rationale: characterization of environmental antimicrobial resistance (AMR) genes is currently lacking because the potential bacterial pool is too large to sample. Using existing SRA data, if we find a homolog of a human AMR gene in the wild (i.e., in an environmental bacterial species), that would give some insight into where to begin characterizing environmental AMR, and might help delineate lateral gene transfer of these AMR genes from the environment into the clinic.

Attached here is a collection of ~3000 AMR genes mostly found in human pathogens. card3.09_nucleotide_homolog.modified.fasta.zip

How the FASTA file was generated:

  1. The FASTA was built from an existing AMR gene database, the "Comprehensive Antibiotic Resistance Database" (CARD; v3.0.9), available at: https://card.mcmaster.ca/download/0/broadstreet-v3.0.9.tar.bz2
  2. The original FASTA file is "nucleotide_fasta_protein_homolog_model.fasta".
  3. The headers were modified using the C# script attached below to follow the format ">Accession ID,Gene_Name,Bacteria_Species".

script.zip
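For readers without the C# script, the header rewrite in step 3 can be sketched in Python. This is not the attached script. It assumes the original CARD headers look like `>gb|ACCESSION|strand|start-end|ARO:id|GENE [Species]`, and that spaces are replaced with underscores; both are assumptions, and the function names are hypothetical.

```python
import re

# Assumed CARD-style header layout (not confirmed in this thread):
#   >gb|ACCESSION|strand|start-end|ARO:id|GENE [Species name]
CARD_HEADER = re.compile(
    r"^>gb\|([^|]+)\|[^|]*\|[^|]*\|ARO:\d+\|(.+?)\s*\[(.+)\]\s*$"
)


def reformat_header(line):
    """Rewrite one header as '>Accession,Gene_Name,Bacteria_Species'."""
    line = line.rstrip("\n")
    match = CARD_HEADER.match(line)
    if match is None:
        return line  # leave unrecognized headers unchanged
    accession, gene, species = match.groups()
    # Assumption: spaces become underscores so the header has no whitespace.
    fields = [
        accession,
        gene.strip().replace(" ", "_"),
        species.strip().replace(" ", "_"),
    ]
    return ">" + ",".join(fields)


def reformat_fasta(lines):
    """Yield FASTA lines with headers rewritten and sequence lines untouched."""
    for line in lines:
        line = line.rstrip("\n")
        yield reformat_header(line) if line.startswith(">") else line
```

Headers that do not match the assumed layout pass through unchanged, so a mismatch shows up as an unmodified header rather than a silent corruption.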

rcedgar commented 4 years ago

Love this idea! I clustered this dataset and aligned to our current mapping reference and no hits except ye olde DL231478.1 Recombinant raccoon pox viruses, so I don't think it will cause problems to add it. @ababaian -- try a test run?

bfjia commented 4 years ago

Awesome. What was the hit to DL231478.1? It's quite interesting to see an AMR gene aligning with a virus.

rcedgar commented 4 years ago

DL231478.1 is from a patent, it causes many false positives which are not Coronavirus so we mask it out. It's surely not a true virus sequence.

taltman commented 4 years ago

Good idea @imasianxd!

Well, if we're going to indulge in "while you're at it..." feature creep, I would amend @imasianxd's suggestion with putting in the Virulence Factor Database to scan as well:

http://www.mgc.ac.cn/VFs/Down/VFDB_setA_pro.fas.gz

Along with AMR genes, it is relevant to improving human health to understand the distribution of VFs, and understand HGT among host-associated or environmental prokaryotes.

My $0.02

ababaian commented 4 years ago

Robert, you're so damn quick! Can you please check the virulence factors? We can add those too, but please double and then triple check that it won't interfere with any CoV finding.

In addition, map human/pig/fish/chicken to these genes and check that there are no sticky sequences. We'll treat them like FLOM2.

Revision: cov3ma

I'm finishing off our vertebrates and am going to do 9000 "virome" samples next. I'll process these sequences starting the next run if they're available and checked by 2pm tomorrow.

I'm starting to feel like Serratus is a big sail-ship and we're really catching wind.

bfjia commented 4 years ago

Yeah! VFDB would be an amazing thing to include too. VFs fall into the same boat as AMR genes. Though I feel like we wouldn't necessarily include the entire database. It includes secretion systems that might get a lot of hits in SRA from different species and be difficult to analyse afterwards. What do you think, @taltman?

ababaian commented 4 years ago

I can make this easy: we only change one variable at a time, so let's add AMR because it should be clean. If you can mess around a bit with VFDB and think it through so it's not rushed, we'll add it.

bfjia commented 4 years ago

Yeah, that sounds good. I can open a separate issue for virulence factors and curate a set of VFs.

rcedgar commented 4 years ago

We should also add SSU for bacteria and fungi. Adding 16S (prokaryotes) and ITS (fungi) will add only ~40Mb to the mapping reference and will enable discovery of novel species. It will be a proof of concept of doing SSU metagenomics this way. We are running out of superlatives -- pan = one virus family, mega = all virus families, now what? I think we have to switch from Greek to Latin roots and go with super, ultra etc. Or we could go with the larger SI units because they're funnier: my exa says we shoulda do a yotta.

rcedgar commented 4 years ago

Don't worry about some fraction of reference sequences being hard to interpret because it's trivial to filter out those alignments in post-processing if needed. Better get this into production fast.
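As a rough illustration of that kind of post-processing filter, here is a minimal Python sketch that drops alignments to unwanted reference sequences from SAM text. It relies only on the SAM convention that the reference name (RNAME) is the third tab-separated column; the function name and the way the unwanted set is supplied are hypothetical and not part of the Serratus pipeline.

```python
def drop_references(sam_lines, unwanted):
    """Yield SAM lines, skipping alignments whose RNAME is in `unwanted`."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # keep header lines as they are
            continue
        fields = line.split("\t")
        # Column 3 (index 2) of a SAM alignment record is the reference name.
        if len(fields) > 2 and fields[2] in unwanted:
            continue
        yield line
```

For example, `drop_references(open("hits.sam"), {"DL231478.1"})` would stream every record except those aligned to the masked patent sequence, where `hits.sam` is a placeholder file name. Header `@SQ` entries for the dropped references are left in place.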

The things we do need to worry about are

  1. Large increase in number of alignments. Checking this requires biological knowledge and/or test runs on representative datasets.

  2. Sequence similarity between a high-priority target (Cov) and a bonus target (AMR) because this could dilute the high-priority signal. This can be done by aligning AMR to Cov; I own this step.
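A quick pre-screen for the second concern can be sketched without an aligner by looking for exact shared k-mers between the bonus and high-priority sequences. This is a cruder stand-in for the alignment step described above, not a replacement for it: it misses divergent similarity, and the function names and the default `k=16` are assumptions chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def flag_similar(bonus, priority, k=16, min_shared=1):
    """Return names of bonus sequences sharing at least `min_shared`
    exact k-mers with any priority sequence, on either strand."""
    priority_kmers = set()
    for seq in priority.values():
        priority_kmers |= kmer_set(seq, k)
        priority_kmers |= kmer_set(reverse_complement(seq), k)
    flagged = []
    for name, seq in bonus.items():
        if len(kmer_set(seq, k) & priority_kmers) >= min_shared:
            flagged.append(name)
    return flagged
```

Both arguments are dictionaries mapping sequence names to sequences. Anything the screen flags would still need a real alignment to judge whether it could dilute the high-priority signal.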

asl commented 4 years ago

How about looking for AMR genes in assemblies? For example, we could align NCBI-AMR HMMs to the assembly graph to extract putative gene sequences for further annotation and analysis.

rcedgar commented 4 years ago

I believe this has been implemented. Can we close the issue?