ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Finding environmental bacteria with virulence factors. #136

Closed bfjia closed 4 years ago

bfjia commented 4 years ago

Branching from issue #135. On top of AMR genes, scanning for virulence factors present in environmental bacteria would also be interesting.

TODO: the virulence factor database needs to be curated to remove non-pathogenic proteins and systems (e.g. secretion systems) to limit the potential pool of hits.
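One way the curation step could be sketched is a keyword filter over FASTA headers. This is only an illustration under assumptions: the header wording, the `EXCLUDE_KEYWORDS` list, and the helper names `parse_fasta` and `drop_by_keyword` are hypothetical and not taken from VFDB or from the serratus code base.

```python
# Hypothetical exclusion list; a real curation would need a reviewed set of terms.
EXCLUDE_KEYWORDS = ("secretion system",)

def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def drop_by_keyword(records, keywords=EXCLUDE_KEYWORDS):
    """Keep only records whose header contains none of the keywords (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    for header, seq in records:
        if not any(k in header.lower() for k in lowered):
            yield header, seq

if __name__ == "__main__":
    # Made-up example records, not real database entries.
    example = [
        ">vf1 toxin A", "ATGC",
        ">vf2 Type III secretion system protein", "GGCC",
        ">vf3 adhesin", "AAAA",
    ]
    for header, seq in drop_by_keyword(parse_fasta(example)):
        print(header, len(seq))
```

A header-based filter is crude: it depends entirely on how consistently the database annotates its entries, so any real exclusion list would still need manual review.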

rcedgar commented 4 years ago

What fraction of transcripts do you expect to be bacterial? We can handle large numbers of hits, but not if they are in the same ballpark as host. Ideally we would include a good set of marker genes such as 16S, COX2, ITS, BPHD, alkB, nifH etc. SSU is easy; I'm going to screen the 16S and ITS databases for Cov-related problems this morning. Edit -- ITS for fungi, obviously. The forgotten microbes.

ababaian commented 4 years ago

16S is going to explode the analysis, we're going to have hits everywhere all the time. Before we venture off into a LWIA let's see how AMR behaves.

rcedgar commented 4 years ago

Hits everywhere is good up to a point because these hits have scientific value. SSU is 0.03% of a bacterial genome, is it really going to blow up the number of alignments? If so, I would agree to punt for now. Can we run a yotta-genome test including SSU and AMR on the ~100 we did for FLOM screen? Edit -- I suppose it could be a much larger fraction of the transcripts. This is my first time with RNA-seq, learning as I go.
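For reference, the 0.03% figure can be reproduced with back-of-the-envelope arithmetic. The sizes below are assumptions for illustration (a ~1.5 kb 16S gene in a ~5 Mb genome, single copy), not numbers given in this thread, and the result describes genomic DNA only; as the edit above notes, the share of RNA-seq transcripts can be far higher.

```python
# Illustrative assumptions, not values stated in the thread.
SSU_LENGTH_BP = 1_500          # assumed length of one 16S rRNA gene
GENOME_LENGTH_BP = 5_000_000   # assumed bacterial genome size (~5 Mb)

def ssu_fraction(ssu_len: int, genome_len: int, copies: int = 1) -> float:
    """Return the fraction of the genome covered by `copies` SSU genes."""
    return copies * ssu_len / genome_len

if __name__ == "__main__":
    frac = ssu_fraction(SSU_LENGTH_BP, GENOME_LENGTH_BP)
    print(f"{frac:.2%}")  # 0.03%
```

Many bacteria carry several rRNA operon copies, so the `copies` argument scales the estimate accordingly.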

rcedgar commented 4 years ago

Ok, but I think we should do a 16S proof of concept batch for the paper. LHF for us.

ababaian commented 4 years ago

That's a good idea yes, what would be a good application / test case of ~10,000 samples?

rcedgar commented 4 years ago

Aim for diversity across as many different microbial ecosystems as possible, the main variables are host species and tissue type / feces or whatever, and environments: soil, seawater, windshield bug splatter, space station control surfaces etc. (I'm not making these last two up, there are 16S studies of both). Edit -- To save time, just fine to take a random subset of the SRA. Hits everywhere. Include DNA as well as RNA, though SSU is much harder to find in DNA as I have just learned.
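A random subset of roughly 10,000 runs could be drawn along these lines. This is a minimal sketch: the accession list is assumed to be already available as plain strings, the IDs in the example are placeholders rather than real SRA runs, and `sample_accessions` is a hypothetical helper name.

```python
import random

def sample_accessions(accessions, k, seed=0):
    """Return up to k accessions drawn uniformly at random without replacement."""
    pool = list(accessions)
    rng = random.Random(seed)  # fixed seed so the same subset can be regenerated
    return rng.sample(pool, min(k, len(pool)))

if __name__ == "__main__":
    # Placeholder IDs standing in for a real list of run accessions.
    pool = [f"RUN{i:06d}" for i in range(100_000)]
    subset = sample_accessions(pool, 10_000)
    print(len(subset))
```

Uniform sampling mirrors whatever the archive over-represents, so it trades the curated diversity described above for speed.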

bfjia commented 4 years ago

Maybe I should clarify: there are no 16S SSU sequences/transcripts in VFDB or the collection of AMR genes. By non-pathogenic proteins I mean virulence genes that are very abundant across both human and environmental bacterial species and would be very difficult to interpret in terms of lateral gene transfer.

rcedgar commented 4 years ago

My bad for confusing the issue, I wanted to add 16S to detect novel bacteria. This might be a good idea with DNA but not with RNA because there are too many ribosomal transcripts. Understood re. virulence genes. I have a set of vertebrate genomes I use for screening (human, bat, chicken, fish, pig), I can run against those if that would help.

ababaian commented 4 years ago

Good to close for now? We put this on the back-burner until we have more time to dedicate.

ababaian commented 4 years ago

Good idea, maybe we have time to do this next month. Closing for now.