ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Selection of assemblies into master table (category, filtering method) #197

Closed rchikhi closed 3 years ago

rchikhi commented 4 years ago

Let's discuss what bar each assembly has to pass in order to make it into the master table.

We had categories (recalling from https://github.com/ababaian/serratus/issues/162):

Now, independently of category, let's discuss filtering.

To facilitate discussion, I'll introduce the following shorthands:

(So, gc.cv is a subset of gc)

Note that all assemblies analyzed so far were gc.cv. In particular, catA-v3.txt is the list of category A gc.cv's.

Recently, in our search for deeper hits in vertebrates, I noticed that gc.cv files were often empty, but gc wasn't. This means that CheckV didn't find any CoV contigs, but maybe a CoV was still assembled. This motivated @asl to propose another criterion (in https://github.com/ababaian/serratus/issues/185): >= 2 BGC candidates, as determined by coronaSPAdes' own HMM detection, reported in bgc_candidates.txt. It is a somewhat obscure criterion to me (still based on HMMs), but the takeaway is that it is a different CoV filtering mechanism than CheckV.

Out of the 27k datasets I have assembled so far, here is the number of accessions where the gc has X BGC candidates: image

In particular, 7646 accessions have >= 2 BGC candidates. This is in contrast with the 8465 accessions having a category A, B, C, or D in gc.cv.

Those two methods of detecting CoV's (CheckV and BGC) somewhat agree but not fully.

image
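One way to put numbers on "somewhat agree but not fully" is to compare the two sets of accessions directly. The following is a minimal sketch, assuming each filter has already been reduced to a set of accession IDs; the function name `compare_filters` and the toy IDs are made up for illustration and are not taken from the Serratus pipeline.

```python
def compare_filters(checkv_hits, bgc_hits):
    """Summarize the overlap between two sets of accession IDs.

    checkv_hits: accessions with a non-empty gc.cv (assumed input)
    bgc_hits:    accessions passing the BGC candidate threshold (assumed input)
    """
    checkv_hits, bgc_hits = set(checkv_hits), set(bgc_hits)
    both = checkv_hits & bgc_hits
    union = checkv_hits | bgc_hits
    return {
        "both": len(both),
        "checkv_only": len(checkv_hits - bgc_hits),
        "bgc_only": len(bgc_hits - checkv_hits),
        # Jaccard index: 1.0 means the two filters select identical sets.
        "jaccard": len(both) / len(union) if union else 0.0,
    }


# Toy accession IDs, for illustration only.
summary = compare_filters(
    {"SRR001", "SRR002", "SRR003"},
    {"SRR002", "SRR003", "SRR004"},
)
print(summary)
```

The `checkv_only` and `bgc_only` counts are the interesting part here: they are the accessions that one filter would drop and the other would keep.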

Several questions:

  1. Regardless of filtering, do we include in the master table only the catA's, or the catB, catC, catD too?
  2. Shall we take only gc.cv's, or also include the gc with >=2 BGCs?

rcedgar commented 4 years ago

IMO we should do this: (1) include ALL assemblies where there is ANY evidence for CoV whatsoever, even if very weak, (2) include ALL relevant evidence for presence of CoV that can be reduced to a few numbers. We have a small number of assemblies, so size is not an issue; it will be reduced to smaller tables in various different ways for the paper.

rchikhi commented 4 years ago

Regarding (1): "very weak" would mean including even >= 1 BGC candidate; some of those are clearly spurious hits where the HMM matched only 3 amino acids. I'm fine with it, just mentioning it. So one possibility would be to take the union of all non-empty gc.cv (catA+B+C+D) and all >= 1 BGC hits.
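A minimal sketch of that union, assuming the per-accession CheckV category and BGC candidate count have already been parsed into two dictionaries. The function `master_table_rows`, the row fields, and the toy accession IDs are hypothetical, not part of the actual pipeline.

```python
COV_CATEGORIES = {"A", "B", "C", "D"}


def master_table_rows(checkv_category, bgc_count):
    """Union of accessions with a non-empty gc.cv and accessions with >= 1 BGC hit.

    checkv_category: dict accession -> category letter from gc.cv (assumed input)
    bgc_count:       dict accession -> number of BGC candidates (assumed input)
    """
    rows = []
    for acc in sorted(set(checkv_category) | set(bgc_count)):
        category = checkv_category.get(acc)
        n_bgc = bgc_count.get(acc, 0)
        has_checkv = category in COV_CATEGORIES
        # Keep the accession if either filter shows any evidence at all.
        if has_checkv or n_bgc >= 1:
            rows.append({
                "accession": acc,
                "category": category if has_checkv else None,
                "n_bgc": n_bgc,
            })
    return rows


# Toy data, for illustration only.
rows = master_table_rows(
    {"SRR001": "A", "SRR002": "D"},
    {"SRR002": 3, "SRR003": 1, "SRR004": 0},
)
for row in rows:
    print(row)
```

Keeping both the category and the BGC count in each row means the weak hits stay visible and can be filtered out later without rebuilding the table.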

rcedgar commented 4 years ago

Sensitivity should be maximized for the master table, regardless of how many FPs are included. Discard assemblies only if you are 100% certain they have no CoV. This approach does no harm AFAICS, while discarding CoVs is potentially harmful. A later question is selecting subsets to show in a reduced table of putative CoVs.

rchikhi commented 4 years ago

Fine by me! @ababaian, @asl ?

rcedgar commented 4 years ago

Including FPs is positively useful because we can use this information to validate the classifier, show the top end of the ROC curve conceptually at least. Another field in the master table should be the nt & protein classifier scores.
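As a rough illustration of the ROC idea, here is a sketch that turns classifier scores into (FPR, TPR) points. It assumes a per-assembly score and a true/false label exist; how the labels would actually be obtained is not settled in this thread, so both inputs and the name `roc_points` are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points.

    One point per distinct score threshold, from strictest to most permissive,
    preceded by the (0, 0) point where nothing is called positive.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy scores and labels, for illustration only.
points = roc_points([0.9, 0.8, 0.4, 0.2], [True, True, False, True])
print(points)
```

Without the false positives in the table there would be no negatives to count, and the curve could not be drawn at all.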

ababaian commented 4 years ago

So the issue with those HMM hits is that things like RdRP_1 are built on many viruses. A hit likely means there's a virus there, and there's a good chance it's novel, but it's likely not a CoV. I'd say include all data in the master table, especially if it's a long contig that got assembled, but we're in a gray area with what we're looking at. We'll need a supplementary table of each assembly and the gc hits that were present in it. We can look at those, but it's non-trivial where we draw the line. Let me think on it.

rchikhi commented 4 years ago

There is an additional complication. Suppose, for a given accession,

rchikhi commented 4 years ago

I'm tempted to go with gc.cv whenever it's non-empty, and fall back to gc >=1 BGC hits when gc.cv is empty. Did that in https://github.com/ababaian/serratus/issues/185
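The fallback rule can be sketched as follows, assuming the contigs for one accession have already been read into two lists. The function `select_contigs`, the source labels, and the toy contig names are made up for illustration.

```python
def select_contigs(gc_cv_contigs, gc_bgc_contigs):
    """Pick the contigs to report for one accession.

    gc_cv_contigs:  contigs kept by CheckV (gc.cv), possibly empty
    gc_bgc_contigs: gc contigs with >= 1 BGC hit, possibly empty
    Returns (contigs, source), where source records which filter was used.
    """
    if gc_cv_contigs:
        # CheckV found something: use it and ignore the BGC hits entirely.
        return list(gc_cv_contigs), "gc.cv"
    if gc_bgc_contigs:
        # CheckV came back empty: fall back to the BGC hits.
        return list(gc_bgc_contigs), "gc.bgc"
    return [], "none"


# Toy contig names, for illustration only.
print(select_contigs(["contig_1"], ["contig_1", "contig_7"]))
print(select_contigs([], ["contig_7"]))
print(select_contigs([], []))
```

Note that in the first call `contig_7` is dropped even though it has a BGC hit, because the rule never mixes the two sources within an accession.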

rchikhi commented 4 years ago

> So the issue with those HMM hits is that things like RdRP_1 are built on many viruses, this likely means there's a virus there and there's a good chance it's novel but likely not a CoV,

(sidenote: by 'BGC hits' I mean hits where BGC reports the "CoV" word explicitly, and not just any RdRP)

rchikhi commented 4 years ago

So I ended up doing the procedure from https://github.com/ababaian/serratus/issues/197#issuecomment-657448202 for the master table. The only caveat I see in this procedure is: if CheckV selects only a small fraction of a CoV genome, and the BGC CoV hits happen to have recovered the rest, then we would miss that rest.