Selection of assemblies into master table (category, filtering method)

rchikhi commented 4 years ago

Let's have a discussion on what is the bar each assembly has to pass, in order to make it into the master table.

We had categories (recalling from https://github.com/ababaian/serratus/issues/162):

A 1 contig of length x > 25 Kbp (perfect or near-perfect genomes)
B >1 contigs of total length x > 25 Kbp (need some work but possibly salvageable genomes)
C total length 5 kbp < x < 25 Kbp (incomplete genomes)
D >= 1 contigs not satisfying any of the above conditions
E 0 contigs

Now, independently of category, let's discuss filtering.

To facilitate discussion, I'll introduce the following shorthands:

gc for coronaspades.gene_clusters.fa, which is the HMM-guided and somewhat already filtered for CoV output of coronaSPAdes
gc.cv for coronaspades.gene_clusters.checkv_filtered.fa, which is the CheckV-filtered version of gc

(So, gc.cv is a subset of gc)

Note that all assemblies analyzed so far were gc.cv. In particular, catA-v3.txt is the list of categories A gc.cv's.

Recently, in our search to find deeper hits in vertebrates, I noticed that gc.cv files were often empty, but gc wasn't. This means that CheckV didn't find any CoV contigs, but maybe a CoV was still assembled. This motivated @asl to propose another criteria (in https://github.com/ababaian/serratus/issues/185): >= 2 BGC candidates, as determined by coronaSPAdes' own HMM detection, reported in bgc_candidates.txt. It is a somewhat obscure criteria to me (still based on HMMs) but the takeaway is that it is a different CoV filtering mechanism than CheckV.

Out of the 27k datasets I have assembled so far, here is the number of accessions where the gc has X BGC candidates:

In particular, 7646 accessions have >= 2 BGC candidates. This is in contrast with the 8465 accessions having a category A,B,C or D in gc.cv.

Those two methods of detecting CoV's (CheckV and BGC) somewhat agree but not fully.

Several questions:

Regardless of filtering, do we include in the master table only the catA's, or the catB, catC, catD too?
Shall we take only gc.cv's, or also include the gc with >=2 BGCs?

rcedgar commented 4 years ago

IMO we should do this: (1) include ALL assemblies where there is ANY evidence for Cov whatsoever, even if very weak, (2) include ALL relevant evidence for presence of Cov that can be reduced to a few numbers. We have a small number of assemblies, so size is not an issue; it will be reduced to smaller tables in various different ways for the paper.

rchikhi commented 4 years ago

regarding (1): very weak would mean even include >= 1 BGC, some of them are clearly spurious hits where the HMM matched only 3 a.a's. I'm fine with it, just mentioning it. So one possibility would be to take the union of all non-empty gc.cv (catA+B+C+D) and all >= 1 BGC hits.

rcedgar commented 4 years ago

Sensitivity should be maximized for the master table, regardless of how many FPs are included. Discard assemblies only if you are 100% certain they have no Cov. This approach does no harm AFAICS, while discarding Covs is potentially harmful. A later question is selecting subsets to show in a reduced table of putative Covs.

rchikhi commented 4 years ago

Fine by me! @ababaian, @asl ?

rcedgar commented 4 years ago

Including FPs is positively useful because we can use this information to validate the classifier, show the top end of the ROC curve conceptually at least. Another field in the master table should be the nt & protein classifier scores.

ababaian commented 4 years ago

So the issue with those HMM hits is that things like RdRP_1 are built on many viruses, this likely means there's a virus there and there's a good chance it's novel if but likely not a CoV, I'd say include all data into the master table, especially if it's a long contig that got assembled but we're in a gray area with what we're looking at. We'll need a supplementary table of each assembly and the gc hits that were present in it. We look at those but it's non-trivial where we draw the line. Let me think on it.

rchikhi commented 4 years ago

There is an additional complication. Suppose, for a given accession,

the gc.cv assembly is a single 15 Kbp contig
the gc >= 1 BGC hits correspond to 3 contigs (possibly including the 15 Kbp contig). Which is the assembly we should associate to this accession?

rchikhi commented 4 years ago

I'm tempted to go with gc.cv whenever it's non-empty, and fallback to gc >=1 BGC hits when gc.cv is empty. Did that in https://github.com/ababaian/serratus/issues/185

rchikhi commented 4 years ago

So the issue with those HMM hits is that things like RdRP_1 are built on many viruses, this likely means there's a virus there and there's a good chance it's novel if but likely not a CoV,

(sidenote: by 'BGC hits' I mean hits where BGC reports the "CoV" word explicitly, and not just any RdRP)

rchikhi commented 4 years ago

so I ended up doing the procedure from https://github.com/ababaian/serratus/issues/197#issuecomment-657448202 for the master table. The only caveat I see in this procedure is: if checkv selects only a small fraction of a cov genome, and BGC cov hits happen to have recovered the rest, then we would miss that rest.

ababaian / serratus

Selection of assemblies into master table (category, filtering method) #197