Closed rchikhi closed 3 years ago
IMO we should do this: (1) include ALL assemblies where there is ANY evidence for Cov whatsoever, even if very weak, (2) include ALL relevant evidence for presence of Cov that can be reduced to a few numbers. We have a small number of assemblies, so size is not an issue; it will be reduced to smaller tables in various different ways for the paper.
regarding (1): very weak would mean even include >= 1 BGC, some of them are clearly spurious hits where the HMM matched only 3 a.a's. I'm fine with it, just mentioning it. So one possibility would be to take the union of all non-empty gc.cv
(catA+B+C+D) and all >= 1 BGC hits.
Sensitivity should be maximized for the master table, regardless of how many FPs are included. Discard assemblies only if you are 100% certain they have no Cov. This approach does no harm AFAICS, while discarding Covs is potentially harmful. A later question is selecting subsets to show in a reduced table of putative Covs.
Fine by me! @ababaian, @asl ?
Including FPs is positively useful because we can use this information to validate the classifier, show the top end of the ROC curve conceptually at least. Another field in the master table should be the nt & protein classifier scores.
So the issue with those HMM hits is that things like RdRP_1 are built on many viruses, this likely means there's a virus there and there's a good chance it's novel if but likely not a CoV, I'd say include all data into the master table, especially if it's a long contig that got assembled but we're in a gray area with what we're looking at. We'll need a supplementary table of each assembly and the gc
hits that were present in it. We look at those but it's non-trivial where we draw the line. Let me think on it.
There is an additional complication. Suppose, for a given accession,
gc.cv
assembly is a single 15 Kbp contiggc
>= 1 BGC hits correspond to 3 contigs (possibly including the 15 Kbp contig).
Which is the assembly we should associate to this accession?I'm tempted to go with gc.cv
whenever it's non-empty, and fallback to gc
>=1 BGC hits when gc.cv
is empty. Did that in https://github.com/ababaian/serratus/issues/185
So the issue with those HMM hits is that things like RdRP_1 are built on many viruses, this likely means there's a virus there and there's a good chance it's novel if but likely not a CoV,
(sidenote: by 'BGC hits' I mean hits where BGC reports the "CoV" word explicitly, and not just any RdRP)
so I ended up doing the procedure from https://github.com/ababaian/serratus/issues/197#issuecomment-657448202 for the master table. The only caveat I see in this procedure is: if checkv selects only a small fraction of a cov genome, and BGC cov hits happen to have recovered the rest, then we would miss that rest.
Let's have a discussion on what is the bar each assembly has to pass, in order to make it into the master table.
We had categories (recalling from https://github.com/ababaian/serratus/issues/162):
A
1 contig of length x > 25 Kbp (perfect or near-perfect genomes)B
>1 contigs of total length x > 25 Kbp (need some work but possibly salvageable genomes)C
total length 5 kbp < x < 25 Kbp (incomplete genomes)D
>= 1 contigs not satisfying any of the above conditionsE
0 contigsNow, independently of category, let's discuss filtering.
To facilitate discussion, I'll introduce the following shorthands:
gc
forcoronaspades.gene_clusters.fa
, which is the HMM-guided and somewhat already filtered for CoV output of coronaSPAdesgc.cv
forcoronaspades.gene_clusters.checkv_filtered.fa
, which is the CheckV-filtered version ofgc
(So,
gc.cv
is a subset ofgc
)Note that all assemblies analyzed so far were
gc.cv
. In particular,catA-v3.txt
is the list of categories Agc.cv
's.Recently, in our search to find deeper hits in vertebrates, I noticed that
gc.cv
files were often empty, butgc
wasn't. This means that CheckV didn't find any CoV contigs, but maybe a CoV was still assembled. This motivated @asl to propose another criteria (in https://github.com/ababaian/serratus/issues/185): >= 2 BGC candidates, as determined by coronaSPAdes' own HMM detection, reported inbgc_candidates.txt
. It is a somewhat obscure criteria to me (still based on HMMs) but the takeaway is that it is a different CoV filtering mechanism than CheckV.Out of the 27k datasets I have assembled so far, here is the number of accessions where the
gc
has X BGC candidates:In particular, 7646 accessions have >= 2 BGC candidates. This is in contrast with the 8465 accessions having a category A,B,C or D in
gc.cv
.Those two methods of detecting CoV's (CheckV and BGC) somewhat agree but not fully.
Several questions:
gc.cv
's, or also include thegc
with >=2 BGCs?