ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0
251 stars 33 forks source link

"Assemblability" score -- selecting datasets to assemble #170

Closed rcedgar closed 3 years ago

rcedgar commented 4 years ago

Moving this discussion from slack. @ababaian suggested assembling 1k datasets for selected ranges of sumzer score, say 90-100, 80-89, 70-79 etc. with the goal of measuring how many of these datasets can be assembled in each score range. This would convert sumzer score to probability that a Cov can be assembled, which would inform a decision about where to set a score cutoff.

The summarizer score is designed to predict whether the dataset contains a given virus family. Whether this virus can be assembled is a different question.

If the goal is to predict assemblability, then a different score can be designed to take relevant factors into account, in particular read depth. Other relevant factors include read length, paired vs. unpaired, identity, and sequencing technology.

We don't have assemblers for long read/high error-rate technologies (I-T, PacBio) so these would score zero for now.

There is probably some minimum read length for minia & coronaspades, we should find out what this so that those datasets score zero.

If identity is low, then coverage will be underestimated due to low sensitivity of bowtie2.

With these points in mind, a first try at an assemblability score (a-score) could be:

assembly_score = coverage*depth_factor*identity_factor.

The depth_factorwould be a function which grows quickly from the minimum (depth=1) to a good depth (say, depth=3) and grows slowly for depth>3. The identity_factorwould be 1 for high identities, then grow more quickly as the identity drops below 90% to compensate for the falling sensitivity of bowtie2, with values of say 3 or more below 80%. An a-score like this will be a better predictor of assemblability.

ababaian commented 4 years ago

One important note which has come up in going over the CoV+ classification data is that many "single-cell RNAseq" libraries are yielding CoV/Viral+ hits but have very poor coverage over the genome for the virus itself. It appears that this is a combination of high-error rate and PCR-duplications which is causing this.

SRR10338100

SUMZER_COMMENT=sra=SRR10338100,genome=cov3ma,date=200607-11:24;
family=Poxviridae;score=56;pctid=100;aln=111;glb=47;panlen=181133;cvg=_.oo..oOO._.o..____.o.O.O;top=NC_024446.1;topaln=15;toplen=306862;topname=Penguinpox virus isolate PSan92;
family=Herpesviridae;score=44;pctid=100;aln=104;glb=16;panlen=148687;cvg=__O._oo__.__O__Oo_oO.O__O;top=NC_002665.1;topaln=5;toplen=108873;topname=Bovine herpesvirus 4 long unique region;
family=Coronaviridae;score=40;pctid=100;aln=99;glb=6;panlen=30000;cvg=._.O__...__O..Oo.o___oO._;top=JX548304.1;topaln=6;toplen=13216;topname=Shorebird coronavirus isolate 85 RNA polymerase gene, partial cds;

...
acc=NC_006998.1;pctid=99.9;aln=45;glb=44;len=194711;cvgpct=24;len=194711;depth=0.0223;cvg=_ooOo.OO__o._.___..O_Ooo.;fam=Poxviridae;name=Vaccinia virus;
acc=NC_014969.1;pctid=100.0;aln=4;glb=0;len=44055;cvgpct=4;len=44055;depth=0.00209;cvg=________________________O;fam=Adenoviridae;name=Fowl adenovirus E;
acc=NC_038276.1;pctid=100.0;aln=4;glb=0;len=10881;cvgpct=4;len=10881;depth=0.00846;cvg=___________O_____________;fam=Rhabdoviridae;name=Nishimuro virus viral cRNA for hypothetical proteins;
acc=MG764156.1;pctid=100.0;aln=3;glb=0;len=13195;cvgpct=4;len=13195;depth=0.00523;cvg=O________________________;fam=Coronaviridae;name=Avian coronavirus isolate CoV-9698-2016/11/28-BP/SA orf1ab polyprotein (orf1ab) gene, partial cds;
acc=MG916904.1;pctid=100.0;aln=7;glb=0;len=28751;cvgpct=4;len=28751;depth=0.0056;cvg=______________O__________;fam=Coronaviridae;name=Bat coronavirus BtCoV/Rh/YN2012 isolate BtCoV/Rh/YN2012_Ra13591, complete genome;
acc=NC_036582.1;pctid=100.0;aln=4;glb=0;len=293123;cvgpct=4;len=293123;depth=0.000314;cvg=_____________________O___;fam=Poxviridae;name=Flamingopox virus FGPVKD09;
acc=AJ575397.1;pctid=100.0;aln=5;glb=0;len=20509;cvgpct=4;len=20509;depth=0.00585;cvg=O________________________;fam=Coronaviridae;name=Infectious bronchitis virus partial s1 gene for spike glycoprotein, S1 subunit, genomic RNA, isolate RF/17/98;
acc=NC_004065.1;pctid=100.0;aln=11;glb=0;len=230278;cvgpct=4;len=230278;depth=0.00114;cvg=__________________..___O_;fam=Herpesviridae;name=Murid herpesvirus 1;
acc=NC_020474.2;pctid=100.0;aln=4;glb=0;len=180421;cvgpct=4;len=180421;depth=0.000532;cvg=____O____________________;fam=Herpesviridae;name=Elephantid herpesvirus 1;
acc=NC_022613.1;pctid=100.0;aln=6;glb=0;len=43686;cvgpct=4;len=43686;depth=0.00316;cvg=____O____________________;fam=Adenoviridae;name=Turkey adenovirus 5 isolate 1277BT;
acc=NC_006879.1;pctid=100.0;aln=5;glb=0;len=34450;cvgpct=4;len=34450;depth=0.00334;cvg=_________________O_______;fam=Adenoviridae;name=Simian adenovirus 1;
acc=NC_001731.1;pctid=100.0;aln=7;glb=0;len=190289;cvgpct=4;len=190289;depth=0.000846;cvg=_________o_O_____________;fam=Poxviridae;name=Molluscum contagiosum virus subtype 1;
acc=NC_003310.1;pctid=100.0;aln=8;glb=0;len=196858;cvgpct=4;len=196858;depth=0.00098;cvg=____________________O____;fam=Poxviridae;name=Monkeypox virus Zaire-96-I-16;
acc=NC_027016.

Negative filter of scRNAseq

I've added a scRNA-seq SraRunInfo.csv which we can use to seperate scRNAseq (which are likely no good for assembly) from other RNAseq libraries. This is available at s3://lovelywater/sra/scRNA_SraRunInfo.csv

rchikhi commented 4 years ago

some single cell RNAseqs may only sequence the 3-prime end of transcripts, and some others are full-length, see https://www.frontiersin.org/files/Articles/441123/fgene-10-00317-HTML/image_m/fgene-10-00317-t001.jpg

ababaian commented 4 years ago

Even if it's not 3' end scRNAseq, do you really think we can assemble them?

rchikhi commented 4 years ago

if it's full-length and has CoV hits, it's worth a try. EDIT: enough cov hits :)

rcedgar commented 4 years ago

@ababaian it seems you/someone on the team implemented something like an "assemblabilty score" in R and generated lists of sets to be assembled after the initial 1k test, but I can't find the lists or the score implementation -- can you comment here what you did, then we can close this issue?

ababaian commented 4 years ago

So there's no good way to do this outright it looks like. Right now it's simply a filter based on score (>60), aligned reads (>200) per family. Assembly was cost-optimized a bit so I don't think we have to be as stingy, we can tolerate a poorer classifier and just assemble a bunch of junk. I'm still exploring the data a bit to figure out how to do this better. I think with a collection of CatA and CatB assemblies we can cross-reference the report data with assembly outcome and try and create a classifer based on that. This issue is far from closed, I haven't even begun to really address it.