similar score, very different assembly

macmanes commented 9 years ago

This is similar to https://github.com/Blahah/transrate/issues/146, so maybe I should have just included this comment there.

I have just used transrate to evaluate several assemblies, and in general I get very similar numbers. This is fine, but the assemblies are in fact very different. The "good" assemblies vary in size from 20k transcripts to 200k. The reference coverage ranges from 25% to 35%, and p_contigs_lowcovered from .09 to .71. All of them have optimal transrate scores between .38 and .41, which seems very similar.

Anyway, the main issue I'm hoping you'll comment on/address is whether very different assemblies should have very similar scores. I get that the information content of the assemblies is largely the same (thus the similar scores), but it seems like there should be some reduction in score for that information being hidden amongst all those other contigs.

Here is the transrate report.

assembly,n_seqs,smallest,largest,n_bases,mean_len,n_under_200,n_over_1k,n_over_10k,n_with_orf,mean_orf_percent,n90,n70,n50,n30,n10,gc,gc_skew,at_skew,cpg_ratio,bases_n,proportion_n,linguistic_complexity,fragments,fragments_mapped,p_fragments_mapped,good_mappings,p_good_mapping,bad_mappings,potential_bridges,bases_uncovered,p_bases_uncovered,contigs_uncovbase,p_contigs_uncovbase,contigs_uncovered,p_contigs_uncovered,contigs_lowcovered,p_contigs_lowcovered,contigs_segmented,p_contigs_segmented,CRBB_hits,n_contigs_with_CRBB,p_contigs_with_CRBB,rbh_per_reference,n_refs_with_CRBB,p_refs_with_CRBB,cov25,p_cov25,cov50,p_cov50,cov75,p_cov75,cov85,p_cov85,cov95,p_cov95,reference_coverage,score,optimal_score,cutoff
1.Trinity.fasta,65945,224,23761,62718930,951.07938,0,17063,64,18184,58.56562,330,911,1932,3207,5572,0.49122,0.00557,0.00271,1.40246,0,0,0.16099,232844421,194220776,0.83412,174232733,0.74828,19988043,42163,3150025,0.05022,15679,0.23776,2252,0.03415,5975,0.09061,5085,0.07711,19722,19722,0.29907,0.36645,14039,0.26086,12882,0.23936,11319,0.21032,9715,0.18051,9019,0.16758,8209,0.15253,0.25993,0.2796,0.41148,0.54107
2.Trinity.fasta,64991,224,23872,63602933,978.64217,0,17276,71,18110,57.91166,334,976,2035,3317,5686,0.49086,0.00804,0.00588,1.40187,0,0,0.164,232844421,194848218,0.83682,174947650,0.75135,19900568,41775,2743696,0.04314,14799,0.22771,1320,0.02031,5001,0.07695,4926,0.0758,19623,19623,0.30193,0.36461,13892,0.25812,12805,0.23793,11289,0.20976,9747,0.18111,9087,0.16884,8331,0.1548,0.25996,0.29538,0.4171,0.53387
3.Trinity.fasta,97693,224,23795,92681315,948.69965,0,23370,166,22034,54.3844,323,886,2070,3499,6227,0.48846,0.00623,0.00252,1.38938,0,0,0.15773,232844421,197113627,0.84655,177813353,0.76366,19300274,54652,6794303,0.07331,28802,0.29482,4105,0.04202,20253,0.20731,7481,0.07658,23846,23846,0.24409,0.44308,15359,0.28538,14236,0.26452,12732,0.23657,11254,0.20911,10649,0.19787,9850,0.18302,0.29847,0.27978,0.38962,0.54271
4.Trinity.fasta,95783,224,23060,92653236,967.32443,0,23493,164,21786,53.6321,325,938,2158,3574,6240,0.48774,0.0063,0.0016,1.38737,0,0,0.15997,232844421,197310533,0.84739,178282905,0.76567,19027628,53562,5740845,0.06196,26984,0.28172,2703,0.02822,18818,0.19646,7359,0.07683,23565,23565,0.24602,0.43786,15298,0.28425,14179,0.26346,12731,0.23655,11284,0.20967,10690,0.19863,9913,0.18419,0.29857,0.29555,0.39345,0.48729
5.Trinity.fasta,167978,224,23684,148394879,883.41854,0,34952,260,28309,48.99658,315,723,1848,3420,6302,0.48321,0.00668,0.00351,1.36307,0,0,0.14891,232844421,199409158,0.85641,180557965,0.77544,18851193,76640,14945077,0.10071,59803,0.35602,8675,0.05164,87521,0.52103,11262,0.06704,28852,28852,0.17176,0.53609,16793,0.31203,15695,0.29163,14202,0.26388,12640,0.23486,11973,0.22247,11188,0.20788,0.33546,0.29364,0.37819,0.43677
6.Trinity.fasta,167106,224,23630,149292197,893.39818,0,34832,316,27594,48.43238,318,730,1902,3511,6423,0.48319,0.00696,0.00176,1.36,0,0,0.14988,232844421,200288911,0.86018,181600202,0.77992,18688709,75534,14504157,0.09715,54699,0.32733,6792,0.04064,89531,0.53577,11191,0.06697,28118,28118,0.16826,0.52245,16648,0.30933,15567,0.28925,14134,0.26262,12725,0.23644,12094,0.22472,11340,0.21071,0.33426,0.31214,0.37761,0.0414
7.Trinity.fasta,240130,224,24337,202733846,844.26705,0,46935,367,32513,45.5315,315,656,1579,3201,6125,0.4793,0.00687,0.00088,1.34552,0,0,0.14458,232844421,200238303,0.85997,181987627,0.78158,18250676,91960,23123974,0.11406,98283,0.40929,14579,0.06071,172075,0.71659,14282,0.05948,32048,32048,0.13346,0.59548,17685,0.3286,16573,0.30794,15113,0.28081,13479,0.25045,12761,0.23711,11875,0.22065,0.35723,0.31067,0.40332,0.04983
8.Trinity.fasta,239868,224,24337,205515665,856.7865,0,47439,414,32221,44.98601,318,669,1630,3270,6219,0.47939,0.00752,0.00173,1.34307,0,0,0.146,232844421,201648856,0.86602,183068555,0.78623,18580301,90087,23833580,0.11597,92413,0.38527,12943,0.05396,174527,0.7276,14119,0.05886,31449,31449,0.13111,0.58435,17509,0.32533,16437,0.30541,14997,0.27866,13475,0.25038,12835,0.23848,12018,0.2233,0.35582,0.32807,0.41172,0.03878
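
To make the comparison concrete, the spread of a few columns from the report above can be computed with a short script. This is only a sketch: the column selection and the helper name `fold_ranges` are choices made for illustration, and the values are copied from the report rows as posted.

```python
import csv
import io

# Selected columns copied from the transrate report above (one row per assembly).
REPORT = """\
assembly,n_seqs,p_contigs_lowcovered,reference_coverage,score,optimal_score
1.Trinity.fasta,65945,0.09061,0.25993,0.2796,0.41148
2.Trinity.fasta,64991,0.07695,0.25996,0.29538,0.4171
3.Trinity.fasta,97693,0.20731,0.29847,0.27978,0.38962
4.Trinity.fasta,95783,0.19646,0.29857,0.29555,0.39345
5.Trinity.fasta,167978,0.52103,0.33546,0.29364,0.37819
6.Trinity.fasta,167106,0.53577,0.33426,0.31214,0.37761
7.Trinity.fasta,240130,0.71659,0.35723,0.31067,0.40332
8.Trinity.fasta,239868,0.7276,0.35582,0.32807,0.41172
"""


def fold_ranges(report_text):
    """Return {column: (min, max, max/min)} for every numeric column.

    `fold_ranges` is a hypothetical helper for this example, not part of transrate.
    """
    rows = list(csv.DictReader(io.StringIO(report_text)))
    ranges = {}
    for column in rows[0]:
        if column == "assembly":
            continue  # the only non-numeric column
        values = [float(row[column]) for row in rows]
        low, high = min(values), max(values)
        ranges[column] = (low, high, high / low)
    return ranges


ranges = fold_ranges(REPORT)
for column, (low, high, fold) in ranges.items():
    print(f"{column:22s} min={low:<10g} max={high:<10g} max/min={fold:.2f}")
```

On these rows, n_seqs varies almost fourfold and p_contigs_lowcovered more than ninefold, while optimal_score varies by only about 10%.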

macmanes commented 9 years ago

@cboursnell any thoughts on this while @Blahah is otherwise occupied?

blahah commented 9 years ago

@macmanes sorry for the delay, and for any more delays in the near future. Were all the assemblies from the same reads?

macmanes commented 9 years ago

No worries - I have experience with these issues as you know..

Assemblies 1 and 2, 3 and 4, 5 and 6, and 7 and 8 each used the same read set. Each of these sets was a smaller subset of a larger read set, and that larger set is what I used when running transrate.

tseemann commented 9 years ago

ping

macmanes commented 9 years ago

@Blahah Any chance we can work on figuring this out? Very different assemblies should, ideally, NOT have very similar scores. Assembly 8.Trinity.fasta was done with 100M PE reads, while 2.Trinity.fasta was done with a 10M subset of the 100M dataset. Their qualities are very different, yet the optimal scores are remarkably similar. The raw scores are more dissimilar, but actually not by all that much.

blahah commented 9 years ago

@macmanes I think what's happening is that the highest-expressed transcripts are being well assembled in all subsets. So they define the 'optimal assembly' because they account for the vast majority of the reads.

This suggests two things:

- your data is mostly those 'optimal assembly' transcripts
- the expression-related bias in the transrate score is a real problem (I knew it was there, but didn't understand how it would manifest)
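
The hypothesis above can be illustrated with a toy model. This is a sketch under loud assumptions, not transrate's actual scoring code: here an assembly score is taken to be the geometric mean of per-contig scores multiplied by the fraction of all reads assigned to those contigs, and the "optimal" score is the best value over cutoffs that drop low-scoring contigs. The function names, contig scores, and read counts are all invented for illustration.

```python
import math


def toy_score(contigs, total_reads):
    """Geometric mean of contig scores times the fraction of reads on those contigs.

    `contigs` is a list of (score, reads) pairs with score > 0.
    ASSUMPTION: a simplified stand-in, not transrate's exact formula.
    """
    if not contigs:
        return 0.0
    log_mean = sum(math.log(score) for score, _ in contigs) / len(contigs)
    reads_used = sum(reads for _, reads in contigs)
    return math.exp(log_mean) * reads_used / total_reads


def toy_optimal_score(contigs, total_reads):
    """Best toy_score over all cutoffs, keeping contigs with score >= cutoff."""
    cutoffs = sorted({score for score, _ in contigs})
    return max(
        toy_score([c for c in contigs if c[0] >= cutoff], total_reads)
        for cutoff in cutoffs
    )


TOTAL_READS = 120_000            # same read set used to evaluate both assemblies
core = [(0.6, 1000)] * 100       # well-assembled, highly expressed contigs
extras = [(0.1, 10)] * 900       # poorly supported, lowly expressed contigs

small_assembly = core
large_assembly = core + extras   # ten times as many contigs

optimal_small = toy_optimal_score(small_assembly, TOTAL_READS)
optimal_large = toy_optimal_score(large_assembly, TOTAL_READS)
print(f"small: {optimal_small:.3f}  large: {optimal_large:.3f}")
```

In this toy setting the cutoff discards the 900 extra contigs, and because they carry few reads, removing them costs almost nothing, so both assemblies end up with the same optimal score despite a tenfold difference in contig count.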

More in a few days on what I plan to do about it. It should be solved by diginorming the reads (if I'm right about the cause).
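
For readers unfamiliar with diginorm (digital normalization), here is a minimal sketch of the general idea: a read is kept only if the median count of its k-mers, among the reads kept so far, is below a cutoff. This is a simplified illustration, not the khmer implementation and not what transrate does; it ignores reverse complements, read pairing, and memory-efficient counting, and the function names, `k`, `cutoff`, and the example reads are assumptions chosen for the demo.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """All overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def diginorm(reads, k=4, cutoff=3):
    """Keep a read only while the median abundance of its k-mers is below `cutoff`.

    ASSUMPTION: toy version using exact in-memory counts.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        # Counter returns 0 for unseen k-mers without inserting them.
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


high = "ACGTTGCA"    # stands in for a highly expressed transcript
low = "GATTACAGGC"   # stands in for a lowly expressed transcript
reads = [high] * 20 + [low] * 2

kept = diginorm(reads)
print(f"kept {kept.count(high)} of 20 high-abundance reads, "
      f"{kept.count(low)} of 2 low-abundance reads")
```

The effect is to flatten coverage: redundant reads from abundant transcripts are dropped while reads from rare transcripts are retained, which is why it might reduce the influence of the highest-expressed transcripts on the score.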

macmanes commented 9 years ago

obviously, happy to provide reads and assemblies as needed.

blahah commented 9 years ago

Just an update to say we've figured out what we want our solution to look like, and @cboursnell is just prototyping it. It's quite a big deal to get it done right and efficiently (without slowing down transrate too much), so we're doing it carefully.
