Open macmanes opened 9 years ago
@cboursnell any thoughts on this while @Blahah is otherwise occupied?
@macmanes sorry for the delay, and for any more delays in the near future. Were all the assemblies from the same reads?
No worries - I have experience with these issues as you know..
Assemblies 1 and 2, 3 and 4, 5 and 6, 7 and 8 used the same read sets. Each of these sets was a smaller subset of a larger read set, which is what I used during transrate.
ping
@Blahah Any chance we can work on figuring this out? Very different assemblies should, ideally, NOT have very similar scores. Assembly 8.Trinity.fasta was done with 100M PE reads, while 2.Trinity.fasta was done with a 10M subset of the 100M dataset. Their qualities are very different, yet the optimal scores are remarkably similar. The raw scores are more dissimilar, but actually not by all that much.
@macmanes I think what's happening is that the highest-expressed transcripts are being well assembled in all subsets. So they define the 'optimal assembly' because they account for the vast majority of the reads.
This suggests two things:
More in a few days on what I plan to do about it. It should be solved by diginorming the reads (if I'm right about the cause).
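A toy sketch of the hypothesis above, in case it helps make the effect concrete. This is not transrate's actual scoring formula: `read_weighted_score` is a hypothetical helper, and every read count and quality value is invented for illustration. It only shows that when a score is weighted by reads, a handful of highly expressed transcripts can swamp large differences among the rest.

```python
def read_weighted_score(contigs):
    """Return the read-weighted mean quality of an assembly.

    contigs: list of (reads_mapped, quality) pairs, quality in [0, 1].
    Toy stand-in for a read-driven score, NOT transrate's real formula.
    """
    total_reads = sum(reads for reads, _ in contigs)
    return sum(reads * quality for reads, quality in contigs) / total_reads


# Hypothetical numbers: 10 highly expressed transcripts take most of the reads
# and are assembled well in both the subset and the full assembly.
high_expr = [(900_000, 0.9)] * 10

# 5,000 low-expressed transcripts: poorly assembled from the read subset,
# well assembled from the full read set.
subset_assembly = high_expr + [(100, 0.2)] * 5_000
full_assembly = high_expr + [(100, 0.8)] * 5_000

subset_score = read_weighted_score(subset_assembly)
full_score = read_weighted_score(full_assembly)

print(f"subset: {subset_score:.3f}  full: {full_score:.3f}")
```

With these made-up numbers the low-expressed contigs differ four-fold in quality, yet the two scores differ by only about 0.03, because the low-expressed contigs carry roughly 5% of the reads.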
obviously, happy to provide reads and assemblies as needed.
Just an update to say we've figured out what we want our solution to look like, and @cboursnell is just prototyping it. It's quite a big deal to get it done right and efficiently (without slowing down transrate too much), so we're doing it carefully.
This is similar to https://github.com/Blahah/transrate/issues/146, so maybe I should have just included this comment there.
I have just used transrate to evaluate several assemblies, and in general get very similar numbers. This is fine, but the assemblies are in fact very different. The "good" assemblies vary in size from 20k transcripts to 200k. The reference coverage ranges from 25% to 35%, p_contigs_lowcovered from .9 to .71. All of them have optimal transrate scores of between .38 and .41, which seems very similar.
Anyway, the main issue I'm hoping you'll comment on/address is whether very different assemblies should have very similar scores. I get that the information content of the assemblies is largely the same (thus the similar scores), but it seems like there should be some reduction in score for that information being hidden amongst all those other contigs.
Here is the transrate report.