blahah / transrate

Understand your transcriptome assembly
http://hibberdlab.com/transrate

weird results (v1.0.3) #241

Open manuelsmendoza opened 4 years ago

manuelsmendoza commented 4 years ago

Hi @blahah and folks!

I'm evaluating different assemblies to build a reference transcriptome. I'm using a hybrid approach, i.e. combining long reads (PacBio) with short reads (Illumina). After running the assembly, I tried to evaluate it and the results are a little weird.

The total number of fragments/reads mapped to the assembly in the Read mapping metrics module was very low, only 29%, so I aligned the reads using another tool (bowtie2) and the result was much better (84%). Is something wrong with the TransRate metrics? Is it normal for 100% of contigs to be reported as low-covered and uncovered? Is it normal not to find any bridges?

Can I trust TransRate to remove all misassembled transcripts and continue the pipeline using only the "good" transcripts?
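
For readers wondering what "keeping only the good transcripts" amounts to, here is a minimal sketch of a score-based filter. It assumes a per-contig table with `contig_name` and `score` columns (the layout assumed here for TransRate's `contigs.csv`) and that a contig is kept when its score is at or above the reported optimal cutoff (0.014 in the log in this post). The function name, the demo rows, and the `>=` comparison are illustrative assumptions, not TransRate's own code. Because contig scores are derived from the read mappings, a filter like this is only as reliable as the mapping step behind it.

```python
import csv
import io


def split_by_cutoff(rows, cutoff):
    """Split contig rows into (kept, dropped) name lists by their score."""
    kept, dropped = [], []
    for row in rows:
        # Assumption: a contig is "good" when score >= cutoff.
        target = kept if float(row["score"]) >= cutoff else dropped
        target.append(row["contig_name"])
    return kept, dropped


# Tiny made-up table in the assumed contigs.csv layout (illustrative values only).
demo_csv = "contig_name,score\ncontig_a,0.300\ncontig_b,0.010\ncontig_c,0.014\n"
rows = list(csv.DictReader(io.StringIO(demo_csv)))

kept, dropped = split_by_cutoff(rows, cutoff=0.014)
print("kept:", kept)
print("dropped:", dropped)
```

With a real run, the rows would come from the per-contig output file rather than from the inline demo string.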

This issue is related to #220 and #208

TRANSRATE LOG

[ INFO] 2020-07-16 15:37:25 : Calculating contig metrics...
[ INFO] 2020-07-16 15:37:48 : Contig metrics:
[ INFO] 2020-07-16 15:37:48 : -----------------------------------
[ INFO] 2020-07-16 15:37:48 : n seqs                       198411
[ INFO] 2020-07-16 15:37:48 : smallest                         72
[ INFO] 2020-07-16 15:37:48 : largest                       18293
[ INFO] 2020-07-16 15:37:48 : n bases                   165212363
[ INFO] 2020-07-16 15:37:48 : mean len                     821.66
[ INFO] 2020-07-16 15:37:48 : n under 200                   11451
[ INFO] 2020-07-16 15:37:48 : n over 1k                     41481
[ INFO] 2020-07-16 15:37:48 : n over 10k                      137
[ INFO] 2020-07-16 15:37:48 : n with orf                    35726
[ INFO] 2020-07-16 15:37:48 : mean orf percent              43.54
[ INFO] 2020-07-16 15:37:48 : n90                             290
[ INFO] 2020-07-16 15:37:48 : n70                             878
[ INFO] 2020-07-16 15:37:48 : n50                            2041
[ INFO] 2020-07-16 15:37:48 : n30                            3533
[ INFO] 2020-07-16 15:37:48 : n10                            6370
[ INFO] 2020-07-16 15:37:48 : gc                             0.33
[ INFO] 2020-07-16 15:37:48 : bases n                      425802
[ INFO] 2020-07-16 15:37:48 : proportion n                    0.0
[ INFO] 2020-07-16 15:37:48 : Contig metrics done in 23 seconds
[ INFO] 2020-07-16 15:37:48 : Calculating read diagnostics...
[ INFO] 2020-07-16 15:55:07 : Read mapping metrics:
[ INFO] 2020-07-16 15:55:07 : -----------------------------------
[ INFO] 2020-07-16 15:55:07 : fragments                  47147400
[ INFO] 2020-07-16 15:55:07 : fragments mapped           13764007
[ INFO] 2020-07-16 15:55:07 : p fragments mapped             0.29
[ INFO] 2020-07-16 15:55:07 : good mappings              12549696
[ INFO] 2020-07-16 15:55:07 : p good mapping                 0.27
[ INFO] 2020-07-16 15:55:07 : bad mappings                1214311
[ INFO] 2020-07-16 15:55:07 : potential bridges                 0
[ INFO] 2020-07-16 15:55:07 : bases uncovered            70095137
[ INFO] 2020-07-16 15:55:07 : p bases uncovered              0.42
[ INFO] 2020-07-16 15:55:07 : contigs uncovbase             99185
[ INFO] 2020-07-16 15:55:07 : p contigs uncovbase             0.5
[ INFO] 2020-07-16 15:55:07 : contigs uncovered            198411
[ INFO] 2020-07-16 15:55:07 : p contigs uncovered             1.0
[ INFO] 2020-07-16 15:55:07 : contigs lowcovered           198411
[ INFO] 2020-07-16 15:55:07 : p contigs lowcovered            1.0
[ INFO] 2020-07-16 15:55:07 : contigs segmented             33054
[ INFO] 2020-07-16 15:55:07 : p contigs segmented            0.17
[ INFO] 2020-07-16 15:55:07 : Read metrics done in 1039 seconds
[ INFO] 2020-07-16 15:55:07 : Calculating comparative metrics...
[ INFO] 2020-07-16 15:57:05 : Comparative metrics:
[ INFO] 2020-07-16 15:57:05 : -----------------------------------
[ INFO] 2020-07-16 15:57:05 : CRBB hits                     33246
[ INFO] 2020-07-16 15:57:05 : n contigs with CRBB           33246
[ INFO] 2020-07-16 15:57:05 : p contigs with CRBB            0.17
[ INFO] 2020-07-16 15:57:05 : rbh per reference              1.01
[ INFO] 2020-07-16 15:57:05 : n refs with CRBB              14438
[ INFO] 2020-07-16 15:57:05 : p refs with CRBB               0.44
[ INFO] 2020-07-16 15:57:05 : cov25                          6721
[ INFO] 2020-07-16 15:57:05 : p cov25                         0.2
[ INFO] 2020-07-16 15:57:05 : cov50                          4760
[ INFO] 2020-07-16 15:57:05 : p cov50                        0.14
[ INFO] 2020-07-16 15:57:05 : cov75                          3351
[ INFO] 2020-07-16 15:57:05 : p cov75                         0.1
[ INFO] 2020-07-16 15:57:05 : cov85                          2901
[ INFO] 2020-07-16 15:57:05 : p cov85                        0.09
[ INFO] 2020-07-16 15:57:05 : cov95                          2318
[ INFO] 2020-07-16 15:57:05 : p cov95                        0.07
[ INFO] 2020-07-16 15:57:05 : reference coverage             0.16
[ INFO] 2020-07-16 15:57:05 : Comparative metrics done in 118 seconds
[ INFO] 2020-07-16 15:57:05 : -----------------------------------
[ INFO] 2020-07-16 15:57:26 : TRANSRATE ASSEMBLY SCORE     0.1066
[ INFO] 2020-07-16 15:57:26 : -----------------------------------
[ INFO] 2020-07-16 15:57:26 : TRANSRATE OPTIMAL SCORE      0.1308
[ INFO] 2020-07-16 15:57:26 : TRANSRATE OPTIMAL CUTOFF      0.014
[ INFO] 2020-07-16 15:57:26 : good contigs                 187971
[ INFO] 2020-07-16 15:57:26 : p good contigs                 0.95
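
The pattern reported in this thread (no potential bridges, every contig counted as uncovered and lowcovered) can be checked automatically when comparing several runs. The sketch below parses log lines in the format shown above into a dictionary and tests for that pattern. The regular expression and both function names are assumptions made for illustration; only the metric labels and the sample lines are taken from the log in this post.

```python
import re

# Matches lines such as "[ INFO] 2020-07-16 15:55:07 : potential bridges    0"
LOG_LINE = re.compile(
    r"^\[\s*INFO\]\s+\S+\s+\S+\s+:\s+(.+?)\s+(-?\d+(?:\.\d+)?)\s*$"
)


def parse_metrics(log_text):
    """Return {metric label: value} for every 'label  number' log line."""
    metrics = {}
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


def shows_reported_pattern(metrics):
    """True when the log shows the symptoms described in this issue."""
    return (
        metrics.get("potential bridges") == 0
        and metrics.get("p contigs uncovered") == 1.0
        and metrics.get("p contigs lowcovered") == 1.0
    )


# A few lines copied from the log above.
sample_log = """\
[ INFO] 2020-07-16 15:37:48 : n seqs                       198411
[ INFO] 2020-07-16 15:37:48 : n under 200                   11451
[ INFO] 2020-07-16 15:55:07 : p fragments mapped             0.29
[ INFO] 2020-07-16 15:55:07 : potential bridges                 0
[ INFO] 2020-07-16 15:55:07 : p contigs uncovered             1.0
[ INFO] 2020-07-16 15:55:07 : p contigs lowcovered            1.0
[ INFO] 2020-07-16 15:55:07 : Read metrics done in 1039 seconds
"""

metrics = parse_metrics(sample_log)
print(metrics)
print("reported pattern present:", shows_reported_pattern(metrics))
```

Lines that do not end in a number, such as the timing and separator lines, are skipped.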

BOWTIE2 ALIGNMENT STATS

94294800 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
89466567 + 0 mapped (94.88% : N/A)
94294800 + 0 paired in sequencing
47147400 + 0 read1
47147400 + 0 read2
80055734 + 0 properly paired (84.90% : N/A)
86651820 + 0 with itself and mate mapped
2814747 + 0 singletons (2.99% : N/A)
6178086 + 0 with mate mapped to a different chr
1736018 + 0 with mate mapped to a different chr (mapQ>=5)
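
The two reports count different units, which matters when comparing the 29% and 84% figures. TransRate counts fragments (read pairs), while the stats above are in the format of `samtools flagstat` output and count individual reads. The sketch below recomputes the rates from the numbers in this post; the variable and function names are illustrative only.

```python
def rate(part, total):
    """Fraction of `total` accounted for by `part`."""
    return part / total


# Values from the TransRate log (units: fragments, i.e. read pairs).
fragments = 47_147_400
fragments_mapped = 13_764_007

# Values from the alignment stats (units: individual reads).
reads_total = 94_294_800
reads_mapped = 89_466_567
reads_properly_paired = 80_055_734

# Each fragment contributes two reads, so the totals should agree.
same_input = reads_total == 2 * fragments

transrate_rate = rate(fragments_mapped, fragments)
mapped_rate = rate(reads_mapped, reads_total)
proper_pair_rate = rate(reads_properly_paired, reads_total)

print(f"same input size:           {same_input}")
print(f"TransRate fragments mapped: {transrate_rate:.4f}")
print(f"bowtie2 reads mapped:       {mapped_rate:.4f}")
print(f"bowtie2 properly paired:    {proper_pair_rate:.4f}")
```

The 84% quoted in the post corresponds to the properly paired rate, which is the closest counterpart to a fragment-level mapping rate.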

pmomadeira commented 2 years ago

Greetings,

I am also having a very similar problem. When using the most recent versions of transrate, the results always show the same set of issues. The number and percentage of mapped fragments are always low (<45%), there are no potential bridges (potential_bridges = 0, always), and values for uncovered and lowcovered bases and contigs are always very high (with the proportions for those being 1, i.e. 100%). This is similar to issue #243 as well.

Running the exact same samples and assembly through an older version of transrate does not reproduce these strange results: mappings are much higher (p_fragments_mapped >80%) and closer to the expected results.

I've tried multiple approaches, using transrate from the original repository, @abshah's fork (https://github.com/abshah/transrate), @dfmoralesb's fork (https://github.com/dfmoralesb/transrate) and even the conda repackage from @lmfaber (https://github.com/lmfaber/transrate_conda). Even when I'm able to avoid problems with dependencies like #240, the results still come back with these errors. I've attached a comparison between the results I get from different versions of the program.

A solution would be very much appreciated, since having to transfer the fq/fq.gz files between machines just to run transrate is a troublesome and time-consuming workaround.

Best regards, Pedro

TrateV101vsTrateV103-results.xlsx

blahah commented 2 years ago

@manuelsmendoza and @pmomadeira if you are willing to share the input data privately I can investigate what's happening.

Transrate doesn't consider PacBio reads, so it's not appropriate to evaluate a hybrid PacBio/Illumina assembly using only the Illumina reads: by design, your sequencing strategy creates an assembly that includes a lot of information not included in the short reads.

However, it's possible there's a bug or some non-obvious problem happening as well - in which case I'll happily debug.

pmomadeira commented 2 years ago

Greetings @blahah,

My data are regular Illumina reads and assemblies, so I'm not sure the hybrid PacBio/Illumina assembly is the reason for the issue. I'm using an rnaSPAdes-built assembly with paired-end read data. We have used it before in our lab and it worked fine, and in fact it still runs fine with an older installation of transrate v1.0.1.

I've been looking at some of the code in dfmoralesb's transrate fork and at your original version to see if I could solve some of the issues I've been having. I ended up creating a new fork/branch to test some changes and I was able to reach a solution (here: transratev1.0.4.1).

By modifying some files I was able to get transrate to work with salmon v1.7.0 and snap-aligner v2.0.1, while also replacing the deprecated trollop with optimist, and the results are now coming out as expected. Mapped contigs were slightly lower than in transrate 1.0.1, but it's only a 3-5% difference, which could simply be a result of differences between versions.
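
Since the behaviour discussed here seems to depend on which external tools transrate picks up, it can help to confirm which executables are actually on the `PATH` before comparing runs. The sketch below does only that. The executable names `salmon`, `snap-aligner` and `bowtie2` are assumptions about how the tools are installed, and the function name is hypothetical; checking the versions themselves is left out because each tool reports its version differently.

```python
import shutil


def locate_tools(names):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


# Executable names are assumptions; adjust them to match the local installation.
for tool, path in locate_tools(["salmon", "snap-aligner", "bowtie2"]).items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

If two machines give different results, comparing this output between them is a quick first check that they resolve to the same installations.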

I'm still quite new to this and I'm not sure whether I introduced any errors or unintended changes to the program, so I would be glad if you could check out my branch.

Best regards, Pedro

sericomyxa commented 2 years ago

I am also getting the same results with transrate 1.0.3: bridges=0 and p contigs uncovered/lowcovered=1. @blahah, would it be possible to make the old builds available again? Bintray is no longer in use.

pmomadeira commented 2 years ago

Hi @sericomyxa. I wasn't able to solve the issue using transrate 1.0.3, but I did create a new fork that solved the bridges and mapping issues. Check it out here (https://github.com/pmomadeira/transrate) and see if it solves your issue. Hope it helps!