blahah / transrate

Understand your transcriptome assembly
http://hibberdlab.com/transrate

Transcriptome too large? #189

Closed jorvis closed 8 years ago

jorvis commented 8 years ago

My first run of transrate completed, but with errors in the snap portion like this:

Trying to use too many overflow entries. To index this genome, you either need a larger seed size or a larger location size.

Attaching the full run command and output. transrate.txt

blahah commented 8 years ago

Indeed, it looks as though there is too much information in the transcriptome to fit inside the SNAP index with the seed size transrate has chosen.

It looks like there's some weird stuff in this assembly - your longest contig is 166,145 bases which is unfeasibly long. And there are 2.2 billion bases, which is the size of a medium-large genome and much larger than any real transcriptome I've ever seen.

The fix for the SNAP error is probably for us to detect this error inside transrate and then increase the seed size until it works. I'll see if I can get this into the next release.

However, this won't fix the issues with the assembly - how was this generated?

jorvis commented 8 years ago

Yes, it's large. I'm in the process of evaluating different merge/reduction tools for transcriptomics data. That file is the combination of several different Trinity assemblies with Velvet/Oases ones with different parameters. Trying to compare here with other reduction tools like TGICL, EvidentialGene, etc.

rob-p commented 8 years ago

I'll just note here that (after fixing a kink in the 64-bit index) Salmon was able to process this giant transcriptome without too much trouble ;) --- yay fish!

blahah commented 8 years ago

@jorvis you might take a look at https://github.com/cboursnell/transfuse for merging/reduction - except transrate will need to be able to run on this transcriptome before transfuse can process it.

jorvis commented 8 years ago

@Blahah, thanks, I'd be happy to do that. Let me know when I should try a transrate update. Is the snap issue with the entire transcriptome size or would filtering out the silly-long transcripts perhaps fix the issue?

blahah commented 8 years ago

@jorvis it's probably to do with the very large number of transcripts, and what must be a huge amount of redundancy in the file. Quite likely you have many many copies of most transcripts, which means a read might have a very large number of equally likely candidate locations.

The 166,145 base sequence is certainly not a transcript - possibly it's a plastid genome (chloroplast?) or a contaminant. Or it's an artefact. Either way I'd say it's safe to remove it before transrating. Same for anything under the length of two reads put together - transrate ignores these anyway, but they do slow down the aligner. So removing those would help. However, it's the 618,723 contigs between 1 and 10k that are causing the major issue - one way forward would be to de-duplicate these before continuing. You could use CD-HIT-EST with a 100% ID cutoff, or VSEARCH with the same.
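As a rough illustration of the clean-up suggested above, here is a minimal Ruby sketch that drops contigs outside a length window and removes exact duplicate sequences. It is much simpler than CD-HIT-EST or VSEARCH: it only catches byte-identical sequences on the same strand (case-insensitive). The function names and the default thresholds are assumptions made for this example, not part of transrate.

```ruby
require 'digest'
require 'set'

# Yield [header, sequence] pairs from a FASTA stream, joining wrapped lines.
def each_fasta_record(io)
  header = nil
  seq = String.new
  io.each_line do |line|
    line = line.chomp
    if line.start_with?('>')
      yield header, seq unless header.nil?
      header = line[1..-1]
      seq = String.new
    else
      seq << line.strip
    end
  end
  yield header, seq unless header.nil?
end

# Keep contigs whose length is within [min_len, max_len] and drop exact
# duplicates. Thresholds are illustrative only. A digest of each sequence is
# stored instead of the sequence itself to limit memory use.
def filter_contigs(io, min_len: 200, max_len: 100_000)
  seen = Set.new
  kept = []
  each_fasta_record(io) do |header, seq|
    next if seq.length < min_len || seq.length > max_len
    key = Digest::SHA1.digest(seq.upcase)
    next if seen.include?(key)
    seen << key
    kept << [header, seq]
  end
  kept
end
```

Something like `File.open('assembly.fa') { |f| filter_contigs(f) }` would return the surviving records, where `assembly.fa` is a placeholder file name. For an assembly of this size it would probably be better to write each kept record out as it is found rather than collect them in an array.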

How long are the reads?

jorvis commented 8 years ago

OK, I'll look into de-duplication first. The reads are just around 100bp (have been processed with Trimmomatic)

jorvis commented 8 years ago

An update. I removed all transcripts under 200bp and over 100,000bp and then ran CD-HIT-EST to remove duplicates. This reduced the 2,027,284 transcripts to 1,796,079. Transrate failed on it again with the same "[ERROR] 2016-04-05 22:31:36 : Failed to build Snap index" message.

So I looked into the code and found this file where the snap index creation was happening:

transrate/lib/app/lib/transrate/snap.rb

I looked at the parameter options there, and found that the command line snap-aligner invocation was something like this:

transrate/bin/snap-aligner index A1_trinity_oases_merged.sizefiltered.nodups.fasta foo_snap_index -s 23 -t16 -bSpace -locationSize 4

I set my own thread and index names here, ran it, and got the same error. Then I increased locationSize from 4 to 5 and this time snap successfully built the index. It took about 10 minutes, and used a max of 52GB of RAM while doing it, but it built successfully. The Ruby script appears to be intended to try each locationSize from 4 to 8, but it doesn't seem to actually do this. The loop only moves on to a higher value if either the directory doesn't exist or the error matches one specific case, which doesn't seem to be the error I'm getting here.

I've created a pull request with a possible fix here which checks for the error message text I actually got.
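For readers following along, the retry behaviour described above can be sketched roughly as follows. This is not the code from `snap.rb` or from the pull request; it is a minimal illustration that assumes the overflow message quoted at the top of this thread is the signal to retry. `build_index_with_retry` and `run_snap_index` are hypothetical names, and the SNAP flags are the ones quoted in the command line above.

```ruby
require 'open3'

# Error text SNAP printed in this thread when the index did not fit.
OVERFLOW_ERROR = /too many overflow entries/i

# Try each location size in turn until one succeeds.
# The block is a hypothetical callable: it receives a location size and
# returns [success, output_text] for that attempt.
def build_index_with_retry(sizes = (4..8))
  sizes.each do |size|
    ok, output = yield(size)
    return size if ok
    # Retry only on the overflow error; anything else is a genuine failure.
    raise "SNAP index build failed: #{output}" unless output.match?(OVERFLOW_ERROR)
    warn "Index build failed with location size #{size}, trying a larger one"
  end
  raise "SNAP index build failed for every location size in #{sizes}"
end

# One indexing attempt, using the flags quoted earlier in this thread.
def run_snap_index(fasta, index_dir, size, threads: 16)
  cmd = ['snap-aligner', 'index', fasta, index_dir,
         '-s', '23', "-t#{threads}", '-bSpace', '-locationSize', size.to_s]
  output, status = Open3.capture2e(*cmd)
  # Treat the attempt as failed if SNAP reported the overflow, whatever the exit code.
  [status.success? && !output.match?(OVERFLOW_ERROR), output]
end
```

A call such as `build_index_with_retry { |n| run_snap_index('assembly.fa', 'assembly_index', n) }` would return the first location size that worked; both file names are placeholders. Whether SNAP signals this failure through its exit code, its output, or both is not established in this thread, which is why the sketch checks both.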

I don't see an option in transrate to use a pre-existing snap index, so I'm going to try manually building on the merged_assemblies file and making sure the name matches the expected convention so re-indexing is skipped when I re-run.

jorvis commented 8 years ago

Index creation seems to work fine now, but then it fails during the snap step. Going to sleep and think on it.

[ INFO] 2016-04-06 02:02:18 : Contig metrics done in 715 seconds
[ INFO] 2016-04-06 02:02:18 : Calculating read diagnostics...
[ERROR] 2016-04-06 02:15:31 : Snap failed
Welcome to SNAP version 1.0beta.18.

BigAllocator: allocating too much memory, 291281808 > 291281748
SNAP exited with exit code 1 from line 489 of file SNAPLib/BigAlloc.cpp

Where is this SNAPLib path? I don't find it within the release.

blahah commented 8 years ago

That file is part of SNAP itself. It's a C++ file, so it's already compiled in the release. It's here on the official SNAP repo.

Did you look at memory usage when running? Could you have run out of RAM?

blahah commented 8 years ago

The latest release of transrate v1.0.3 now includes the latest version of SNAP and Salmon, as well as your fix :).

Please update and try your analysis again. Hopefully this will solve the problem - if not please re-open the issue. Many thanks for your patience :)

jorvis commented 8 years ago

OK, so I just tried again with the latest version v1.0.3, and this was the output:

[ INFO] 2016-08-31 23:16:18 : Calculating contig metrics...
[ INFO] 2016-08-31 23:27:36 : Contig metrics:
[ INFO] 2016-08-31 23:27:36 : -----------------------------------
[ INFO] 2016-08-31 23:27:36 : n seqs                      1790973
[ INFO] 2016-08-31 23:27:36 : smallest                        100
[ INFO] 2016-08-31 23:27:36 : largest                      166145
[ INFO] 2016-08-31 23:27:36 : n bases                  2238426230
[ INFO] 2016-08-31 23:27:36 : mean len                    1249.06
[ INFO] 2016-08-31 23:27:36 : n under 200                    7858
[ INFO] 2016-08-31 23:27:36 : n over 1k                    618723
[ INFO] 2016-08-31 23:27:36 : n over 10k                     7289
[ INFO] 2016-08-31 23:27:36 : n with orf                   315711
[ INFO] 2016-08-31 23:27:36 : mean orf percent               29.3
[ INFO] 2016-08-31 23:27:36 : n90                             467
[ INFO] 2016-08-31 23:27:36 : n70                            1306
[ INFO] 2016-08-31 23:27:36 : n50                            2478
[ INFO] 2016-08-31 23:27:36 : n30                            4073
[ INFO] 2016-08-31 23:27:36 : n10                            7376
[ INFO] 2016-08-31 23:27:36 : gc                             0.43
[ INFO] 2016-08-31 23:27:36 : bases n                           0
[ INFO] 2016-08-31 23:27:36 : proportion n                    0.0
[ INFO] 2016-08-31 23:27:36 : Contig metrics done in 678 seconds
[ INFO] 2016-08-31 23:27:36 : Calculating read diagnostics...
[ WARN] 2016-08-31 23:31:49 : Snap index build failed with n = 4 , increasing +1
/local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/snap.rb:144:in `delete': Directory not empty @ dir_s_rmdir - transrate.merged.assemblies (Errno::ENOTEMPTY)
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/snap.rb:144:in `block in build_index'
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/snap.rb:125:in `loop'
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/snap.rb:125:in `build_index'
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/read_metrics.rb:52:in `run'
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/transrater.rb:98:in `read_metrics'
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/cmdline.rb:508:in `read_metrics'
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/cmdline.rb:404:in `block in analyse_assembly'
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/cmdline.rb:400:in `chdir'
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/cmdline.rb:400:in `analyse_assembly'
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/cmdline.rb:38:in `block (2 levels) in run'
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/cmdline.rb:37:in `zip'
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/cmdline.rb:37:in `block in run'
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/cmdline.rb:32:in `chdir'
        from /local/scratch/aplysia/transrate/tool/lib/app/lib/transrate/cmdline.rb:32:in `run'
        from /local/scratch/aplysia/transrate/tool/lib/app/bin/transrate:23:in `<main>'
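The trace above shows `Dir.delete` raising `Errno::ENOTEMPTY`, which is what Ruby does when asked to remove a directory that still contains files. The following is a minimal sketch of that behaviour and of one possible way around it using `FileUtils.rm_rf`. It is not the actual transrate code or a tested patch; `remove_partial_index` and the file names are made up for the example.

```ruby
require 'fileutils'
require 'tmpdir'

# Remove a partially built index directory together with any files inside it.
# Dir.delete only removes empty directories and raises otherwise.
def remove_partial_index(dir)
  FileUtils.rm_rf(dir) if Dir.exist?(dir)
end

Dir.mktmpdir do |tmp|
  index_dir = File.join(tmp, 'partial_index')
  Dir.mkdir(index_dir)
  # Placeholder for whatever a failed index build leaves behind.
  File.write(File.join(index_dir, 'leftover.bin'), 'leftover')

  begin
    Dir.delete(index_dir)
  rescue SystemCallError => e
    puts "Dir.delete failed: #{e.class}"
  end

  remove_partial_index(index_dir)
  puts "still exists? #{Dir.exist?(index_dir)}"
end
```

If this reading of the trace is right, the failed first attempt leaves files in the index directory, so the clean-up before the next attempt needs a recursive delete rather than `Dir.delete`.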

AdamStuckert commented 7 years ago

Did you find a fix for this issue @jorvis? I'm getting this same error.

jorvis commented 7 years ago

No I didn't, and I stopped trying to use the program. If a solution is forthcoming I might give it another shot.

AdamStuckert commented 7 years ago

I'm refreshing this because I'm getting the same issue. I realize it is with SNAP, but do you have any insights @blahah? RAM should not be an issue here, since it was running on a 3 TB node. The number of bases does admittedly seem excessive, but it is a number of assemblies merged together, across a large number of experimental treatments, to see how various assembly methods influence downstream inferences.

[ INFO] 2017-08-29 10:39:07 : Loading assembly: /pylon5/mc3bg6p/astuck/rerun/orthofuse/all-fastas-mergedassembly/merged.fasta
[ INFO] 2017-08-29 10:54:15 : Analysing assembly: /pylon5/mc3bg6p/astuck/rerun/orthofuse/all-fastas-mergedassembly/merged.fasta
[ INFO] 2017-08-29 10:54:15 : Results will be saved in /pylon5/mc3bg6p/astuck/rerun/orthofuse/all-fastas-mergedassembly/merged/merged
[ INFO] 2017-08-29 10:54:15 : Calculating contig metrics...
[ INFO] 2017-08-29 11:16:34 : Contig metrics:
[ INFO] 2017-08-29 11:16:34 : -----------------------------------
[ INFO] 2017-08-29 11:16:34 : n seqs                      2368436
[ INFO] 2017-08-29 11:16:34 : smallest                        201
[ INFO] 2017-08-29 11:16:34 : largest                       18766
[ INFO] 2017-08-29 11:16:34 : n bases                  2172789003
[ INFO] 2017-08-29 11:16:34 : mean len                     917.39
[ INFO] 2017-08-29 11:16:34 : n under 200                       0
[ INFO] 2017-08-29 11:16:34 : n over 1k                    651133
[ INFO] 2017-08-29 11:16:34 : n over 10k                      937
[ INFO] 2017-08-29 11:16:34 : n with orf                   701643
[ INFO] 2017-08-29 11:16:34 : mean orf percent              57.12
[ INFO] 2017-08-29 11:16:34 : n90                             342
[ INFO] 2017-08-29 11:16:34 : n70                             884
[ INFO] 2017-08-29 11:16:34 : n50                            1659
[ INFO] 2017-08-29 11:16:34 : n30                            2651
[ INFO] 2017-08-29 11:16:34 : n10                            4686
[ INFO] 2017-08-29 11:16:34 : gc                             0.45
[ INFO] 2017-08-29 11:16:34 : bases n                     1219873
[ INFO] 2017-08-29 11:16:34 : proportion n                    0.0
[ INFO] 2017-08-29 11:16:34 : Contig metrics done in 1339 seconds
[ INFO] 2017-08-29 11:16:34 : Calculating read diagnostics...
[ WARN] 2017-08-29 11:19:59 : Snap index build failed with n = 4 , increasing +1
/pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/snap.rb:145:in `delete': Directory not empty @ dir_s_rmdir - merged (Errno::ENOTEMPTY)
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/snap.rb:145:in `block in build_index'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/snap.rb:126:in `loop'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/snap.rb:126:in `build_index'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/read_metrics.rb:52:in `run'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/transrater.rb:98:in `read_metrics'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:508:in `read_metrics'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:404:in `block in analyse_assembly'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:400:in `chdir'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:400:in `analyse_assembly'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:38:in `block (2 levels) in run'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:37:in `zip'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:37:in `block in run'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:32:in `chdir'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:32:in `run'
        from /pylon2/mc3bg6p/macmanes/software/transrate-1.0.3-linux-x86_64/lib/app/bin/transrate:23:in `<main>'
make: *** [/pylon5/mc3bg6p/astuck/rerun/orthofuse/all-fastas-mergedassembly/orthotransrate.done] Error 1

elijahlowe commented 6 years ago

Did anyone find an answer to this problem?

rafinhacp commented 4 years ago

Hi, I'm trying to run Transrate_v1.0.3 (the latest version as far as I know) on my transcriptome assembly and I'm getting the same issue as @AdamStuckert and @jorvis. Does anyone know how to fix this?

[ INFO] 2019-11-29 15:48:45 : Loading assembly: /media/raid/raperez/transcriptomes151/data-rafaela/trinityOUT-RMR/Trinity.fasta
[ INFO] 2019-11-29 16:10:24 : Analysing assembly: /media/raid/raperez/transcriptomes151/data-rafaela/trinityOUT-RMR/Trinity.fasta
[ INFO] 2019-11-29 16:10:24 : Results will be saved in /media/raid/raperez/transcriptomes151/data-rafaela/transrate_RMR/Trinity
[ INFO] 2019-11-29 16:10:24 : Calculating contig metrics...
[ INFO] 2019-11-29 16:41:26 : Contig metrics:
[ INFO] 2019-11-29 16:41:26 : -----------------------------------
[ INFO] 2019-11-29 16:41:26 : n seqs                      2747791
[ INFO] 2019-11-29 16:41:26 : smallest                        165
[ INFO] 2019-11-29 16:41:26 : largest                      101616
[ INFO] 2019-11-29 16:41:26 : n bases                  1681598978
[ INFO] 2019-11-29 16:41:26 : mean len                     611.91
[ INFO] 2019-11-29 16:41:26 : n under 200                    1061
[ INFO] 2019-11-29 16:41:26 : n over 1k                    375423
[ INFO] 2019-11-29 16:41:26 : n over 10k                     1083
[ INFO] 2019-11-29 16:41:26 : n with orf                   136230
[ INFO] 2019-11-29 16:41:26 : mean orf percent              41.29
[ INFO] 2019-11-29 16:41:26 : n90                             265
[ INFO] 2019-11-29 16:41:26 : n70                             463
[ INFO] 2019-11-29 16:41:26 : n50                             833
[ INFO] 2019-11-29 16:41:26 : n30                            1503
[ INFO] 2019-11-29 16:41:26 : n10                            3629
[ INFO] 2019-11-29 16:41:26 : gc                             0.44
[ INFO] 2019-11-29 16:41:26 : bases n                           0
[ INFO] 2019-11-29 16:41:26 : proportion n                    0.0
[ INFO] 2019-11-29 16:41:26 : Contig metrics done in 1862 seconds
[ INFO] 2019-11-29 16:41:26 : Calculating read diagnostics...
[ WARN] 2019-11-29 16:52:29 : Snap index build failed with n = 4 , increasing +1
/home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/snap.rb:144:in `delete': Directory not empty @ dir_s_rmdir - Trinity (Errno::ENOTEMPTY)
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/snap.rb:144:in `block in build_index'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/snap.rb:125:in `loop'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/snap.rb:125:in `build_index'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/read_metrics.rb:52:in `run'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/transrater.rb:98:in `read_metrics'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:508:in `read_metrics'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:404:in `block in analyse_assembly'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:400:in `chdir'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:400:in `analyse_assembly'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:38:in `block (2 levels) in run'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:37:in `zip'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:37:in `block in run'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:32:in `chdir'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/lib/transrate/cmdline.rb:32:in `run'
        from /home/raperez/sw/TRANSRATE_v1.0.3/transrate-1.0.3-linux-x86_64/lib/app/bin/transrate:23:in `<main>'

Thank you in advance!

AdamStuckert commented 4 years ago

I never was able to "fix" this per se @rafinhacp. That said, two suggestions.

  1. Restart this. Sometimes it just gets hung here and a restart works.
  2. Use a smaller input. I'm not sure what you are doing, but it looks large. In my experience, this causes problems. You could fix this by trying to subsample your data, or rethinking your assembly approach so there is less variability in the input data.
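To make the second suggestion a little more concrete, here is one generic way to take a uniform random subsample from a large stream of records without loading it all into memory: reservoir sampling. This is only a sketch of the general technique, not something transrate provides. `reservoir_sample` is a hypothetical helper, and the fixed seed is an assumption made so that reruns give the same subset.

```ruby
# Reservoir sampling (Algorithm R): keep k items chosen uniformly at random
# from a stream of unknown length, using memory proportional to k only.
def reservoir_sample(enum, k, rng: Random.new(42))
  sample = []
  enum.each_with_index do |item, i|
    if i < k
      # Fill the reservoir with the first k items.
      sample << item
    else
      # Replace a random slot with probability k / (i + 1).
      j = rng.rand(i + 1) # integer in 0..i
      sample[j] = item if j < k
    end
  end
  sample
end
```

For a single FASTQ file with four lines per read, something like `reservoir_sample(File.foreach('reads.fq').each_slice(4), 1_000_000)` would pick a million reads; the file name and the count are placeholders. For paired-end data the two files would need to be sampled as pairs so that mates stay together, which this sketch does not handle.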

rafinhacp commented 4 years ago

Thanks @AdamStuckert. I'll rethink how to go about it, given that rerunning is not working.

nbat64 commented 8 months ago

Hi, did anyone get a fix for this problem? I have the same issue as @rafinhacp for one sample. Thanks.