ablab / rnaquast

Quality assessment of de novo transcriptome assemblies from RNA-Seq data
http://cab.spbu.ru/software/rnaquast

MemoryError #12

Closed dcandiah closed 2 years ago

dcandiah commented 2 years ago

Hello!

I'm trying to check the quality of my de novo assembly with rnaQUAST, installed with conda. While it was loading the reference genome, this happened:


System information:
  OS: Linux-5.11.0-40-generic-x86_64-with-glibc2.31 (linux_64)
  Python version: 3.9.7
  CPUs number: 16

External tools:
  matplotlib: 3.5.0
  joblib: 1.1.0
  gffutils: 0.10.1
  blastn: 2.12.0+
  makeblastdb: 2.12.0+
  gmap: 2021-08-25

Started: 2021-11-23 11:14:06

Logging to /rnaQUAST_results/logs/rnaQUAST.log

2021-11-23 11:14:06 Getting reference...

Traceback (most recent call last):
  File "/bin/rnaQUAST.py", line 348, in <module>
    return_code = main_utils()
  File "/bin/rnaQUAST.py", line 98, in main_utils
    reference_dict = UtilsGeneral.list_to_dict(fastaparser.read_fasta(args.reference))
  File "/share/rnaquast-2.2.1-0/general/UtilsGeneral.py", line 105, in list_to_dict
    for e in l:
  File "/share/rnaquast-2.2.1-0/quast_libs/fastaparser.py", line 202, in read_fasta
    yield name, "".join(seq)
MemoryError

ERROR! Exception caught!


My reference genome is 53 GB in size. So my question is:

What are the minimum requirements I need?

Thanks!

andrewprzh commented 2 years ago

Dear Daniel

I have never worked with reference genomes of this size. Just out of curiosity, what is this genome?

I cannot say for sure, but my guess is that it will be loaded into memory entirely. So 64 GB may still not be enough (considering other data structures, etc.).

I'll try to check the code and probably avoid loading it into RAM. However, I cannot be sure that other tools like GMAP will not fail on a reference of this size.
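As an illustration of what "not loading it into RAM" could look like, here is a minimal sketch that streams a FASTA file line by line and keeps only per-sequence lengths. It is not rnaQUAST's actual code; the function name `fasta_lengths` and the demo records are hypothetical.

```python
import io


def fasta_lengths(handle):
    """Yield (name, length) for each FASTA record, one line at a time.

    Only the current line is held in memory, so peak usage does not
    grow with the size of the reference.
    """
    name, length = None, 0
    for line in handle:
        if line.startswith(">"):
            if name is not None:
                yield name, length
            fields = line[1:].split()
            # Use the ID before the first whitespace as the record name.
            name, length = (fields[0] if fields else ""), 0
        elif name is not None:
            length += len(line.strip())
    if name is not None:
        yield name, length


# Hypothetical two-record input, used only to exercise the function.
demo = io.StringIO(">chr1 test\nACGT\nAC\n>chr2\nGGG\n")
lengths = dict(fasta_lengths(demo))
print(lengths)
```

By contrast, the traceback above shows each sequence being joined into one string and collected into a dictionary, so memory use scales with the whole file.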

Best Andrey

dcandiah commented 2 years ago

Thanks Andrey

I'm working with the zebrafish reference genome.

I tried many times, and the program always failed before aligning.

Maybe in January I'll have access to 180 GB of RAM; I hope it works then.

Thanks for your response, I'm new and sometimes I need help to understand the errors.

Best Regards Daniel

andrewprzh commented 2 years ago

At some point I also worked with Zebrafish, but I recall this genome was much smaller https://www.ensembl.org/Danio_rerio/Info/Annotation

Could you point out which genome version you've used?

dcandiah commented 2 years ago

I'm using Danio_rerio.GRCz11.dna.toplevel.fa.gz, Here

andrewprzh commented 2 years ago

Yes, this file contains all patches and haplotypes - see the README file located in the folder you indicated. You don't need these for running rnaQUAST (or any other tool). Please use dna.primary_assembly, as suggested in the README:

Primary assembly contains all toplevel sequence regions excluding haplotypes and patches. This file is best used for performing sequence similarity searches where patch and haplotype sequences would confuse analysis. If the primary assembly file is not present, that indicates that there are no haplotype/patch regions, and the 'toplevel' file is equivalent.

I.e. this one http://ftp.ensembl.org/pub/release-104/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
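To see how much the two downloads differ before running anything, one option is to count records and bases directly from the gzipped files. The sketch below is an assumption-laden helper (the name `fasta_totals` is hypothetical) and is exercised here on a tiny in-memory file rather than the real genome.

```python
import gzip
import io


def fasta_totals(path_or_file):
    """Return (number of records, total bases) for a gzipped FASTA.

    The file is decompressed as a stream, so it never has to fit in RAM.
    """
    n_records = 0
    n_bases = 0
    with gzip.open(path_or_file, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                n_records += 1
            else:
                n_bases += len(line.strip())
    return n_records, n_bases


# Build a tiny gzipped FASTA in memory (made-up data) to try the function.
buf = io.BytesIO()
with gzip.open(buf, "wt") as gz:
    gz.write(">1\nACGT\nAC\n>MT\nGG\n")
buf.seek(0)
totals = fasta_totals(buf)
print(totals)
```

Running it on both the toplevel and the primary_assembly file (by passing their paths) would show the difference in sequence count and total length.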

dcandiah commented 2 years ago

Thanks, you were right.

rnaQUAST got further than before, but unfortunately it dies at a later stage.


Starting alignment
No paths found for TRINITY_DN10099_c0_g1_i1
. . .
No paths found for TRINITY_DN114570_c0_g1_i2
Signal received: SIGSEGV
Calling Access_emergency_cleanup
Problem sequence: TRINITY_DN68309_c5_g1_i1 (282 bp)


I guess I need more RAM.
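One way to test that guess is to copy the sequence GMAP reports as the problem into its own FASTA file and rerun the aligner on that single record. Below is a minimal sketch of the extraction step only; the function name `extract_record` and the demo sequences are hypothetical, and only the transcript ID comes from the log above.

```python
import io


def extract_record(handle, wanted_id, out):
    """Copy the FASTA record whose ID is wanted_id from handle to out.

    Returns True if the record was found. Reading stops at the first
    header after the wanted record.
    """
    copying = False
    found = False
    for line in handle:
        if line.startswith(">"):
            if found:
                break  # already past the wanted record
            fields = line[1:].split()
            copying = bool(fields) and fields[0] == wanted_id
            found = copying
        if copying:
            out.write(line)
    return found


# Made-up transcripts; only the middle ID matches the one in the log.
transcripts = io.StringIO(
    ">TRINITY_DN1_c0_g1_i1 len=4\nACGT\n"
    ">TRINITY_DN68309_c5_g1_i1 len=6\nACG\nTTT\n"
    ">TRINITY_DN2_c0_g1_i1 len=2\nGG\n"
)
single = io.StringIO()
found = extract_record(transcripts, "TRINITY_DN68309_c5_g1_i1", single)
print(found)
print(single.getvalue(), end="")
```

If the isolated record still crashes the aligner, the input is the likely cause; if it aligns fine on its own, memory or hardware becomes the more plausible suspect.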

andrewprzh commented 2 years ago

Could you show the full log (rnaQUAST.log), please?

dcandiah commented 2 years ago

The rnaQUAST.log is:

/rnaQUAST-2.2.1/rnaQUAST.py --transcripts /transcripts/DeNovoZebrafishMapeados.fasta --reference /Danio_rerio.GRCz11.dna.primary_assembly.fa --gtf /Danio_rerio.GRCz11.104.gtf --gmap_index /DB_Directory/Zebrafish_primary_DB --output_dir /rnaQUAST_results --disable_infer_genes --disable_infer_transcripts -t 12

rnaQUAST: 2.2.1

System information:
  OS: Linux-5.11.0-40-generic-x86_64-with-glibc2.29 (linux_64)
  Python version: 3.8.10
  CPUs number: 16

External tools:
  matplotlib: 3.5.0
  joblib: 1.1.0
  gffutils: 0.10.1
  blastn: 2.12.0+
  makeblastdb: 2.12.0+
  gmap: 2021-08-25

Started: 2021-11-29 22:07:48

Logging to /rnaQUAST_results/logs/rnaQUAST.log

2021-11-29 22:07:48 Getting reference... Done.
Using non strand specific transcripts...

2021-11-29 22:07:55 Loading sqlite3 db by gffutils from /rnaQUAST_results/Danio_rerio.GRCz11.104.db to memory... Done.

2021-11-29 22:08:05 Getting GENE DATABASE metrics... Done.

Sets maximum intron size equal 677718. Default is 1500000 bp.

2021-11-29 22:08:45 Sorting exons attributes...

WARNING: Number of chromosomes / scaffolds more than 100.
Sorted in 1. Sorted in 10. Sorted in 11. Sorted in 12. Sorted in 13. Sorted in 14. Sorted in 15. Sorted in 16. Sorted in 17. Sorted in 18. Sorted in 19. Sorted in 2. Sorted in 20. Sorted in 21. Sorted in 22. Sorted in 23. Sorted in 24. Sorted in 25. Sorted in 3. Sorted in 4. Sorted in 5. Sorted in 6. Sorted in 7. Sorted in 8. Sorted in 9. Sorted in MT. Sorted in KN149696.2. Sorted in KN147651.2. Sorted in KN149690.1. Sorted in KN149686.1.
(a lot of messages like this)
Sorted in KN150525.1.
Done.

2021-11-29 22:09:03 Getting transcripts from /DeNovoZebrafishMapeados.fasta... Done.

2021-11-29 22:09:05 Getting upper case fasta... saved to /rnaQUAST_results/tmp/Danio_rerio.GRCz11.dna.primary_assembly.upper.fa

2021-11-29 22:09:10 Aligning DeNovoZebrafishMapeados to Danio_rerio.GRCz11.dna.primary_assembly.upper... log can be found in /rnaQUAST_results/logs/gmap.DeNovoZebrafishMapeados.err.log.


The err.log is:

GMAP version 2021-08-25 called with args: gmap.avx512 -D /rnaQUAST_results/tmp -d Danio_rerio.GRCz11.dna.primary_assembly.upper /EnsambleFinal/DeNovoZebrafishMapeados.fasta --format=1 -t 12 -O
Checking compiler assumptions for SSE2: 6B8B4567 327B23C6 xor=59F066A1
Checking compiler assumptions for SSE4.1: -103 -58 max=198 => compiler zero extends
Checking compiler options for SSE4.2: 6B8B4567 __builtin_clz=1 __builtin_ctz=0 _mm_popcnt_u32=17 __builtin_popcount=17
Finished checking compiler assumptions
Pre-loading compressed genome (oligos)......done (515,051,780 bytes, 125746 pages, 0.00 sec)
Looking for genome Zebrafish_primary_DB in directory /rnaQUAST_results/tmp/Danio_rerio.GRCz11.dna.primary_assembly.upper
Looking for index files in directory /rnaQUAST_results/tmp/Danio_rerio.GRCz11.dna.primary_assembly.upper
Pointers file is Zebrafish_primary_DB.ref153offsets64meta
Offsets file is Zebrafish_primary_DB.ref153offsets64strm
Positions file is Zebrafish_primary_DB.ref153positions
Offsets compression type: bitpack64
Allocating memory for ref offset pointers, kmer 15, interval 3...done (134,217,744 bytes, 0.03 sec)
Allocating memory for ref offsets, kmer 15, interval 3...done (417,067,168 bytes, 0.10 sec)
Pre-loading ref positions, kmer 15, interval 3......done (1,824,651,936 bytes, 0.01 sec)
Starting alignment
No paths found for TRINITY_DN10099_c0_g1_i1
No paths found for TRINITY_DN10328_c4_g1_i1
No paths found for TRINITY_DN10663_c0_g1_i2
No paths found for TRINITY_DN108272_c0_g2_i2
No paths found for TRINITY_DN108917_c1_g1_i1
No paths found for TRINITY_DN112556_c0_g1_i1
No paths found for TRINITY_DN114730_c1_g3_i1
No paths found for TRINITY_DN115251_c0_g1_i1
No paths found for TRINITY_DN116086_c0_g1_i2
No paths found for TRINITY_DN119274_c2_g1_i1
No paths found for TRINITY_DN1220_c8_g1_i1
No paths found for TRINITY_DN122183_c0_g1_i1
No paths found for TRINITY_DN122545_c0_g1_i3
No paths found for TRINITY_DN125788_c1_g1_i2
No paths found for TRINITY_DN12875_c0_g1_i4
No paths found for TRINITY_DN1296_c26_g1_i10
No paths found for TRINITY_DN129850_c0_g1_i3
No paths found for TRINITY_DN139407_c2_g1_i1
No paths found for TRINITY_DN14200_c6_g1_i1
No paths found for TRINITY_DN14441_c1_g3_i1
No paths found for TRINITY_DN148457_c0_g1_i1
No paths found for TRINITY_DN150636_c1_g1_i4
No paths found for TRINITY_DN177933_c1_g1_i1
No paths found for TRINITY_DN17868_c6_g1_i1
No paths found for TRINITY_DN182463_c4_g1_i1
No paths found for TRINITY_DN18635_c5_g1_i3
No paths found for TRINITY_DN19457_c0_g1_i1
No paths found for TRINITY_DN201858_c5_g1_i1
No paths found for TRINITY_DN20345_c0_g1_i2
No paths found for TRINITY_DN21066_c0_g1_i2
No paths found for TRINITY_DN21199_c4_g1_i1
No paths found for TRINITY_DN212706_c0_g1_i1
No paths found for TRINITY_DN224964_c0_g1_i1
No paths found for TRINITY_DN22_c13_g1_i6
No paths found for TRINITY_DN232957_c1_g1_i1
No paths found for TRINITY_DN241732_c0_g1_i1
No paths found for TRINITY_DN24599_c0_g1_i2
No paths found for TRINITY_DN252198_c4_g1_i1
No paths found for TRINITY_DN252652_c0_g1_i1
No paths found for TRINITY_DN252652_c0_g1_i4
No paths found for TRINITY_DN2539_c0_g1_i11
No paths found for TRINITY_DN28328_c0_g1_i3
No paths found for TRINITY_DN2931_c3_g1_i1
No paths found for TRINITY_DN29692_c6_g1_i1
No paths found for TRINITY_DN3025_c1_g1_i6
No paths found for TRINITY_DN34840_c0_g2_i1
No paths found for TRINITY_DN35282_c0_g1_i1
No paths found for TRINITY_DN3584_c1_g2_i1
No paths found for TRINITY_DN35880_c5_g1_i1
No paths found for TRINITY_DN36843_c4_g1_i1
No paths found for TRINITY_DN37734_c5_g2_i1
No paths found for TRINITY_DN4001_c1_g1_i5
No paths found for TRINITY_DN4143_c21_g1_i1
No paths found for TRINITY_DN43834_c6_g1_i1
No paths found for TRINITY_DN4523_c0_g1_i13
No paths found for TRINITY_DN46862_c0_g2_i1
No paths found for TRINITY_DN48084_c1_g1_i6
No paths found for TRINITY_DN5057_c0_g1_i2
No paths found for TRINITY_DN50718_c0_g2_i1
No paths found for TRINITY_DN51781_c0_g1_i2
No paths found for TRINITY_DN53384_c1_g1_i3
No paths found for TRINITY_DN5573_c17_g1_i2
No paths found for TRINITY_DN6067_c0_g2_i2
No paths found for TRINITY_DN63122_c0_g1_i3
No paths found for TRINITY_DN63956_c2_g1_i7
No paths found for TRINITY_DN65794_c0_g1_i1
No paths found for TRINITY_DN68967_c1_g2_i1
No paths found for TRINITY_DN7360_c1_g1_i1
No paths found for TRINITY_DN7416_c0_g1_i5
No paths found for TRINITY_DN75064_c0_g2_i1
No paths found for TRINITY_DN7507_c2_g1_i1
No paths found for TRINITY_DN7735_c0_g1_i2
No paths found for TRINITY_DN83597_c3_g1_i1
No paths found for TRINITY_DN86964_c2_g1_i1
No paths found for TRINITY_DN90474_c2_g1_i1
No paths found for TRINITY_DN905_c0_g1_i2
No paths found for TRINITY_DN91091_c2_g1_i1
No paths found for TRINITY_DN91197_c0_g1_i1
No paths found for TRINITY_DN95541_c1_g2_i1
No paths found for TRINITY_DN99610_c1_g5_i1
No paths found for TRINITY_DN9982_c0_g1_i1
No paths found for TRINITY_DN66270_c0_g1_i1
No paths found for TRINITY_DN126135_c1_g1_i1
No paths found for TRINITY_DN139077_c0_g2_i1
No paths found for TRINITY_DN114570_c0_g1_i2
Signal received: SIGSEGV
Calling Access_emergency_cleanup
Problem sequence: TRINITY_DN68309_c5_g1_i1 (282 bp)

I also tried GMAP 2020-10-14, as in #7, and the same thing happens.
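For long error logs like this one, a small script can pull out the parts that matter: how many transcripts got no alignment, whether the run crashed, and which sequence was being processed at that moment. The sketch below relies only on the message wording visible in the log above; the function name `summarize_gmap_log` is hypothetical, and the patterns may need adjusting for other GMAP versions.

```python
import re

# Patterns based on the messages seen in the log above (an assumption,
# not a documented GMAP log format).
NO_PATHS_RE = re.compile(r"No\s+paths\s+found\s+for\s+(\S+)")
PROBLEM_RE = re.compile(r"Problem\s+sequence:\s*(\S+)\s*\((\d+)\s*bp\)")


def summarize_gmap_log(text):
    """Summarize unaligned transcripts and crash information in a log."""
    match = PROBLEM_RE.search(text)
    problem = (match.group(1), int(match.group(2))) if match else None
    return {
        "unaligned": NO_PATHS_RE.findall(text),
        "crashed": "SIGSEGV" in text,
        "problem": problem,
    }


# A shortened excerpt of the log above, used as sample input.
sample = (
    "Starting alignment\n"
    "No paths found for TRINITY_DN10099_c0_g1_i1\n"
    "No paths found for TRINITY_DN114570_c0_g1_i2\n"
    "Signal received: SIGSEGV\n"
    "Calling Access_emergency_cleanup\n"
    "Problem sequence: TRINITY_DN68309_c5_g1_i1 (282 bp)\n"
)
summary = summarize_gmap_log(sample)
print(len(summary["unaligned"]), summary["crashed"], summary["problem"])
```

Because the patterns allow any whitespace between words, the same function works whether the log has one message per line or has been wrapped mid-message.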

dcandiah commented 2 years ago

Hello again

As my cousin suggested, I moved my two 32 GB RAM modules from slots 1 and 3 to slots 2 and 4, and the problem was solved. I'm not sure what happened. I don't think it was the change of position itself; maybe the modules were not seated properly.

Thanks for your time, and thanks again for the software. It's so cool to have all these metrics now :)

Daniel

andrewprzh commented 2 years ago

Glad everything turned out well!