foerstner-lab / READemption

A pipeline for the computational evaluation of RNA-Seq data
https://reademption.readthedocs.io

KeyError at alignment stats step #52

Closed gprezza closed 11 months ago

gprezza commented 11 months ago

I am using version 2.0.3 and ran the following command:

reademption align -p 40 --poly_a_clipping -f analysis_reademption -q -g -F

Traceback (most recent call last):
  File "/home/gprezza/bin/anaconda3/envs/reademption/bin/reademption", line 715, in <module>
    main()
  File "/home/gprezza/bin/anaconda3/envs/reademption/bin/reademption", line 22, in main
    args.func(controller)
  File "/home/gprezza/bin/anaconda3/envs/reademption/bin/reademption", line 687, in align_reads
    controller.align_reads()
  File "/home/gprezza/bin/anaconda3/envs/reademption/lib/python3.9/site-packages/reademptionlib/controller.py", line 245, in align_reads
    self._write_alignment_stat_table(
  File "/home/gprezza/bin/anaconda3/envs/reademption/lib/python3.9/site-packages/reademptionlib/controller.py", line 583, in _write_alignment_stat_table
    read_aligner_stats_table.write()
  File "/home/gprezza/bin/anaconda3/envs/reademption/lib/python3.9/site-packages/reademptionlib/readalignerstatstable.py", line 36, in write
    all_stats = self._create_table_all_statistics()
  File "/home/gprezza/bin/anaconda3/envs/reademption/lib/python3.9/site-packages/reademptionlib/readalignerstatstable.py", line 43, in _create_table_all_statistics
    total_stats = self._create_statistics_table_total()
  File "/home/gprezza/bin/anaconda3/envs/reademption/lib/python3.9/site-packages/reademptionlib/readalignerstatstable.py", line 56, in _create_statistics_table_total
    self._get_read_process_number(lib, "total_no_of_reads"),
  File "/home/gprezza/bin/anaconda3/envs/reademption/lib/python3.9/site-packages/reademptionlib/readalignerstatstable.py", line 578, in _get_read_process_number
    return self._read_processing_stats[lib][attribute] * factor
KeyError: 'B_theta_AT_bio1'

This is from a run that resumed a previously interrupted one, which had generated the alignments but ran out of memory at the alignment stats step. I simply resubmitted the job with more memory assigned (I'm working on an HPC cluster where memory has to be booked when submitting a job). Given the high memory requirements, I'm unable to do a clean run as of now, but since the alignment step ended successfully, I don't think the interrupted run should have been an issue?
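Editor's sketch: the last traceback frame indexes a per-library stats dictionary by library name, so the KeyError means that dictionary has no entry for B_theta_AT_bio1. A quick way to see which libraries are absent is shown below. The function name, the second library name, and the example data are hypothetical and not part of READemption; how the real dictionary is populated is not shown in this thread.

```python
def find_missing_libraries(read_processing_stats, library_names):
    """Return the library names that have no entry in the stats dictionary."""
    return [lib for lib in library_names if lib not in read_processing_stats]


# Hypothetical data: the stats cover only one of the two expected libraries.
stats = {"B_theta_AT_bio2": {"total_no_of_reads": 1000}}
libraries = ["B_theta_AT_bio1", "B_theta_AT_bio2"]

missing = find_missing_libraries(stats, libraries)
print(missing)
```

Running a check like this before the stats table is written would turn the bare KeyError into an explicit list of libraries whose processing stats were never recorded.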

Related to this, the alignment stats step used 300 GB of RAM before the error above, which seems a bit excessive, although I do have a lot of reads (800 million total, in 21 samples). Is reademption loading all alignments in memory at the same time to calculate the stats?
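Editor's sketch: whether READemption holds all alignments in memory is not answered in this thread. For illustration only, the pattern below shows how per-library counts can be accumulated while streaming records one at a time, so that memory use depends on the number of counters and not on the number of reads. The record format, the generator, and the library names are hypothetical stand-ins and do not reflect READemption's code.

```python
from collections import Counter


def count_alignments(records):
    """Consume (read_id, is_aligned) pairs one at a time, keeping only counters."""
    counts = Counter()
    for _read_id, is_aligned in records:
        counts["total"] += 1
        if is_aligned:
            counts["aligned"] += 1
    return counts


def fake_records(n):
    """Hypothetical stand-in for lazily reading alignment records from a file."""
    for i in range(n):
        # Every fourth record is marked as unaligned.
        yield (f"read_{i}", i % 4 != 0)


# Each library is processed independently; no record list is ever built.
stats_per_library = {
    lib: count_alignments(fake_records(n))
    for lib, n in [("lib_a", 8), ("lib_b", 4)]
}
print(stats_per_library)
```

Statistics that need per-read state (for example, distinguishing uniquely from multiply aligned reads) would require more bookkeeping than this simple counter.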

Tillsa commented 11 months ago

Hi @gprezza!

The interrupted run might indeed be an issue if it got interrupted before the alignment finished. Do you have bam files in your alignments folder for each of your input read libraries? I would suggest that you rerun the entire analysis (including creating a new reademption analysis project with the "create" subcommand). We are aware that the current alignment stats module needs a lot of memory, especially with such a high number of reads (800 million), and will try to minimize memory consumption in the future.

Best wishes, Till

gprezza commented 11 months ago

Hi Till, thanks for the reply.

Yes, all bam files are there, including the B_theta_AT_bio1 one. I also deleted all files in output/align/reports_and_stats/ and output/align/reports_and_stats/stats_data_json/ before re-running.

I guess the best option is to do a clean run as you suggest. I'll hopefully manage to run it over the weekend, when more memory should be available on our cluster. I'll keep you updated.

gprezza commented 11 months ago

Probably unsurprisingly, a fresh re-run completed successfully. Let's count this issue as my two cents in favour of reducing the memory consumption of the alignment stats module. :)

Cheers, Gianluca