NUStatBioinfo / DegNorm

Normalizing RNA degradation in RNA-seq data
https://nustatbioinfo.github.io/DegNorm/

Memory issues with mpi4py and pickle #45

Open alyxgray7 opened 2 years ago

alyxgray7 commented 2 years ago

Hello,

Thank you for providing this tool - we'll be working with 141 highly degraded human RNA-seq samples (the majority have RIN < 5.0) that would really benefit from this normalization approach. In preparation, we've been testing the DegNorm MPI module on a publicly available RNA-seq dataset (90 samples, GSE68086) and have run into a few problems. Even after reducing the run to 6 samples, we hit the same set of memory errors.

Each HPC job was submitted across 4 nodes, with a maximum of 20 cores/node and 264 GB of memory. Each job runs for ~14 hours and then errors out around the same spot with one of the two messages below. I'm happy to provide the full log files if that's helpful.

### First example error type
DegNorm MPI (11/11/2021 06:05:06) ---- (1/4) -- Coverage merge successful. Number of loaded coverage matrices: 57028
DegNorm MPI (11/11/2021 06:06:05) ---- (1/4) -- Saving gene-exon metadata to ./degnorm_11092021_154654/gene_exon_metadata.csv
DegNorm MPI (11/11/2021 06:06:07) ---- (1/4) -- Saving original read counts to ./degnorm_11092021_154654/read_counts.csv
DegNorm MPI (11/11/2021 06:06:09) ---- (1/4) -- Determining genes to include in DegNorm coverage curve approximation.
DegNorm MPI (11/11/2021 06:07:02) ---- (1/4) -- DegNorm will run on 34313 genes, downsampling rate = 1 / 1, with baseline selection.
Traceback (most recent call last):
  File "/home/exacloud/software/spack/opt/spack/linux-centos7-ivybridge/gcc-8.3.1/py-degnorm-master-dxsa7colkcqyigrffo2b6d2hyh4o6zhr/bin/degnorm_mpi", line 34, in <module>
    sys.exit(load_entry_point('DegNorm==0.1.4', 'console_scripts', 'degnorm_mpi')())
  File "/home/exacloud/software/spack/opt/spack/linux-centos7-ivybridge/gcc-8.3.1/py-degnorm-master-dxsa7colkcqyigrffo2b6d2hyh4o6zhr/lib/python3.6/site-packages/degnorm/__main_mpi__.py", line 402, in main
    pkl.dump(gene_cov_dict, f)
MemoryError
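
This first failure happens while pickling the entire coverage dictionary in a single call at `__main_mpi__.py:402`. One possible workaround sketch, assuming `gene_cov_dict` maps gene IDs to per-gene NumPy coverage matrices (we haven't verified its exact structure), would be to stream the dictionary to disk in smaller sub-dicts so no single `pickle.dump` call has to hold the whole serialization state at once:

```python
import pickle

def dump_dict_chunked(cov_dict, path, chunk_size=5000):
    """Stream a large {gene: coverage matrix} dict to disk as a
    sequence of smaller pickles, capping the pickler's peak memory.
    NOTE: cov_dict's structure is assumed, not confirmed from DegNorm.
    """
    genes = list(cov_dict)
    with open(path, "wb") as f:
        for i in range(0, len(genes), chunk_size):
            chunk = {g: cov_dict[g] for g in genes[i:i + chunk_size]}
            pickle.dump(chunk, f, protocol=4)  # protocol 4 handles large frames

def load_dict_chunked(path):
    """Read the successive pickles back into one dict until EOF."""
    merged = {}
    with open(path, "rb") as f:
        while True:
            try:
                merged.update(pickle.load(f))
            except EOFError:
                return merged
```

A consumer that iterates over genes anyway could also load the sub-dicts lazily instead of merging them, which keeps the read side bounded as well.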

### Second example error type
DegNorm MPI (11/18/2021 08:31:07) ---- (1/5) -- Coverage merge successful. Number of loaded coverage matrices: 57028
DegNorm MPI (11/18/2021 08:33:19) ---- (1/5) -- Saving gene-exon metadata to ./degnorm_11162021_210839/gene_exon_metadata.csv
DegNorm MPI (11/18/2021 08:33:21) ---- (1/5) -- Saving original read counts to ./degnorm_11162021_210839/read_counts.csv
DegNorm MPI (11/18/2021 08:33:24) ---- (1/5) -- Determining genes to include in DegNorm coverage curve approximation.
DegNorm MPI (11/18/2021 08:34:40) ---- (1/5) -- DegNorm will run on 34313 genes, downsampling rate = 1 / 1, with baseline selection.
DegNorm MPI (11/18/2021 08:53:43) ---- (1/5) -- Begin executing NMFOA algorithm...
DegNorm MPI (11/18/2021 08:53:44) ---- (1/5) -- host will be responsible for 6863 genes.
DegNorm MPI (11/18/2021 08:53:44) ---- (1/5) -- worker node 1 will be responsible for 6863 genes.
Traceback (most recent call last):
  File "/home/exacloud/software/spack/opt/spack/linux-centos7-ivybridge/gcc-8.3.1/py-degnorm-master-dxsa7colkcqyigrffo2b6d2hyh4o6zhr/bin/degnorm_mpi", line 34, in <module>
    sys.exit(load_entry_point('DegNorm==0.1.4', 'console_scripts', 'degnorm_mpi')())
  File "/home/exacloud/software/spack/opt/spack/linux-centos7-ivybridge/gcc-8.3.1/py-degnorm-master-dxsa7colkcqyigrffo2b6d2hyh4o6zhr/lib/python3.6/site-packages/degnorm/__main_mpi__.py", line 436, in main
    , skip_baseline_selection=args.skip_baseline_selection)
  File "/home/exacloud/software/spack/opt/spack/linux-centos7-ivybridge/gcc-8.3.1/py-degnorm-master-dxsa7colkcqyigrffo2b6d2hyh4o6zhr/lib/python3.6/site-packages/degnorm/nmf_mpi.py", line 629, in run_gene_nmfoa_mpi
    , tag=333 + worker_id)
  File "mpi4py/MPI/Comm.pyx", line 1156, in mpi4py.MPI.Comm.send
  File "mpi4py/MPI/msgpickle.pxi", line 173, in mpi4py.MPI.PyMPI_send
  File "mpi4py/MPI/msgpickle.pxi", line 106, in mpi4py.MPI.Pickle.dump
  File "mpi4py/MPI/msgbuffer.pxi", line 44, in mpi4py.MPI.downcast
OverflowError: integer 17879689247 does not fit in 'int'
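
This second failure looks like the known mpi4py limitation where a pickled message's byte count must fit in a C `int` (< 2^31), and the ~18 GB payload here (17879689247 bytes) clearly doesn't. A rough workaround sketch, in case it helps anyone hitting the same wall (these helpers are hypothetical, not part of DegNorm or mpi4py), is to slice the pickled bytes into sub-limit chunks:

```python
import pickle
from mpi4py import MPI

CHUNK = 2**30  # 1 GiB per message; safely under MPI's C int size limit

def send_large(comm, obj, dest, tag):
    """Hypothetical helper: send an arbitrarily large picklable object
    by splitting its pickled byte string into chunks that each fit
    comm.send's pickle-path message-size limit."""
    data = pickle.dumps(obj, protocol=4)
    n_chunks = (len(data) + CHUNK - 1) // CHUNK  # ceiling division
    comm.send(n_chunks, dest=dest, tag=tag)
    for i in range(n_chunks):
        comm.send(data[i * CHUNK:(i + 1) * CHUNK], dest=dest, tag=tag)

def recv_large(comm, source, tag):
    """Counterpart: reassemble the chunks and unpickle."""
    n_chunks = comm.recv(source=source, tag=tag)
    data = b"".join(comm.recv(source=source, tag=tag) for _ in range(n_chunks))
    return pickle.loads(data)
```

Newer mpi4py releases (3.1+) also ship `mpi4py.util.pkl5`, whose `Intracomm` wrapper handles large messages via pickle protocol 5 out-of-band buffers, though on Python older than 3.8 it depends on the `pickle5` backport, which may not be an option in a Python 3.6 environment like this one.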

DegNorm command being used:

degnorm_mpi --bam-dir $DATADIR -g $GTF -o $OUTDIR -p 22 --nmf-iter 100 --minimax-coverage 5

Thank you!

Alyx @FiReTiTi

marce-sarrias commented 10 months ago

Hello,

Following this thread. I encountered some issues when attempting to run the MPI version, specifically around package compatibility: DegNorm seems to require Python 3.6, which didn't match the Python versions available on our clusters. Have you faced similar issues, and if so, how did you manage to resolve them?

In the end, I opted to use the standard version, but it takes a really long time!

Thank you!

Marcela

alyxgray7 commented 9 months ago

Hi Marcela @marce-sarrias,

We actually had our HPC staff build an environment with a compatible Python version that could run the MPI module. Unfortunately, this was ultimately unsuccessful due to the memory issues described above, which we were never able to resolve...

If memory serves me correctly, the standard version would also error out with the same resource issues. Our single successful run (I think with only 3 samples?) also took a very, very long time! I'm glad to hear the standard version is working okay for you.

Best of luck, Alyx