"Cannot allocate memory" during stringtie

ktpolanski commented 9 months ago

Operating System

Other Linux (18.04.4)

Workflow Version

v1.0.1-ga6a1b69

Workflow Execution

Command line

EPI2ME Version

No response

CLI command run

~/nextflow-23.12.0-edge-all run epi2me-labs/wf-single-cell \
    --fastq fastq/ \
    --kit_name multiome \
    --kit_version v1 \
    --expected_cells 5000 \
    --ref_genome_dir /home/ubuntu/cellranger/GRCh38-2020-A/ \
    --sample $SAMPLE \
    -c openstack.cfg \
    -profile standard

Workflow Execution - CLI Execution Profile

standard (default)

What happened?

I'm running the workflow locally on an OpenStack instance with a decent number of cores and RAM. The stringtie process appears to exhaust the instance's resources in a way I can't pin down. After my first sample failed this way, I went into the docs and found the recommendation to make a local config. I did so, specifying fewer cores (23 vs 26) and less RAM (200GB vs ~220GB) than the instance has available. Three of the four samples ran fine, including past the stringtie step, but one kept failing there repeatedly.
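
For illustration only, a local config of the kind described above might be written as follows. The actual contents of openstack.cfg were not posted, so the file name openstack.example.cfg is hypothetical and the values simply mirror the numbers quoted in this comment. The sketch assumes Nextflow's executor scope, where a $local block caps what the local executor may use.

```shell
# Hypothetical sketch: the real openstack.cfg was not shared in this thread.
# Writes a Nextflow config that caps the local executor at 23 CPUs and 200 GB.
# The quoted 'EOF' stops the shell from expanding $local inside the heredoc.
cat > openstack.example.cfg <<'EOF'
executor {
    $local {
        cpus = 23
        memory = '200 GB'
    }
}
EOF
```

A file like this would be passed with -c, as in the command above. Note that it limits what Nextflow schedules in total; it does not by itself change how much memory an individual process requests.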

I was able to get everything across the finish line by -resume'ing the pipeline, sometimes repeatedly. Still, it would be nice to not have to babysit this.
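
As a stopgap for the manual re-running described above, a small shell wrapper could repeat the command until it succeeds. This is a minimal sketch, not something the workflow provides: the retry function name and the attempt limit are assumptions for illustration, and it relies on the relaunched command including -resume so that completed tasks are reused.

```shell
# retry MAX CMD [ARGS...]: run CMD up to MAX times, stopping at the first success.
# Hypothetical helper for illustration; not part of Nextflow or wf-single-cell.
retry() {
    max=$1
    shift
    attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        echo "retrying (attempt $attempt of $max)" >&2
    done
}

# Example usage (not executed here), reusing the command from this issue:
#   retry 3 ~/nextflow-23.12.0-edge-all run epi2me-labs/wf-single-cell \
#       --fastq fastq/ --kit_name multiome --kit_version v1 \
#       --expected_cells 5000 \
#       --ref_genome_dir /home/ubuntu/cellranger/GRCh38-2020-A/ \
#       --sample $SAMPLE -c openstack.cfg -profile standard -resume
```

This only automates the babysitting; it does not address whatever causes the stringtie step to fail.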

Attached below is the .command.err for an example erroring job.

Relevant log output

WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
[15:30:45 - matplotlib.font_manager] generated new fontManager
/home/epi2melabs/conda/lib/python3.8/site-packages/umap/distances.py:1063: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/epi2melabs/conda/lib/python3.8/site-packages/umap/distances.py:1071: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/epi2melabs/conda/lib/python3.8/site-packages/umap/distances.py:1086: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
/home/epi2melabs/conda/lib/python3.8/site-packages/umap/umap_.py:660: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  @numba.jit()
[15:30:58 - workflow_glue] Starting entrypoint.
Traceback (most recent call last):
  File "/home/ubuntu/.nextflow/assets/epi2me-labs/wf-single-cell/bin/workflow_glue/process_bam_for_stringtie.py", line 41, in main
    bam_out.write(align)
  File "pysam/libcalignmentfile.pyx", line 1708, in pysam.libcalignmentfile.AlignmentFile.write
  File "pysam/libcalignmentfile.pyx", line 1740, in pysam.libcalignmentfile.AlignmentFile.write
OSError: sam_write1 failed with error code -1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/.nextflow/assets/epi2me-labs/wf-single-cell/bin/workflow-glue", line 7, in <module>
    cli()
  File "/home/ubuntu/.nextflow/assets/epi2me-labs/wf-single-cell/bin/workflow_glue/__init__.py", line 72, in cli
    args.func(args)
  File "/home/ubuntu/.nextflow/assets/epi2me-labs/wf-single-cell/bin/workflow_glue/process_bam_for_stringtie.py", line 41, in main
    bam_out.write(align)
  File "pysam/libcalignmentfile.pyx", line 1750, in pysam.libcalignmentfile.AlignmentFile.__exit__
  File "pysam/libcalignmentfile.pyx", line 1682, in pysam.libcalignmentfile.AlignmentFile.close
OSError: [Errno 12] Cannot allocate memory
[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1_sam] Parse error at line 1762314
[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1_sam] Parse error at line 1762314
samtools bam2fq: Failed to read bam record
samtools bam2fq: Error writing to FASTx files.: No such file or directory
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 1508907 reads

Application activity log entry

No response

nrhorner commented 9 months ago

Hi @ktpolanski

Sorry that you're having issues with the workflow. The error OSError: sam_write1 failed with error code -1 is sometimes seen when there is no disk space left. I'm not sure that is the issue here, since -resume worked. Could you please check that you have enough disk space and that there is space left in $TMPDIR?
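
One way to run that check from the shell is sketched below. It assumes temporary files go to /tmp when $TMPDIR is unset, and that the workflow is launched from the current directory, where Nextflow creates its work directory by default. Inode exhaustion can also surface as a write failure, so the sketch reports that as well.

```shell
# Location used for temporary files; falls back to /tmp when TMPDIR is unset.
tmp_location="${TMPDIR:-/tmp}"

# Free space and free inodes on the filesystem holding the temporary directory.
df -h "$tmp_location"
df -i "$tmp_location"

# Free space on the filesystem holding the launch directory.
df -h .
```

When the workflow runs its processes in containers, the temporary directory seen inside the container may differ from the host's, so these numbers are only a first check.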

ktpolanski commented 9 months ago

I definitely had space on the drive itself. My $TMPDIR is currently unset. I guess I can try the export TMPDIR=/some/path/on/large/drive suggestion that came up in a different issue if I run into this again?

nrhorner commented 9 months ago

Yes, that's what I would try first. Please let me know if that does not work for you.

ktpolanski commented 9 months ago

So I tried a combination of things - dialled back the resource use further in the config in case that matters, and added the TMPDIR as suggested, just ahead of the nextflow call:

export TMPDIR=/mnt/tmpdir

~/nextflow-23.12.0-edge-all run epi2me-labs/wf-single-cell \
    --fastq fastq/ \
    --kit_name multiome \
    --kit_version v1 \
    --expected_cells 5000 \
    --ref_genome_dir /home/ubuntu/cellranger/GRCh38-2020-A/ \
    --sample $SAMPLE \
    -c openstack.cfg \
    -profile standard

Unfortunately I encountered the same error again. I'd like to think it's not space related because the drive has over a terabyte free on it right now:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
[...]
/dev/vdb        2.0T  790G  1.1T  42% /mnt
[...]

ktpolanski commented 7 months ago

Of note, I'm rerunning the samples with a slightly modified (#85) v1.1.0 under singularity, and the stringtie steps all went through without any hiccups.
