"Too many open files" error when working with a scRNA-seq dataset

kotai-michael commented 5 months ago

Hello.

I tried to reproduce the MAESTER publication but I hit some issues on having too many open file limit on a scRNA-seq dataset.

My steps are,

download the dataset
run with Cell Ranger
filter the bam file with mito gene only 1-SRR15598773_chrM_bam.bam
filter the reference fasta file to have mito gene only GRCh38_chrM.fa
run maegatk with the following parameters,
- maegatk bcall -i 1-SRR15598773_chrM_bam.bam -o maegatk_outs1 -n SRR15598773-scrna -g GRCh38_chrM.fa -c 12 -bt CB -qc -ub UB -jm 60000m -mr 3 -z

However, there were some errors when splitting the bam files.

Traceback (most recent call last):
  File "/home/michael/.local/lib/python3.10/site-packages/maegatk/bin/python/split_barcoded_bam.py", line 64, in <module>
    with multi_file_manager(bambcfiles) as fopen:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/michael/.local/lib/python3.10/site-packages/maegatk/bin/python/split_barcoded_bam.py", line 56, in multi_file_manager
    files = [pysam.AlignmentFile(file, "wb", template = temp) for file in files]
  File "/home/michael/.local/lib/python3.10/site-packages/maegatk/bin/python/split_barcoded_bam.py", line 56, in <listcomp>
    files = [pysam.AlignmentFile(file, "wb", template = temp) for file in files]
  File "pysam/libcalignmentfile.pyx", line 748, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 921, in pysam.libcalignmentfile.AlignmentFile._open
OSError: [Errno 24] could not open alignment file `/home/michael/maegatk_outs/temp/barcoded_bams/TTACTGTTCGGAGTGA-1.bam`: Too many open files
[E::hts_open_format] Failed to open file "/home/michael/maegatk_outs/temp/barcoded_bams/GCCAACGAGGTAGACC-1.bam" : Too many open files

I've already tried to increase the open file limit in the system to 1048576 but still didn't work. Any suggestions on how to filter the cell barcodes or any ways to solve the issues would be appreciated.

Michael

petervangalen commented 5 months ago

I've not encountered this error before.

Are you able to successfully run the test data?
You might try filtering for high-quality cell barcodes detected in the scRNA-seq data.

kotai-michael commented 5 months ago

@petervangalen Thanks for the suggestion. I tried to run the test data and it seemed to be finished successfully but I couldn't see the final rds output. There seems some errors in the Snakemake scripts. Here is the logs.

noranekonobokkusu commented 5 months ago

Hi @kotai-michael, Can you go through your output files as described here and describe what files were generated? Also, can you provide the log file for Scatter from the logs/ folder? It might contain more informative messages if it crashed at the Scatter step.

kotai-michael commented 5 months ago

@noranekonobokkusu Thanks for the reply.

These are the created folders. Snakemake logs also attached. Unfortunately I couldn't see much info in the snakemake logs neither.

tests/test_maester/fasta:

tests/test_maester/final:
barcodeQuants.tsv  chrM_refAllele.txt  maegatk.depthTable.txt  passingBarcodes.tsv

tests/test_maester/logs:
base.maegatk.log  filterlogs  maegatk.parameters.txt  maegatk.snakemake_gather.log  maegatk.snakemake_scatter.log  maegatk.snakemake_scatter.stats  rmdupslogs

tests/test_maester/qc:
depth  quality

tests/test_maester/temp:
barcoded_bams  quality  ready_bam  scattered.allSamples.txt  sparse_matrices  temp_bam

tests/test_maester/qc/depth:
ACAATTAGCACT-1.depth.txt  CCCGTCGTGGTA-1.depth.txt  GACCGTGCATTT-1.depth.txt  GGGTTCGCCTCC-1.depth.txt  TGGTAGTGAACC-1.depth.txt
CCACAAAACATG-1.depth.txt  CTAACCCGGAAT-1.depth.txt  GGCGCTAATGAA-1.depth.txt  GTACCCACAGCC-1.depth.txt  TTCCTACGCAAT-1.depth.txt

tests/test_maester/qc/quality:

2024-06-06T110703.560838.snakemake.log 2024-06-06T110703.563637.snakemake.log

noranekonobokkusu commented 5 months ago

Can you re-run the test dataset from scratch, deleting the entire output folder?
After that, can you provide the content of folders inside tests/test_maester/temp, with their sizes? e.g. with a command
```
ls -lart *
```
My system usually generates snakemake.log files called by .Scatter and .Gather. I wonder why your system doesn't do that, these log files are not very informative indeed, but we can try to trace down the issue by exploring what intermediate files fail to generate.

kotai-michael commented 5 months ago

@noranekonobokkusu Thanks for the reply. I deleted the whole output folder and typed the command you suggested. This is the result.

temp_folder.txt

I was getting the snakemake log files from .snakemake/log folder. Those didn't have the .Scatter nor .Gather. I found that there were another log folder inside the output folder but the snakemake logs seemed to be emptied ? Maybe snakemake scripts didn't execute properly ?

tests/test_maester/logs$ ls -lah
total 56K
drwxrwxr-x 4 kotai kotai 4.0K  6月  6 14:02 .
drwxrwxr-x 8 kotai kotai 4.0K  6月  6 14:02 ..
-rw-rw-r-- 1 kotai kotai  440  6月  6 14:02 base.maegatk.log
drwxrwxr-x 2 kotai kotai 4.0K  6月  6 14:02 filterlogs
-rw-rw-r-- 1 kotai kotai  415  6月  6 14:02 maegatk.parameters.txt
-rw-rw-r-- 1 kotai kotai    0  6月  6 14:02 maegatk.snakemake_gather.log
-rw-rw-r-- 1 kotai kotai    0  6月  6 14:02 maegatk.snakemake_scatter.log
-rw-rw-r-- 1 kotai kotai  32K  6月  6 14:02 maegatk.snakemake_scatter.stats
drwxrwxr-x 2 kotai kotai 4.0K  6月  6 14:02 rmdupslogs

My Python version is 3.10.12 and R version is 4.1.2.

noranekonobokkusu commented 5 months ago

From the content of your temp/ folder, I can say the Scatter step finished successfully, you have all the expected files (can you confirm you have non-empty files in test_maester/qc/depth/?). It's either Gather that simply aggregates files together, or R, that's causing the issue. What's the content of the final/ folder? If you have non-empty .gz files but not the .rds file, it's likely related to some missing libraries in R or other R problems.

kotai-michael commented 5 months ago

@noranekonobokkusu Thank you for the reply. I was going to make a PR for some updates I did on the code, and realized that the YAML part is fixed in the master branch but just not the one from PyPI, the only remained is related to the snakemake commands.

I realized that changing from, snake_log_out = ' &> ' + snake_log to snake_log_out = ' > ' + snake_log + ' 2>&1' in cli.py solved my last issue on running the test data. With &>, scatter and gather jobs are started in parallel and failed to wait for the results from the other job and, in the end, there were no A.txt.gz, etc. in the final folder. Plus, the cleanup step seemed to be happened when the snakemake jobs were still running. The snakemake version is I installed is snakemake==7.32.4.

In summary,

YAML fix to use yaml.YAML() instead of yaml(), which is not updated in the PyPI version
pulp==2.7.0 is required for snakemake==7.32.4
the stdout and stderr redirect symbols for snakemake commands

As the test run is finished, I tried to go back to single cell datasets, but hit some walls again. I wonder if there are any suggestions on how the BAM files should be prepared ?

Right now, because it's 10x single cell data, I'm using Cell Ranger for producing the BAM files. However, I found that fgbio commands in oneSample_maegatk.py are expecting to have more info in the BAM files, like mate pairs, and needed to be added with some extra commands before producing temp1.bam. These are the commands I tried to use and looks promising right now.

# extra commands I added in oneSample_maegatk.py

# http://fulcrumgenomics.github.io/fgbio/tools/latest/
# 1.5) filter out sequences without CB or UB
prep_cmd = f"samtools view -h " + temp_bam0 + "| awk 'BEGIN {OFS=\"\t\"} /^@/ {print; next} {for(i=12;i<=NF;i++) {if ($i ~ /^CB:Z:/) cb=1; if ($i ~ /^UB:Z:/) ub=1;} if (cb && ub) print; cb=0; ub=0;}' | samtools view -b -o " + temp_bam0f
os.system(f'echo {prep_cmd}')
os.system(prep_cmd)
# sort by Queryname
prep_cmd = f"{fgio} SortBam -i {temp_bam0f} -o {temp_bam0s} --sort-order=Queryname"
os.system(f'echo {prep_cmd}')
os.system(prep_cmd)
# add mate info
prep_cmd = f"{fgio} SetMateInformation -i {temp_bam0s} -o {temp_bam0smp}"
os.system(f'echo {prep_cmd}')
os.system(prep_cmd)

# 2) Sort the filtered bam file
# continue from step 2

Any comments would be highly appreciated.

petervangalen commented 5 months ago

For the MAESTER paper analysis, I followed the orange boxes in Supplementary Figure 5 of the paper (https://doi.org/10.1038/s41587-022-01210-8). This does not involve Cellranger.

kotai-michael commented 5 months ago

@petervangalen Thank you for the reply. I saw there is a samtools merge for merging back with scRNA-seq data. Does that mean that the MAESTER fastq dataset used orange boxes and and scRNA-seq used black boxes on the left ? I wonder how was the quality control on the black boxes implemented ?

kotai-michael commented 4 months ago

In the end, I figured out all the problems and had the maegatk runs finished :)

Before I'm closing this issue, I realized that there are 2 issues, https://github.com/caleblareau/maegatk/issues/5 and https://github.com/caleblareau/maegatk/issues/7, that are also having environment setup problems, I guess I will summarize what I found during my debug process here and hopefully these will help people that come across the same issues,

if you couldn't see .coverage.txt in temp/sparse_matrices folder, possibly you are having some issues on the Scatter job. Things you would like to check,
- make sure you have these commands work in your environment,
  - python (not python3), bwa, samtools, java
- latest version of snakemake, 8.14.0, is not compatible with maegatk 0.2.0, you might want to downgrade to for example 7.32.4 and pulp==2.7.0, to work with this package
- make sure you are downloading the code from this repo master branch instead of pip
  - pip version has a yaml error, so download the master branch and install with pip install setup.py suggested from here (the other issue)
If you keep having some issues, remove .snakemake folder and your output folder and rerun from fresh again
- sometimes, when there were a failed snakemake jobs exited unexpectedly, snake jobs will not rerun the relevant jobs, due to some lock files inside the .snakemake folder, or halfly-produced output files being identified as finished
- as the BAM split part is very time consuming, you might want to keep a copy of temp/barcoded_bams folder somewhere and start your new rerun from this BAM split result
very importantly, make sure you could finish the test run with your environment
- this is to confirm if you have the correct environment setup
steps for reusing the BAM split outputs, (taking test run as an example)
- create an output folder in advance, mkdir -p tests/test_maester2/temp
- copy the barcoded_bams folder into it, cp -r tests/test_maester/temp/barcoded_bams tests/test_maester2/temp
- remove the .sankemake folder, rm -rf .snakemake
- rerun maegatk with no bam split parameter,
  - maegatk bcall -i tests/data/test_maester.bam -o tests/test_maester2 --skip-barcodesplit -z
if you use customized fasta reference file, you need to make sure a few things,
- oneSample_maegatk.py used the fasta reference with bwa, which requires to have your fasta being indexed first
- bwa index your_fasta.fa
- you will need to filter out the fasta file to have only the mitochondrial sequence, chrM or MT or any other names depending on where you downloaded the fasta file

Thank you very much again, to all the maintainers, for putting all the code into this Python package and keeping the code up-to-date :)

caleblareau / maegatk

"Too many open files" error when working with a scRNA-seq dataset #17