caleblareau / maegatk

Mitochondrial Alteration Enrichment and Genome Analysis Toolkit
MIT License
17 stars 2 forks source link

"Too many open files" error when working with a scRNA-seq dataset #17

Closed kotai-michael closed 4 months ago

kotai-michael commented 5 months ago

Hello.

I tried to reproduce the MAESTER publication but I hit some issues on having too many open file limit on a scRNA-seq dataset.

My steps are,

However, there were some errors when splitting the bam files.

Traceback (most recent call last):
  File "/home/michael/.local/lib/python3.10/site-packages/maegatk/bin/python/split_barcoded_bam.py", line 64, in <module>
    with multi_file_manager(bambcfiles) as fopen:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/michael/.local/lib/python3.10/site-packages/maegatk/bin/python/split_barcoded_bam.py", line 56, in multi_file_manager
    files = [pysam.AlignmentFile(file, "wb", template = temp) for file in files]
  File "/home/michael/.local/lib/python3.10/site-packages/maegatk/bin/python/split_barcoded_bam.py", line 56, in <listcomp>
    files = [pysam.AlignmentFile(file, "wb", template = temp) for file in files]
  File "pysam/libcalignmentfile.pyx", line 748, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 921, in pysam.libcalignmentfile.AlignmentFile._open
OSError: [Errno 24] could not open alignment file `/home/michael/maegatk_outs/temp/barcoded_bams/TTACTGTTCGGAGTGA-1.bam`: Too many open files
[E::hts_open_format] Failed to open file "/home/michael/maegatk_outs/temp/barcoded_bams/GCCAACGAGGTAGACC-1.bam" : Too many open files

I've already tried to increase the open file limit in the system to 1048576 but still didn't work. Any suggestions on how to filter the cell barcodes or any ways to solve the issues would be appreciated.

Michael

petervangalen commented 5 months ago

I've not encountered this error before.

  1. Are you able to successfully run the test data?
  2. You might try filtering for high-quality cell barcodes detected in the scRNA-seq data.
kotai-michael commented 5 months ago

@petervangalen Thanks for the suggestion. I tried to run the test data and it seemed to be finished successfully but I couldn't see the final rds output. There seems some errors in the Snakemake scripts. Here is the logs.

noranekonobokkusu commented 5 months ago

Hi @kotai-michael, Can you go through your output files as described here and describe what files were generated? Also, can you provide the log file for Scatter from the logs/ folder? It might contain more informative messages if it crashed at the Scatter step.

kotai-michael commented 5 months ago

@noranekonobokkusu Thanks for the reply.

These are the created folders. Snakemake logs also attached. Unfortunately I couldn't see much info in the snakemake logs neither.

tests/test_maester/fasta:

tests/test_maester/final:
barcodeQuants.tsv  chrM_refAllele.txt  maegatk.depthTable.txt  passingBarcodes.tsv

tests/test_maester/logs:
base.maegatk.log  filterlogs  maegatk.parameters.txt  maegatk.snakemake_gather.log  maegatk.snakemake_scatter.log  maegatk.snakemake_scatter.stats  rmdupslogs

tests/test_maester/qc:
depth  quality

tests/test_maester/temp:
barcoded_bams  quality  ready_bam  scattered.allSamples.txt  sparse_matrices  temp_bam

tests/test_maester/qc/depth:
ACAATTAGCACT-1.depth.txt  CCCGTCGTGGTA-1.depth.txt  GACCGTGCATTT-1.depth.txt  GGGTTCGCCTCC-1.depth.txt  TGGTAGTGAACC-1.depth.txt
CCACAAAACATG-1.depth.txt  CTAACCCGGAAT-1.depth.txt  GGCGCTAATGAA-1.depth.txt  GTACCCACAGCC-1.depth.txt  TTCCTACGCAAT-1.depth.txt

tests/test_maester/qc/quality:

2024-06-06T110703.560838.snakemake.log 2024-06-06T110703.563637.snakemake.log

noranekonobokkusu commented 5 months ago
  1. Can you re-run the test dataset from scratch, deleting the entire output folder?
  2. After that, can you provide the content of folders inside tests/test_maester/temp, with their sizes? e.g. with a command
    ls -lart *
    My system usually generates snakemake.log files called by .Scatter and .Gather. I wonder why your system doesn't do that, these log files are not very informative indeed, but we can try to trace down the issue by exploring what intermediate files fail to generate.
kotai-michael commented 5 months ago

@noranekonobokkusu Thanks for the reply. I deleted the whole output folder and typed the command you suggested. This is the result.

temp_folder.txt

I was getting the snakemake log files from .snakemake/log folder. Those didn't have the .Scatter nor .Gather. I found that there were another log folder inside the output folder but the snakemake logs seemed to be emptied ? Maybe snakemake scripts didn't execute properly ?

tests/test_maester/logs$ ls -lah
total 56K
drwxrwxr-x 4 kotai kotai 4.0K  6月  6 14:02 .
drwxrwxr-x 8 kotai kotai 4.0K  6月  6 14:02 ..
-rw-rw-r-- 1 kotai kotai  440  6月  6 14:02 base.maegatk.log
drwxrwxr-x 2 kotai kotai 4.0K  6月  6 14:02 filterlogs
-rw-rw-r-- 1 kotai kotai  415  6月  6 14:02 maegatk.parameters.txt
-rw-rw-r-- 1 kotai kotai    0  6月  6 14:02 maegatk.snakemake_gather.log
-rw-rw-r-- 1 kotai kotai    0  6月  6 14:02 maegatk.snakemake_scatter.log
-rw-rw-r-- 1 kotai kotai  32K  6月  6 14:02 maegatk.snakemake_scatter.stats
drwxrwxr-x 2 kotai kotai 4.0K  6月  6 14:02 rmdupslogs

My Python version is 3.10.12 and R version is 4.1.2.

noranekonobokkusu commented 5 months ago

From the content of your temp/ folder, I can say the Scatter step finished successfully, you have all the expected files (can you confirm you have non-empty files in test_maester/qc/depth/?). It's either Gather that simply aggregates files together, or R, that's causing the issue. What's the content of the final/ folder? If you have non-empty .gz files but not the .rds file, it's likely related to some missing libraries in R or other R problems.

kotai-michael commented 5 months ago

@noranekonobokkusu Thank you for the reply. I was going to make a PR for some updates I did on the code, and realized that the YAML part is fixed in the master branch but just not the one from PyPI, the only remained is related to the snakemake commands.

I realized that changing from, snake_log_out = ' &> ' + snake_log to snake_log_out = ' > ' + snake_log + ' 2>&1' in cli.py solved my last issue on running the test data. With &>, scatter and gather jobs are started in parallel and failed to wait for the results from the other job and, in the end, there were no A.txt.gz, etc. in the final folder. Plus, the cleanup step seemed to be happened when the snakemake jobs were still running. The snakemake version is I installed is snakemake==7.32.4.

In summary,

As the test run is finished, I tried to go back to single cell datasets, but hit some walls again. I wonder if there are any suggestions on how the BAM files should be prepared ?

Right now, because it's 10x single cell data, I'm using Cell Ranger for producing the BAM files. However, I found that fgbio commands in oneSample_maegatk.py are expecting to have more info in the BAM files, like mate pairs, and needed to be added with some extra commands before producing temp1.bam. These are the commands I tried to use and looks promising right now.

# extra commands I added in oneSample_maegatk.py

# http://fulcrumgenomics.github.io/fgbio/tools/latest/
# 1.5) filter out sequences without CB or UB
prep_cmd = f"samtools view -h " + temp_bam0 + "| awk 'BEGIN {OFS=\"\t\"} /^@/ {print; next} {for(i=12;i<=NF;i++) {if ($i ~ /^CB:Z:/) cb=1; if ($i ~ /^UB:Z:/) ub=1;} if (cb && ub) print; cb=0; ub=0;}' | samtools view -b -o " + temp_bam0f
os.system(f'echo {prep_cmd}')
os.system(prep_cmd)
# sort by Queryname
prep_cmd = f"{fgio} SortBam -i {temp_bam0f} -o {temp_bam0s} --sort-order=Queryname"
os.system(f'echo {prep_cmd}')
os.system(prep_cmd)
# add mate info
prep_cmd = f"{fgio} SetMateInformation -i {temp_bam0s} -o {temp_bam0smp}"
os.system(f'echo {prep_cmd}')
os.system(prep_cmd)

# 2) Sort the filtered bam file
# continue from step 2

Any comments would be highly appreciated.

petervangalen commented 5 months ago

For the MAESTER paper analysis, I followed the orange boxes in Supplementary Figure 5 of the paper (https://doi.org/10.1038/s41587-022-01210-8). This does not involve Cellranger.

kotai-michael commented 5 months ago

@petervangalen Thank you for the reply. I saw there is a samtools merge for merging back with scRNA-seq data. Does that mean that the MAESTER fastq dataset used orange boxes and and scRNA-seq used black boxes on the left ? I wonder how was the quality control on the black boxes implemented ?

kotai-michael commented 4 months ago

In the end, I figured out all the problems and had the maegatk runs finished :)

Before I'm closing this issue, I realized that there are 2 issues, https://github.com/caleblareau/maegatk/issues/5 and https://github.com/caleblareau/maegatk/issues/7, that are also having environment setup problems, I guess I will summarize what I found during my debug process here and hopefully these will help people that come across the same issues,

Thank you very much again, to all the maintainers, for putting all the code into this Python package and keeping the code up-to-date :)