Closed kotai-michael closed 4 months ago
I've not encountered this error before.
@petervangalen Thanks for the suggestion. I tried to run the test data and it seemed to be finished successfully but I couldn't see the final rds output. There seems some errors in the Snakemake scripts. Here is the logs.
Hi @kotai-michael, Can you go through your output files as described here and describe what files were generated? Also, can you provide the log file for Scatter from the logs/ folder? It might contain more informative messages if it crashed at the Scatter step.
@noranekonobokkusu Thanks for the reply.
These are the created folders. Snakemake logs also attached. Unfortunately I couldn't see much info in the snakemake logs neither.
tests/test_maester/fasta:
tests/test_maester/final:
barcodeQuants.tsv chrM_refAllele.txt maegatk.depthTable.txt passingBarcodes.tsv
tests/test_maester/logs:
base.maegatk.log filterlogs maegatk.parameters.txt maegatk.snakemake_gather.log maegatk.snakemake_scatter.log maegatk.snakemake_scatter.stats rmdupslogs
tests/test_maester/qc:
depth quality
tests/test_maester/temp:
barcoded_bams quality ready_bam scattered.allSamples.txt sparse_matrices temp_bam
tests/test_maester/qc/depth:
ACAATTAGCACT-1.depth.txt CCCGTCGTGGTA-1.depth.txt GACCGTGCATTT-1.depth.txt GGGTTCGCCTCC-1.depth.txt TGGTAGTGAACC-1.depth.txt
CCACAAAACATG-1.depth.txt CTAACCCGGAAT-1.depth.txt GGCGCTAATGAA-1.depth.txt GTACCCACAGCC-1.depth.txt TTCCTACGCAAT-1.depth.txt
tests/test_maester/qc/quality:
2024-06-06T110703.560838.snakemake.log 2024-06-06T110703.563637.snakemake.log
ls -lart *My system usually generates snakemake.log files called by .Scatter and .Gather. I wonder why your system doesn't do that, these log files are not very informative indeed, but we can try to trace down the issue by exploring what intermediate files fail to generate.
@noranekonobokkusu Thanks for the reply. I deleted the whole output folder and typed the command you suggested. This is the result.
I was getting the snakemake log files from .snakemake/log
folder. Those didn't have the .Scatter
nor .Gather
. I found that there were another log folder inside the output folder but the snakemake logs seemed to be emptied ? Maybe snakemake scripts didn't execute properly ?
tests/test_maester/logs$ ls -lah
total 56K
drwxrwxr-x 4 kotai kotai 4.0K 6月 6 14:02 .
drwxrwxr-x 8 kotai kotai 4.0K 6月 6 14:02 ..
-rw-rw-r-- 1 kotai kotai 440 6月 6 14:02 base.maegatk.log
drwxrwxr-x 2 kotai kotai 4.0K 6月 6 14:02 filterlogs
-rw-rw-r-- 1 kotai kotai 415 6月 6 14:02 maegatk.parameters.txt
-rw-rw-r-- 1 kotai kotai 0 6月 6 14:02 maegatk.snakemake_gather.log
-rw-rw-r-- 1 kotai kotai 0 6月 6 14:02 maegatk.snakemake_scatter.log
-rw-rw-r-- 1 kotai kotai 32K 6月 6 14:02 maegatk.snakemake_scatter.stats
drwxrwxr-x 2 kotai kotai 4.0K 6月 6 14:02 rmdupslogs
My Python version is 3.10.12 and R version is 4.1.2.
From the content of your temp/ folder, I can say the Scatter step finished successfully, you have all the expected files (can you confirm you have non-empty files in test_maester/qc/depth/?). It's either Gather that simply aggregates files together, or R, that's causing the issue. What's the content of the final/ folder? If you have non-empty .gz files but not the .rds file, it's likely related to some missing libraries in R or other R problems.
@noranekonobokkusu Thank you for the reply. I was going to make a PR for some updates I did on the code, and realized that the YAML part is fixed in the master branch but just not the one from PyPI, the only remained is related to the snakemake commands.
I realized that changing from,
snake_log_out = ' &> ' + snake_log
to
snake_log_out = ' > ' + snake_log + ' 2>&1'
in cli.py
solved my last issue on running the test data. With &>
, scatter and gather jobs are started in parallel and failed to wait for the results from the other job and, in the end, there were no A.txt.gz
, etc. in the final folder. Plus, the cleanup step seemed to be happened when the snakemake jobs were still running.
The snakemake version is I installed is snakemake==7.32.4
.
In summary,
As the test run is finished, I tried to go back to single cell datasets, but hit some walls again. I wonder if there are any suggestions on how the BAM files should be prepared ?
Right now, because it's 10x single cell data, I'm using Cell Ranger for producing the BAM files. However, I found that fgbio commands in oneSample_maegatk.py
are expecting to have more info in the BAM files, like mate pairs, and needed to be added with some extra commands before producing temp1.bam. These are the commands I tried to use and looks promising right now.
# extra commands I added in oneSample_maegatk.py
# http://fulcrumgenomics.github.io/fgbio/tools/latest/
# 1.5) filter out sequences without CB or UB
prep_cmd = f"samtools view -h " + temp_bam0 + "| awk 'BEGIN {OFS=\"\t\"} /^@/ {print; next} {for(i=12;i<=NF;i++) {if ($i ~ /^CB:Z:/) cb=1; if ($i ~ /^UB:Z:/) ub=1;} if (cb && ub) print; cb=0; ub=0;}' | samtools view -b -o " + temp_bam0f
os.system(f'echo {prep_cmd}')
os.system(prep_cmd)
# sort by Queryname
prep_cmd = f"{fgio} SortBam -i {temp_bam0f} -o {temp_bam0s} --sort-order=Queryname"
os.system(f'echo {prep_cmd}')
os.system(prep_cmd)
# add mate info
prep_cmd = f"{fgio} SetMateInformation -i {temp_bam0s} -o {temp_bam0smp}"
os.system(f'echo {prep_cmd}')
os.system(prep_cmd)
# 2) Sort the filtered bam file
# continue from step 2
Any comments would be highly appreciated.
For the MAESTER paper analysis, I followed the orange boxes in Supplementary Figure 5 of the paper (https://doi.org/10.1038/s41587-022-01210-8). This does not involve Cellranger.
@petervangalen Thank you for the reply.
I saw there is a samtools merge
for merging back with scRNA-seq data. Does that mean that the MAESTER fastq dataset used orange boxes and and scRNA-seq used black boxes on the left ? I wonder how was the quality control on the black boxes implemented ?
In the end, I figured out all the problems and had the maegatk runs finished :)
Before I'm closing this issue, I realized that there are 2 issues, https://github.com/caleblareau/maegatk/issues/5 and https://github.com/caleblareau/maegatk/issues/7, that are also having environment setup problems, I guess I will summarize what I found during my debug process here and hopefully these will help people that come across the same issues,
.coverage.txt
in temp/sparse_matrices
folder, possibly you are having some issues on the Scatter job. Things you would like to check,
python
(not python3
), bwa
, samtools
, java
7.32.4
and pulp==2.7.0
, to work with this packagepip install setup.py
suggested from here (the other issue).snakemake
folder and your output folder and rerun from fresh again
.snakemake
folder, or halfly-produced output files being identified as finishedtemp/barcoded_bams
folder somewhere and start your new rerun from this BAM split resultmkdir -p tests/test_maester2/temp
cp -r tests/test_maester/temp/barcoded_bams tests/test_maester2/temp
.sankemake
folder, rm -rf .snakemake
maegatk bcall -i tests/data/test_maester.bam -o tests/test_maester2 --skip-barcodesplit -z
bwa
, which requires to have your fasta being indexed firstbwa index your_fasta.fa
chrM
or MT
or any other names depending on where you downloaded the fasta fileThank you very much again, to all the maintainers, for putting all the code into this Python package and keeping the code up-to-date :)
Hello.
I tried to reproduce the MAESTER publication but I hit some issues on having too many open file limit on a scRNA-seq dataset.
My steps are,
1-SRR15598773_chrM_bam.bam
GRCh38_chrM.fa
maegatk bcall -i 1-SRR15598773_chrM_bam.bam -o maegatk_outs1 -n SRR15598773-scrna -g GRCh38_chrM.fa -c 12 -bt CB -qc -ub UB -jm 60000m -mr 3 -z
However, there were some errors when splitting the bam files.
I've already tried to increase the open file limit in the system to 1048576 but still didn't work. Any suggestions on how to filter the cell barcodes or any ways to solve the issues would be appreciated.
Michael