Closed dy-lin closed 1 year ago
I was at one point able to run mGATK without errors, but had the same situation in #59 where files were missing. I suspect something in my PATH environment changed, e.g. pip or conda packages got updated, or even a different python3 executable.
can you double check the barcodes files? mgatk only wants one item per line (corresponding to one barcode) but I think the multiome barcodes.tsv file has multiple entries (for both the ATAC and RNA barcode)?
The barcodes.tsv file looks like this. The multiome shares barcodes between RNA and ATAC. I've chosen to use a public multiome dataset (https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-3-k-1-standard-2-0-0) for this run, so you can attempt to reproduce the error as well.
$ head barcodes.tsv
AAACAGCCAAATATCC-1
AAACAGCCAGGAACTG-1
AAACAGCCAGGCTTCG-1
AAACCAACACCTGCTC-1
AAACCAACAGATTCAT-1
AAACCAACAGTTGCGT-1
AAACCAACATAACGGG-1
AAACCAACATAGACCC-1
AAACCGCGTGAGGTAG-1
AAACGCGCATACCCGG-1
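For anyone wanting to rule out the "one barcode per line" concern raised above, here is a minimal sketch of a check. The function name and the regular expression are my own assumptions (the pattern is only modelled on the 10x-style barcodes shown in the head output); this is not part of mgatk.

```python
import re
from collections import Counter

# Assumed pattern for 10x-style barcodes such as "AAACAGCCAAATATCC-1".
BARCODE_RE = re.compile(r"^[ACGT]+-\d+$")

def check_barcode_lines(lines):
    """Return (bad_lines, duplicates) for an iterable of barcode-file lines."""
    bad = []
    seen = Counter()
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        # Flag empty lines, lines with more than one field, and malformed tokens.
        if len(line.split()) != 1 or not BARCODE_RE.match(line):
            bad.append((lineno, line))
        else:
            seen[line] += 1
    duplicates = sorted(bc for bc, n in seen.items() if n > 1)
    return bad, duplicates
```

Passing an open file handle for barcodes.tsv as `lines` should return two empty lists if the file has exactly one well-formed, unique barcode per line.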
Ideally, I'd like to get the entire tool working, but if I could reproduce previous behaviour where the ACGT files and RDS would be correctly outputted but missing other files, that would work for the meantime too.
Here's an OLD run where the silent error persisted (found via snapshots), but files such as mgatk.depthTable.txt were written out successfully. Because this run was recovered from snapshots, I do not have the mgatk.snakemake_tenx.log file; it had been overwritten.
2022-04-21T132444.896479.snakemake.log
Here's a run where the silent error persisted today: 2022-05-20T125435.443790.snakemake.log and mgatk.snakemake_tenx.log
I managed to get a silent error today instead of the 'Improper folder specification', but after tinkering with my pip freeze to get an accurate list of ~/.local/bin, it stopped working and I am once again getting the "improper folder specification" error.
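Since the suspicion here is a changed PATH or a different python3, a small diagnostic along these lines might help compare a working and a non-working shell. It is only a sketch: `describe_install` is a hypothetical helper name, and it uses nothing beyond the standard library.

```python
import importlib.util
import shutil
import sys

def describe_install(package="mgatk"):
    """Report which interpreter is running and where a package and its CLI resolve."""
    spec = importlib.util.find_spec(package)
    return {
        "python": sys.executable,                      # interpreter actually in use
        "package_path": spec.origin if spec else None, # None if not importable here
        "cli_on_path": shutil.which(package),          # e.g. something under ~/.local/bin
    }

if __name__ == "__main__":
    for key, value in describe_install().items():
        print(f"{key}: {value}")
```

If the interpreter, the package path, and the command-line entry point point at different environments (say, a conda env versus ~/.local), that would be consistent with the mixed-installation theory, though it would not prove it.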
Hey Caleb,
If it helps, I'm getting the same 'Improper Folder Specification' error and have provided the snakemake log. I deleted the .snakemake folder before starting another run but am still running into the same issue. This was a DOGMA-seq run. Is it possible that the format of the ATAC BAM is slightly different from a standard scATAC run like you would get with ASAP-seq?
Here are some of my barcodes for reference as well:
head ~/cluster/home/pcallen/projects/dogma_seq_pilot/data/20220429_dogma_run/counts_output/counts_mtDNA_masked/sample_3/outs/filtered_feature_bc_matrix/barcodes.tsv
AAACAGCCAACAGCCT-1
AAACAGCCACATAACT-1
AAACAGCCACATTAAC-1
AAACAGCCAGAATGAC-1
AAACAGCCAGGCATCT-1
AAACAGCCATCGTTCT-1
AAACATGCACCATATG-1
AAACATGCAGTAAAGC-1
AAACATGCATGGCCCA-1
AAACCAACAAATTCGT-1
@PeterCAllen your error is a bit more tangible; I had to deal with this when we first had access to the multiome kit, and indeed there was an error in the release of CellRanger-ATAC / ARC v2 that we fixed in v0.6.3; can you double check your version?
@dy-lin I looked through your logs and couldn't see anything obvious other than that your number of slices seems to be much larger than mine; can you try with -c 8 or so to limit the slice number? Otherwise, since this is the standard multiome run, there are very low amounts of mtDNA, so there may be some error where cells with essentially 0 mitochondrial data are messing this up. Can you supply the -bc file with only the top 100 barcodes based on mitochondria perhaps?
Hey @caleblareau, thanks for the quick reply. My version of mgatk is v0.6.6. In case it helps – here's the base log in addition to the parameter file. base.mgatk.log parameters.txt
@caleblareau if it helps, the data was run with Cellranger ARC v2.0.1, and I'm using the post-CellRanger ARC BAMs from the 10X website (https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-3-k-1-standard-2-0-0).
I tried using -c 8 and still ran into the same error. I am concerned it has something to do with my installation (see #59), as it seems I am sometimes able to generate some files. Are you able to reproduce the error on a machine where you know the installation works?
Grabbing the top 100 barcodes:
samtools view pbmc_granulocyte_sorted_3k_atac_possorted_bam.bam "chrM" | grep -wf filtered_feature_bc_matrix/barcodes.tsv | grep -oE 'CB:[A-Z]:[ACGT]+-1' | cut -f3 -d: | sort | uniq -c | sort -nr | awk '{print $2}' | head -n 100 > barcodes.tsv
Still results in error: mgatk.snakemake_tenx.log
Is there any way to keep the temporary files from being deleted afterwards, so we can perhaps see which barcodes are causing issues?
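For reference, the barcode-ranking one-liner above can be sketched in Python as follows. It expects SAM text (for example the output of samtools view on chrM) and counts only the CB tag, which is a little stricter than grepping whole lines. The function name and the `-1` suffix in the pattern are assumptions taken from the barcodes in this thread.

```python
import re
from collections import Counter

# Matches the CB (cell barcode) tag as it appears in SAM text, e.g. "CB:Z:AAAC...-1".
CB_TAG_RE = re.compile(r"^CB:Z:([ACGT]+-1)$")

def top_barcodes(sam_lines, allowed, n=100):
    """Count reads per allowed barcode and return the n most frequent barcodes."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines, if any
        # Optional tags start at the 12th tab-separated column of a SAM record.
        for field in line.rstrip("\n").split("\t")[11:]:
            match = CB_TAG_RE.match(field)
            if match and match.group(1) in allowed:
                counts[match.group(1)] += 1
                break
    return [bc for bc, _ in counts.most_common(n)]
```

One way to use it would be to read the filtered barcodes.tsv into a set, pass `sys.stdin` as `sam_lines` with samtools view piped in, and write the returned list out one barcode per line.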
If you run with the -z flag, it keeps all of the temporary files; I’ll try to dig more into this later tonight / tomorrow.
I tried looking at the BAMs. I'm not sure which line in process_one_slice is throwing the error, but all the barcodes listed in the .txt are found in the input BAM.
<< 03:16 PM >> [ dlin@gphost03: barcode_files ] $ samtools view ../barcoded_bams/barcodes.1.bam | grep -oE 'CB:[A-Z]:[ACGT]+-1' | cut -f3 -d: | sort | uniq -c | sort -nr | awk '{print $2, $1}'
AACGCCCAGTTTGAGC-1 43896
TGAAACTGTGACATAT-1 32254
ATCCACCTCGTCAAGT-1 29902
TAGCTAGGTACGCGCA-1 27611
TGTGGCCAGCGAGTAA-1 25288
GGGTGAAGTCACAAAT-1 22872
CCACTTGGTGGTTCTT-1 22550
TTTCGTCCAGAGGGAG-1 20324
TCAATCGCATGCTTAG-1 19950
TTAAGGACAGGTTCAC-1 19618
ACCAGGGAGCGCTCAA-1 19399
TCGTAATCAGCCTAAC-1 18466
TACTGCACAAACTAAG-1 17567
I managed to find out why using --keep-duplicates allowed me to get further but still eventually hit an error: the Java heap size was too small. I had to increase it to -Xmx10000000. After adjusting that, I'm encountering this error:
Traceback (most recent call last):
  File "/projects/karsanlab/dlin_dev/MoHCC/KARSANBIO-3015_scATAC-seq_Pipeline/KARSANBIO-3034_Testing_mGATK/mgatk/mgatk/bin/python/sumstatsBPtenx.py", line 80, in <module>
    writeSparseMatrixLetter("A", 0)
  File "/projects/karsanlab/dlin_dev/MoHCC/KARSANBIO-3015_scATAC-seq_Pipeline/KARSANBIO-3034_Testing_mGATK/mgatk/mgatk/bin/python/sumstatsBPtenx.py", line 68, in writeSparseMatrixLetter
    with open(out_file_fn,"w") as file_handle_fn:
FileNotFoundError: [Errno 2] No such file or directory: '/projects/karsanlab/dlin_dev/MoHCC/KARSANBIO-3015_scATAC-seq_Pipeline/KARSANBIO-3034_Testing_mGATK/multiome/mgatk_out/temp/sparse_matrices/barcodes.1.A.txt'
@caleblareau After some investigation, I have managed to somewhat resolve my issue. I believe that some folders are not being created: /temp/sparse_matrices/ and /temp/qc/depth/
By adding these two lines after https://github.com/caleblareau/mgatk/blob/f0a27c1b57a27f3bf6af29c60bc89b35ea73c08b/mgatk/bin/snake/Snakefile.tenx#L75
if not os.path.exists(outdir + "/temp/sparse_matrices/"):
    os.mkdir(outdir + "/temp/sparse_matrices/")
and these two lines after https://github.com/caleblareau/mgatk/blob/f0a27c1b57a27f3bf6af29c60bc89b35ea73c08b/mgatk/bin/python/sumstatsBPtenx.py#L86
if not os.path.exists(os.path.dirname(out_file_depth)):
    os.mkdir(os.path.dirname(out_file_depth))
I was able to bypass these errors, and the files seem to be created correctly:
(mGATK) << 12:09 PM >> [ dlin@gphost03: multiome ] $ ls -lRh mgatk_out_fix/
mgatk_out_fix/:
total 20K
drwxrwsr-x 2 dlin karsanlab 4.0K May 30 11:41 fasta
drwxrwsr-x 2 dlin karsanlab 4.0K May 30 11:45 final
drwxrwsr-x 4 dlin karsanlab 4.0K May 30 11:44 logs
drwxrwsr-x 4 dlin karsanlab 4.0K May 30 11:41 qc
drwxrwsr-x 8 dlin karsanlab 4.0K May 30 11:41 temp
mgatk_out_fix/fasta:
total 20K
-rw-rw-r-- 1 dlin karsanlab 17K May 30 11:41 chrM.fasta
-rw-rw-r-- 1 dlin karsanlab 19 May 30 11:41 chrM.fasta.fai
mgatk_out_fix/final:
total 192M
-rw-rw-r-- 1 dlin karsanlab 119K May 30 11:41 chrM_refAllele.txt
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:43 mgatk.A.txt.gz
-rw-rw-r-- 1 dlin karsanlab 2.2K May 30 11:44 mgatk.cell_heteroplasmic_df.tsv.gz
-rw-rw-r-- 1 dlin karsanlab 36M May 30 11:43 mgatk.coverage.txt.gz
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:43 mgatk.C.txt.gz
-rw-rw-r-- 1 dlin karsanlab 64K May 30 11:42 mgatk.depthTable.txt
-rw-rw-r-- 1 dlin karsanlab 6.1M May 30 11:43 mgatk.G.txt.gz
-rw-rw-r-- 1 dlin karsanlab 44M May 30 11:45 mgatk.rds
-rw-rw-r-- 1 dlin karsanlab 69M May 30 11:45 mgatk.signac.rds
-rw-rw-r-- 1 dlin karsanlab 11M May 30 11:43 mgatk.T.txt.gz
-rw-rw-r-- 1 dlin karsanlab 145K May 30 11:44 mgatk.variant_stats.tsv.gz
-rw-rw-r-- 1 dlin karsanlab 28K May 30 11:44 mgatk.vmr_strand_plot.png
mgatk_out_fix/logs:
total 20M
-rw-rw-r-- 1 dlin karsanlab 373 May 30 11:45 base.mgatk.log
drwxrwsr-x 2 dlin karsanlab 4.0K May 30 11:42 filterlogs
-rw-rw-r-- 1 dlin karsanlab 558 May 30 11:41 mgatk.parameters.txt
-rw-rw-r-- 1 dlin karsanlab 20M May 30 11:44 mgatk.snakemake_tenx.log
-rw-rw-r-- 1 dlin karsanlab 23K May 30 11:44 mgatk.snakemake_tenx.stats
drwxrwsr-x 2 dlin karsanlab 4.0K May 30 11:42 rmdupslogs
mgatk_out_fix/logs/filterlogs:
total 0
-rw-rw-r-- 1 dlin karsanlab 24 May 30 11:42 barcodes.1.filter.log
-rw-rw-r-- 1 dlin karsanlab 24 May 30 11:42 barcodes.2.filter.log
-rw-rw-r-- 1 dlin karsanlab 24 May 30 11:42 barcodes.3.filter.log
-rw-rw-r-- 1 dlin karsanlab 24 May 30 11:42 barcodes.4.filter.log
-rw-rw-r-- 1 dlin karsanlab 24 May 30 11:42 barcodes.5.filter.log
-rw-rw-r-- 1 dlin karsanlab 24 May 30 11:42 barcodes.6.filter.log
-rw-rw-r-- 1 dlin karsanlab 24 May 30 11:42 barcodes.7.filter.log
-rw-rw-r-- 1 dlin karsanlab 24 May 30 11:42 barcodes.8.filter.log
mgatk_out_fix/logs/rmdupslogs:
total 32K
-rw-rw-r-- 1 dlin karsanlab 2.9K May 30 11:42 barcodes.1.rmdups.log
-rw-rw-r-- 1 dlin karsanlab 2.9K May 30 11:42 barcodes.2.rmdups.log
-rw-rw-r-- 1 dlin karsanlab 2.9K May 30 11:42 barcodes.3.rmdups.log
-rw-rw-r-- 1 dlin karsanlab 2.9K May 30 11:42 barcodes.4.rmdups.log
-rw-rw-r-- 1 dlin karsanlab 2.9K May 30 11:42 barcodes.5.rmdups.log
-rw-rw-r-- 1 dlin karsanlab 2.9K May 30 11:42 barcodes.6.rmdups.log
-rw-rw-r-- 1 dlin karsanlab 2.8K May 30 11:42 barcodes.7.rmdups.log
-rw-rw-r-- 1 dlin karsanlab 2.9K May 30 11:42 barcodes.8.rmdups.log
mgatk_out_fix/qc:
total 8.0K
drwxrwsr-x 2 dlin karsanlab 4.0K May 30 11:42 depth
drwxrwsr-x 2 dlin karsanlab 4.0K May 30 11:41 quality
mgatk_out_fix/qc/depth:
total 64K
-rw-rw-r-- 1 dlin karsanlab 8.0K May 30 11:42 barcodes.1.depth.txt
-rw-rw-r-- 1 dlin karsanlab 8.0K May 30 11:42 barcodes.2.depth.txt
-rw-rw-r-- 1 dlin karsanlab 8.0K May 30 11:42 barcodes.3.depth.txt
-rw-rw-r-- 1 dlin karsanlab 8.0K May 30 11:42 barcodes.4.depth.txt
-rw-rw-r-- 1 dlin karsanlab 8.0K May 30 11:42 barcodes.5.depth.txt
-rw-rw-r-- 1 dlin karsanlab 8.0K May 30 11:42 barcodes.6.depth.txt
-rw-rw-r-- 1 dlin karsanlab 8.0K May 30 11:42 barcodes.7.depth.txt
-rw-rw-r-- 1 dlin karsanlab 7.9K May 30 11:42 barcodes.8.depth.txt
mgatk_out_fix/qc/quality:
total 0
mgatk_out_fix/temp:
total 24K
drwxrwsr-x 2 dlin karsanlab 4.0K May 30 11:41 barcoded_bams
drwxrwsr-x 2 dlin karsanlab 4.0K May 30 11:41 barcode_files
drwxrwsr-x 2 dlin karsanlab 4.0K May 30 11:41 quality
drwxrwsr-x 2 dlin karsanlab 4.0K May 30 11:42 ready_bam
drwxrwsr-x 2 dlin karsanlab 4.0K May 30 11:42 sparse_matrices
drwxrwsr-x 2 dlin karsanlab 4.0K May 30 11:42 temp_bam
mgatk_out_fix/temp/barcoded_bams:
total 108M
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:41 barcodes.1.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:41 barcodes.1.bam.bai
-rw-rw-r-- 1 dlin karsanlab 15M May 30 11:41 barcodes.2.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:41 barcodes.2.bam.bai
-rw-rw-r-- 1 dlin karsanlab 12M May 30 11:41 barcodes.3.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:41 barcodes.3.bam.bai
-rw-rw-r-- 1 dlin karsanlab 11M May 30 11:41 barcodes.4.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:41 barcodes.4.bam.bai
-rw-rw-r-- 1 dlin karsanlab 13M May 30 11:41 barcodes.5.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:41 barcodes.5.bam.bai
-rw-rw-r-- 1 dlin karsanlab 17M May 30 11:41 barcodes.6.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:41 barcodes.6.bam.bai
-rw-rw-r-- 1 dlin karsanlab 16M May 30 11:41 barcodes.7.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:41 barcodes.7.bam.bai
-rw-rw-r-- 1 dlin karsanlab 13M May 30 11:41 barcodes.8.bam
-rw-rw-r-- 1 dlin karsanlab 1.7K May 30 11:41 barcodes.8.bam.bai
mgatk_out_fix/temp/barcode_files:
total 64K
-rw-rw-r-- 1 dlin karsanlab 6.3K May 30 11:41 barcodes.1.txt
-rw-rw-r-- 1 dlin karsanlab 6.3K May 30 11:41 barcodes.2.txt
-rw-rw-r-- 1 dlin karsanlab 6.3K May 30 11:41 barcodes.3.txt
-rw-rw-r-- 1 dlin karsanlab 6.3K May 30 11:41 barcodes.4.txt
-rw-rw-r-- 1 dlin karsanlab 6.3K May 30 11:41 barcodes.5.txt
-rw-rw-r-- 1 dlin karsanlab 6.3K May 30 11:41 barcodes.6.txt
-rw-rw-r-- 1 dlin karsanlab 6.3K May 30 11:41 barcodes.7.txt
-rw-rw-r-- 1 dlin karsanlab 6.3K May 30 11:41 barcodes.8.txt
mgatk_out_fix/temp/quality:
total 0
mgatk_out_fix/temp/ready_bam:
total 80M
-rw-rw-r-- 1 dlin karsanlab 9.9M May 30 11:42 barcodes.1.qc.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:42 barcodes.1.qc.bam.bai
-rw-rw-r-- 1 dlin karsanlab 11M May 30 11:42 barcodes.2.qc.bam
-rw-rw-r-- 1 dlin karsanlab 1.7K May 30 11:42 barcodes.2.qc.bam.bai
-rw-rw-r-- 1 dlin karsanlab 8.9M May 30 11:42 barcodes.3.qc.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:42 barcodes.3.qc.bam.bai
-rw-rw-r-- 1 dlin karsanlab 8.1M May 30 11:42 barcodes.4.qc.bam
-rw-rw-r-- 1 dlin karsanlab 1.7K May 30 11:42 barcodes.4.qc.bam.bai
-rw-rw-r-- 1 dlin karsanlab 9.7M May 30 11:42 barcodes.5.qc.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:42 barcodes.5.qc.bam.bai
-rw-rw-r-- 1 dlin karsanlab 12M May 30 11:42 barcodes.6.qc.bam
-rw-rw-r-- 1 dlin karsanlab 1.7K May 30 11:42 barcodes.6.qc.bam.bai
-rw-rw-r-- 1 dlin karsanlab 12M May 30 11:42 barcodes.7.qc.bam
-rw-rw-r-- 1 dlin karsanlab 1.7K May 30 11:42 barcodes.7.qc.bam.bai
-rw-rw-r-- 1 dlin karsanlab 9.1M May 30 11:42 barcodes.8.qc.bam
-rw-rw-r-- 1 dlin karsanlab 1.7K May 30 11:42 barcodes.8.qc.bam.bai
mgatk_out_fix/temp/sparse_matrices:
total 657M
-rw-rw-r-- 1 dlin karsanlab 12M May 30 11:42 barcodes.1.A.txt
-rw-rw-r-- 1 dlin karsanlab 36M May 30 11:42 barcodes.1.coverage.txt
-rw-rw-r-- 1 dlin karsanlab 13M May 30 11:42 barcodes.1.C.txt
-rw-rw-r-- 1 dlin karsanlab 5.1M May 30 11:42 barcodes.1.G.txt
-rw-rw-r-- 1 dlin karsanlab 9.5M May 30 11:42 barcodes.1.T.txt
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:42 barcodes.2.A.txt
-rw-rw-r-- 1 dlin karsanlab 40M May 30 11:42 barcodes.2.coverage.txt
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:42 barcodes.2.C.txt
-rw-rw-r-- 1 dlin karsanlab 5.7M May 30 11:42 barcodes.2.G.txt
-rw-rw-r-- 1 dlin karsanlab 11M May 30 11:42 barcodes.2.T.txt
-rw-rw-r-- 1 dlin karsanlab 13M May 30 11:42 barcodes.3.A.txt
-rw-rw-r-- 1 dlin karsanlab 39M May 30 11:42 barcodes.3.coverage.txt
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:42 barcodes.3.C.txt
-rw-rw-r-- 1 dlin karsanlab 5.5M May 30 11:42 barcodes.3.G.txt
-rw-rw-r-- 1 dlin karsanlab 11M May 30 11:42 barcodes.3.T.txt
-rw-rw-r-- 1 dlin karsanlab 13M May 30 11:42 barcodes.4.A.txt
-rw-rw-r-- 1 dlin karsanlab 37M May 30 11:42 barcodes.4.coverage.txt
-rw-rw-r-- 1 dlin karsanlab 13M May 30 11:42 barcodes.4.C.txt
-rw-rw-r-- 1 dlin karsanlab 5.2M May 30 11:42 barcodes.4.G.txt
-rw-rw-r-- 1 dlin karsanlab 9.7M May 30 11:42 barcodes.4.T.txt
-rw-rw-r-- 1 dlin karsanlab 13M May 30 11:42 barcodes.5.A.txt
-rw-rw-r-- 1 dlin karsanlab 40M May 30 11:42 barcodes.5.coverage.txt
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:42 barcodes.5.C.txt
-rw-rw-r-- 1 dlin karsanlab 5.6M May 30 11:42 barcodes.5.G.txt
-rw-rw-r-- 1 dlin karsanlab 11M May 30 11:42 barcodes.5.T.txt
-rw-rw-r-- 1 dlin karsanlab 15M May 30 11:42 barcodes.6.A.txt
-rw-rw-r-- 1 dlin karsanlab 45M May 30 11:42 barcodes.6.coverage.txt
-rw-rw-r-- 1 dlin karsanlab 16M May 30 11:42 barcodes.6.C.txt
-rw-rw-r-- 1 dlin karsanlab 6.3M May 30 11:42 barcodes.6.G.txt
-rw-rw-r-- 1 dlin karsanlab 12M May 30 11:42 barcodes.6.T.txt
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:42 barcodes.7.A.txt
-rw-rw-r-- 1 dlin karsanlab 40M May 30 11:42 barcodes.7.coverage.txt
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:42 barcodes.7.C.txt
-rw-rw-r-- 1 dlin karsanlab 5.7M May 30 11:42 barcodes.7.G.txt
-rw-rw-r-- 1 dlin karsanlab 11M May 30 11:42 barcodes.7.T.txt
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:42 barcodes.8.A.txt
-rw-rw-r-- 1 dlin karsanlab 41M May 30 11:42 barcodes.8.coverage.txt
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:42 barcodes.8.C.txt
-rw-rw-r-- 1 dlin karsanlab 5.8M May 30 11:42 barcodes.8.G.txt
-rw-rw-r-- 1 dlin karsanlab 11M May 30 11:42 barcodes.8.T.txt
mgatk_out_fix/temp/temp_bam:
total 217M
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:42 barcodes.1.temp0.bam
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:42 barcodes.1.temp1.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:42 barcodes.1.temp1.bam.bai
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:42 barcodes.2.temp0.bam
-rw-rw-r-- 1 dlin karsanlab 15M May 30 11:42 barcodes.2.temp1.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:42 barcodes.2.temp1.bam.bai
-rw-rw-r-- 1 dlin karsanlab 12M May 30 11:42 barcodes.3.temp0.bam
-rw-rw-r-- 1 dlin karsanlab 13M May 30 11:42 barcodes.3.temp1.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:42 barcodes.3.temp1.bam.bai
-rw-rw-r-- 1 dlin karsanlab 11M May 30 11:42 barcodes.4.temp0.bam
-rw-rw-r-- 1 dlin karsanlab 11M May 30 11:42 barcodes.4.temp1.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:42 barcodes.4.temp1.bam.bai
-rw-rw-r-- 1 dlin karsanlab 13M May 30 11:42 barcodes.5.temp0.bam
-rw-rw-r-- 1 dlin karsanlab 14M May 30 11:42 barcodes.5.temp1.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:42 barcodes.5.temp1.bam.bai
-rw-rw-r-- 1 dlin karsanlab 17M May 30 11:42 barcodes.6.temp0.bam
-rw-rw-r-- 1 dlin karsanlab 17M May 30 11:42 barcodes.6.temp1.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:42 barcodes.6.temp1.bam.bai
-rw-rw-r-- 1 dlin karsanlab 16M May 30 11:42 barcodes.7.temp0.bam
-rw-rw-r-- 1 dlin karsanlab 16M May 30 11:42 barcodes.7.temp1.bam
-rw-rw-r-- 1 dlin karsanlab 1.8K May 30 11:42 barcodes.7.temp1.bam.bai
-rw-rw-r-- 1 dlin karsanlab 13M May 30 11:42 barcodes.8.temp0.bam
-rw-rw-r-- 1 dlin karsanlab 13M May 30 11:42 barcodes.8.temp1.bam
-rw-rw-r-- 1 dlin karsanlab 1.7K May 30 11:42 barcodes.8.temp1.bam.bai
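A slightly more defensive variant of the directory-creation workaround described above could look like the sketch below. This is not the code in mgatk; `open_for_write` is a hypothetical helper, and the idea is simply to let os.makedirs handle both missing parents and folders that already exist.

```python
import os

def open_for_write(path):
    """Open path for writing, creating any missing parent directories first."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok avoids an error when another slice already created the folder;
        # makedirs (unlike mkdir) also creates missing intermediate directories.
        os.makedirs(parent, exist_ok=True)
    return open(path, "w")
```

Compared with an os.path.exists check followed by os.mkdir, this avoids the window in which two parallel slices both see the folder as missing and one of them then fails on creation.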
The issue that remains is that the snakemake logfile still seems to think something is wrong, and thinks the output files aren't being produced when they are (see the ls -lhR above):
mgatk.snakemake_tenx.log
[Mon May 30 11:41:55 2022]
rule process_one_slice:
input: mgatk_out_fix/.internal/samples/barcodes.2.bam.txt
output: mgatk_out_fix/qc/depth/barcodes.2.depth.txt, mgatk_out_fix/temp/sparse_matrices/barcodes.2.A.txt, mgatk_out_fix/temp/sparse_matrices/barcodes.2.C.txt, mgatk_out_fix/temp/sparse_matrices/barcodes.2.G.txt, mgatk_out_fix/temp/sparse_matrices/barcodes.2.T.txt, mgatk_out_fix/temp/sparse_matrices/barcodes.2.coverage.txt
jobid: 3
reason: Missing output files: mgatk_out_fix/temp/sparse_matrices/barcodes.2.G.txt, mgatk_out_fix/qc/depth/barcodes.2.depth.txt, mgatk_out_fix/temp/sparse_matrices/barcodes.2.coverage.txt, mgatk_out_fix/temp/sparse_matrices/barcodes.2.A.txt, mgatk_out_fix/temp/sparse_matrices/barcodes.2.C.txt, mgatk_out_fix/temp/sparse_matrices/barcodes.2.T.txt
wildcards: sample=barcodes.2
resources: tmpdir=/var/tmp
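As far as I understand Snakemake's log format, the reason: line is written when a job is scheduled and explains why it has to run, so "Missing output files" there may just mean the files did not exist yet at that point. To check after a run whether a slice really produced everything, something like this sketch could be used. The path layout is copied from the log entry above; both function names are hypothetical.

```python
import os

def expected_slice_outputs(outdir, sample):
    """List the per-slice outputs named in the process_one_slice log entry."""
    paths = [os.path.join(outdir, "qc", "depth", f"{sample}.depth.txt")]
    for suffix in ("A", "C", "G", "T", "coverage"):
        paths.append(
            os.path.join(outdir, "temp", "sparse_matrices", f"{sample}.{suffix}.txt")
        )
    return paths

def missing_or_empty(paths):
    """Return the paths that do not exist as files or have zero size."""
    return [p for p in paths if not os.path.isfile(p) or os.path.getsize(p) == 0]
```

For example, `missing_or_empty(expected_slice_outputs("mgatk_out_fix", "barcodes.2"))` returning an empty list would suggest the slice completed, at least as far as file presence goes. The temp/ files are only there to check if the run used -z.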
And again, the 'error messages' from #59 also persist here:
/projects/karsanlab/dlin_dev/software/.conda/envs/mGATK/lib/python3.9/site-packages/mgatk/bin/python/variant_calling.py:38: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
rev_base_df[missing_pos] = 0
/projects/karsanlab/dlin_dev/software/.conda/envs/mGATK/lib/python3.9/site-packages/mgatk/bin/python/variant_calling.py:159: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
'mean_coverage', 'max_heteroplasmy']].astype(np.float)
/home/dlin/.conda/envs/mGATK/lib/python3.9/site-packages/pandas/core/arraylike.py:397: RuntimeWarning: divide by zero encountered in log10
result = getattr(ufunc, method)(*inputs, **kwargs)
So I guess my final questions are:
@dy-lin thanks for the effort here :) just so I can confirm, did you rerun with the same output folder already existing, or did you still get these errors (1-3) if the output folder didn't exist? That's the only way off the top of my head I could imagine there being an issue that would have caused this error.
I'll re-run with an existing output directory and without one and see if that clears the snakemake log then I'll report back!
Looks like in either case, snakemake is still registering that files are missing even though they aren't. However, I discovered a quirk in installation. If I install all the dependencies (pysam=0.19.0, numpy, matplotlib) first using conda, and then install mGATK using conda's pip, mGATK does not run into the 'improper folder specification' error (without any of the code modifications I referred to above). ¯\_(ツ)_/¯
yeah, I'm now running into this issue:
[Mon Apr 17 21:20:00 2023]
rule process_one_slice:
input: INT1_DOGMA/mgatk/.internal/samples/barcodes.19.bam.txt
output: INT1_DOGMA/mgatk/qc/depth/barcodes.19.depth.txt, INT1_DOGMA/mgatk/temp/sparse_matrices/barcodes.19.A.txt, INT1_DOGMA/mgatk/temp/sparse_matrices/barcodes.19.C.txt, INT1_DOGMA/mgatk/temp/sparse_matrices/barcodes.19.G.txt, INT1_DOGMA/mgatk/temp/sparse_matrices/barcodes.19.T.txt, INT1_DOGMA/mgatk/temp/sparse_matrices/barcodes.19.coverage.txt
jobid: 28
reason: Missing output files: INT1_DOGMA/mgatk/temp/sparse_matrices/barcodes.19.coverage.txt, INT1_DOGMA/mgatk/qc/depth/barcodes.19.depth.txt, INT1_DOGMA/mgatk/temp/sparse_matrices/barcodes.19.T.txt, INT1_DOGMA/mgatk/temp/sparse_matrices/barcodes.19.G.txt, INT1_DOGMA/mgatk/temp/sparse_matrices/barcodes.19.A.txt, INT1_DOGMA/mgatk/temp/sparse_matrices/barcodes.19.C.txt; Code has changed since last execution
wildcards: sample=barcodes.19
resources: tmpdir=/fast/users/knighto_c/scratch/tmp
RuleException:
SamtoolsError in file /fast/work/users/knighto_c/bin/miniconda3/envs/mito/lib/python3.9/site-packages/mgatk/bin/snake/Snakefile.tenx, line 113:
'samtools returned with error 1: stdout=, stderr=samtools index: failed to create index for "INT1_DOGMA/mgatk/temp/ready_bam/barcodes.15.qc.bam"\n'
File "/fast/work/users/knighto_c/bin/miniconda3/envs/mito/lib/python3.9/site-packages/mgatk/bin/snake/Snakefile.tenx", line 113, in __rule_process_one_slice
File "/data/gpfs-1/users/knighto_c/work/bin/miniconda3/envs/mito/lib/python3.9/site-packages/pysam/utils.py", line 69, in __call__
File "/data/gpfs-1/users/knighto_c/work/bin/miniconda3/envs/mito/lib/python3.9/concurrent/futures/thread.py", line 58, in run
and then eventually,
Error in checkGrep(grep(".A.txt", files)) :
Improper folder specification; file missing / extra file present. See documentation
Calls: importMito -> checkGrep
Execution halted
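The wording of that R error suggests the import step expects exactly one file matching each pattern (such as ".A.txt") in the folder it reads. I have not confirmed that against the mgatk source, so treat the sketch below as an assumption: the pattern list beyond ".A.txt" is guessed from the files seen under final/ earlier in this thread, and `check_final_folder` is a hypothetical name.

```python
import os

# Only ".A.txt" appears in the error message; the other patterns are assumed.
DEFAULT_PATTERNS = (".A.txt", ".C.txt", ".G.txt", ".T.txt", ".coverage.txt")

def check_final_folder(folder, patterns=DEFAULT_PATTERNS):
    """Map each pattern without exactly one matching file to the list of its matches."""
    files = sorted(os.listdir(folder))
    problems = {}
    for pattern in patterns:
        hits = [name for name in files if pattern in name]
        if len(hits) != 1:
            problems[pattern] = hits
    return problems
```

An empty dict would mean every pattern matched exactly one file; an entry with an empty list points at a missing file, and an entry with several names points at leftovers from an earlier run in the same folder.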
@dy-lin, could you post the exact commands you used to create the mgatk environment, and then run genotyping? thanks so much!
alright, it's not very helpful as I can't fully figure out what's going on, but this is how I solved it:
# clean miniconda3 installation
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p miniconda3 && rm Miniconda3-latest-Linux-x86_64.sh
source miniconda3/etc/profile.d/conda.sh && conda activate
conda install mamba
# create mgatk env
mamba create -y -n mito openjdk r-data.table r-matrix bioconductor-genomicranges bioconductor-summarizedexperiment
conda activate mito
pip install mgatk
# DOGMA-seq run, so barcodes needs to be gunzipped
sample_id=folder_name
cp $sample_id/outs/filtered_feature_bc_matrix/barcodes.tsv.gz $sample_id/outs/filtered_feature_bc_matrix/barcodes1.tsv.gz
gunzip $sample_id/outs/filtered_feature_bc_matrix/barcodes1.tsv.gz
mgatk tenx -i $sample_id/outs/atac_possorted_bam.bam -n $sample_id -o $sample_id/mgatk -c 8 -bt CB -b $sample_id/outs/filtered_feature_bc_matrix/barcodes1.tsv -z
rm $sample_id/outs/filtered_feature_bc_matrix/barcodes1.tsv
The critical part I found here is only specifying 8 cores for mgatk tenx.
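The copy/gunzip/rm steps for the barcodes file in the commands above can also be done without touching the original .gz, for instance with a small Python helper like this sketch (`gunzip_copy` is a hypothetical name; only the standard library is used).

```python
import gzip
import shutil

def gunzip_copy(src_gz, dest):
    """Decompress src_gz into dest, leaving the original .gz file untouched."""
    with gzip.open(src_gz, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

The decompressed copy can then be passed to mgatk tenx via -b and deleted afterwards, as in the shell version.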
Describe the bug
The command generates only the chrM_refAllele.txt.
A summary of .log files
mgatk.snakemake_tenx.log
base.mgatk.log
Post an ls -lRh of mgatk_output_folder
Describe the sequencing assay being analyzed
The assay is scATAC-seq from the 10X dataset. I'm directly downloading the BAM file, index, and barcode TSV from 10X. https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-3-k-1-standard-2-0-0
Clarify if the execution successful on the test data provided in the repository
It does not work on the test data (see #59)
Additional context
I've already tried solutions in other issues such as --snake-stdout and --keep-duplicates (#46).