caleblareau / mgatk

mgatk: mitochondrial genome analysis toolkit
http://caleblareau.github.io/mgatk
MIT License

Missing files for mgatk tenx #59

Closed dy-lin closed 1 year ago

dy-lin commented 2 years ago

Describe the bug

The tool does not throw a specific error and looks to have worked, but the logfile shows a traceback.

mgatk tenx -i barcode/test_barcode.bam -n bc1 -o bc1dmem -bt CB -b barcode/test_barcodes.txt -c 2

A summary of .log files

Thu May 12 17:37:28 PDT 2022: Starting analysis with mgatk
Thu May 12 17:37:28 PDT 2022: Processing samples with 2 threads
Thu May 12 17:37:46 PDT 2022: mgatk successfully processed the supplied .bam files
Thu May 12 17:38:06 PDT 2022: Successfully created final output files
Thu May 12 17:38:06 PDT 2022: Intermediate files successfully removed.
Thu May 12 17:40:01 PDT 2022: Starting analysis with mgatk
Thu May 12 17:40:01 PDT 2022: Processing samples with 2 threads
Thu May 12 17:40:22 PDT 2022: mgatk successfully processed the supplied .bam files
Thu May 12 17:40:37 PDT 2022: Successfully created final output files
Thu May 12 17:40:37 PDT 2022: Intermediate files successfully removed.
Thu May 12 17:47:04 PDT 2022: Starting analysis with mgatk
Thu May 12 17:47:04 PDT 2022: Processing samples with 2 threads
Thu May 12 17:47:25 PDT 2022: mgatk successfully processed the supplied .bam files
Thu May 12 17:47:42 PDT 2022: Successfully created final output files
Thu May 12 17:47:42 PDT 2022: Intermediate files successfully removed.
Thu May 12 17:49:16 PDT 2022: Starting analysis with mgatk
Thu May 12 17:49:16 PDT 2022: Processing samples with 2 threads
Thu May 12 17:50:16 PDT 2022: mgatk successfully processed the supplied .bam files
Thu May 12 17:50:38 PDT 2022: Successfully created final output files
Thu May 12 17:50:38 PDT 2022: Intermediate files successfully removed.
Thu May 12 17:57:14 PDT 2022: Starting analysis with mgatk
Thu May 12 17:57:14 PDT 2022: Processing samples with 2 threads
Thu May 12 18:02:23 PDT 2022: mgatk successfully processed the supplied .bam files
Thu May 12 18:02:40 PDT 2022: Successfully created final output files
Thu May 12 18:02:40 PDT 2022: Intermediate files successfully removed.
Thu May 12 18:04:43 PDT 2022: Starting analysis with mgatk
Thu May 12 18:04:43 PDT 2022: Processing samples with 2 threads
Thu May 12 18:09:57 PDT 2022: mgatk successfully processed the supplied .bam files
Thu May 12 18:10:10 PDT 2022: Successfully created final output files
Thu May 12 18:10:10 PDT 2022: Intermediate files successfully removed.
Thu May 12 18:25:35 PDT 2022: Starting analysis with mgatk
Thu May 12 18:25:35 PDT 2022: Processing samples with 2 threads
Thu May 12 18:29:19 PDT 2022: mgatk successfully processed the supplied .bam files
Thu May 12 18:29:29 PDT 2022: Successfully created final output files
Thu May 12 18:29:29 PDT 2022: Intermediate files successfully removed.
Thu May 12 19:18:13 PDT 2022: Starting analysis with mgatk
Thu May 12 19:18:13 PDT 2022: Processing samples with 2 threads
Thu May 12 19:21:25 PDT 2022: mgatk successfully processed the supplied .bam files
Thu May 12 19:21:35 PDT 2022: Successfully created final output files
Thu May 12 19:21:36 PDT 2022: Intermediate files successfully removed.
Fri May 13 11:25:46 PDT 2022: Starting analysis with mgatk
Fri May 13 11:25:46 PDT 2022: Processing samples with 2 threads
Fri May 13 11:26:24 PDT 2022: mgatk successfully processed the supplied .bam files
Fri May 13 11:26:34 PDT 2022: Successfully created final output files
Fri May 13 11:26:34 PDT 2022: Intermediate files successfully removed.
Fri May 13 11:35:52 PDT 2022: Starting analysis with mgatk
Fri May 13 11:35:52 PDT 2022: Processing samples with 2 threads
Fri May 13 11:36:29 PDT 2022: mgatk successfully processed the supplied .bam files
Fri May 13 11:36:40 PDT 2022: Successfully created final output files
Fri May 13 11:36:40 PDT 2022: Intermediate files successfully removed.
Config file bc1dmem/.internal/parseltongue/snake.scatter.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
Using shell: /projects/karsanlab/dlin_dev/software/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job stats:
job                           count    min threads    max threads
--------------------------  -------  -------------  -------------
all                               1              1              1
call_variants                     1              1              1
make_depth_table                  1              1              1
make_final_sparse_matrices        1              1              1
process_one_slice                 2              1              1
total                             6              1              1

Select jobs to execute...

[Fri May 13 11:35:54 2022]
rule process_one_slice:
    input: bc1dmem/.internal/samples/barcodes.1.bam.txt
    output: bc1dmem/qc/depth/barcodes.1.depth.txt, bc1dmem/temp/sparse_matrices/barcodes.1.A.txt, bc1dmem/temp/sparse_matrices/barcodes.1.C.txt, bc1dmem/temp/sparse_matrices/barcodes.1.G.txt, bc1dmem/temp/sparse_matrices/barcodes.1.T.txt, bc1dmem/temp/sparse_matrices/barcodes.1.coverage.txt
    jobid: 2
    wildcards: sample=barcodes.1
    resources: tmpdir=/tmp

[Fri May 13 11:35:54 2022]
rule process_one_slice:
    input: bc1dmem/.internal/samples/barcodes.2.bam.txt
    output: bc1dmem/qc/depth/barcodes.2.depth.txt, bc1dmem/temp/sparse_matrices/barcodes.2.A.txt, bc1dmem/temp/sparse_matrices/barcodes.2.C.txt, bc1dmem/temp/sparse_matrices/barcodes.2.G.txt, bc1dmem/temp/sparse_matrices/barcodes.2.T.txt, bc1dmem/temp/sparse_matrices/barcodes.2.coverage.txt
    jobid: 3
    wildcards: sample=barcodes.2
    resources: tmpdir=/tmp

[Fri May 13 11:36:08 2022]
Finished job 3.
1 of 6 steps (17%) done
[Fri May 13 11:36:18 2022]
Finished job 2.
2 of 6 steps (33%) done
Select jobs to execute...

[Fri May 13 11:36:18 2022]
rule make_depth_table:
    input: bc1dmem/qc/depth/barcodes.1.depth.txt, bc1dmem/qc/depth/barcodes.2.depth.txt
    output: bc1dmem/final/bc1.depthTable.txt
    jobid: 1
    resources: tmpdir=/tmp

[Fri May 13 11:36:18 2022]
rule make_final_sparse_matrices:
    input: bc1dmem/temp/sparse_matrices/barcodes.1.A.txt, bc1dmem/temp/sparse_matrices/barcodes.2.A.txt, bc1dmem/temp/sparse_matrices/barcodes.1.C.txt, bc1dmem/temp/sparse_matrices/barcodes.2.C.txt, bc1dmem/temp/sparse_matrices/barcodes.1.G.txt, bc1dmem/temp/sparse_matrices/barcodes.2.G.txt, bc1dmem/temp/sparse_matrices/barcodes.1.T.txt, bc1dmem/temp/sparse_matrices/barcodes.2.T.txt, bc1dmem/temp/sparse_matrices/barcodes.1.coverage.txt, bc1dmem/temp/sparse_matrices/barcodes.2.coverage.txt
    output: bc1dmem/final/bc1.A.txt.gz, bc1dmem/final/bc1.C.txt.gz, bc1dmem/final/bc1.G.txt.gz, bc1dmem/final/bc1.T.txt.gz, bc1dmem/final/bc1.coverage.txt.gz
    jobid: 4
    resources: tmpdir=/tmp

[Fri May 13 11:36:19 2022]
Finished job 1.
3 of 6 steps (50%) done
[Fri May 13 11:36:20 2022]
Finished job 4.
4 of 6 steps (67%) done
Select jobs to execute...

[Fri May 13 11:36:20 2022]
rule call_variants:
    input: bc1dmem/final/bc1.A.txt.gz, bc1dmem/final/bc1.C.txt.gz, bc1dmem/final/bc1.G.txt.gz, bc1dmem/final/bc1.T.txt.gz, bc1dmem/final/chrM_refAllele.txt
    output: bc1dmem/final/bc1.variant_stats.tsv.gz, bc1dmem/final/bc1.cell_heteroplasmic_df.tsv.gz, bc1dmem/final/bc1.vmr_strand_plot.png
    jobid: 5
    resources: tmpdir=/tmp

Traceback (most recent call last):
  File "/home/dlin/.local/lib/python3.7/site-packages/mgatk/bin/python/variant_calling.py", line 90, in <module>
    base_coverage_dict = load_mgatk_output(MGATK_OUT_DIR, mito_length)
  File "/home/dlin/.local/lib/python3.7/site-packages/mgatk/bin/python/variant_calling.py", line 30, in load_mgatk_output
    fwd_base_df[missing_pos] = 0  # fill in missing positions
  File "/projects/karsanlab/software/linux-x86_64-centos7/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 3367, in __setitem__
    self._setitem_array(key, value)
  File "/projects/karsanlab/software/linux-x86_64-centos7/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 3393, in _setitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "/projects/karsanlab/software/linux-x86_64-centos7/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1354, in _convert_to_indexer
    return self._get_listlike_indexer(obj, axis, **kwargs)[1]
  File "/projects/karsanlab/software/linux-x86_64-centos7/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1161, in _get_listlike_indexer
    raise_missing=raise_missing)
  File "/projects/karsanlab/software/linux-x86_64-centos7/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1246, in _validate_read_indexer
    key=key, axis=self.obj._get_axis_name(axis)))
KeyError: "None of [Int64Index([    1,     4,     6,     8,     9,    11,    12,    14,    17,\n               18,\n            ...\n            16548, 16550, 16555, 16558, 16560, 16562, 16565, 16566, 16568,\n            16569],\n           dtype='int64', length=8550)] are in the [columns]"
MissingOutputException in line 160 of /home/dlin/.local/lib/python3.7/site-packages/mgatk/bin/snake/Snakefile.tenx:
Job Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
bc1dmem/final/bc1.variant_stats.tsv.gz
bc1dmem/final/bc1.cell_heteroplasmic_df.tsv.gz
bc1dmem/final/bc1.vmr_strand_plot.png completed successfully, but some output files are missing. 0
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-05-13T113553.465159.snakemake.log

Post an ls -lRh of mgatk_output_folder

bc1dmem/:
total 12K
drwxrwsr-x 2 dlin karsanlab 4.0K May 13 11:36 final
drwxrwsr-x 4 dlin karsanlab 4.0K May 12 18:02 logs
drwxrwsr-x 4 dlin karsanlab 4.0K May 12 17:49 qc

bc1dmem/final:
total 1.7M
-rw-rw-r-- 1 dlin karsanlab  96K May 13 11:36 bc1.A.txt.gz
-rw-rw-r-- 1 dlin karsanlab 194K May 13 11:36 bc1.coverage.txt.gz
-rw-rw-r-- 1 dlin karsanlab  96K May 13 11:36 bc1.C.txt.gz
-rw-rw-r-- 1 dlin karsanlab   76 May 13 11:36 bc1.depthTable.txt
-rw-rw-r-- 1 dlin karsanlab  50K May 13 11:36 bc1.G.txt.gz
-rw-rw-r-- 1 dlin karsanlab 537K May 13 11:36 bc1.rds
-rw-rw-r-- 1 dlin karsanlab 477K May 13 11:36 bc1.signac.rds
-rw-rw-r-- 1 dlin karsanlab  77K May 13 11:36 bc1.T.txt.gz
-rw-rw-r-- 1 dlin karsanlab 119K May 13 11:35 chrM_refAllele.txt

bc1dmem/logs:
total 36K
-rw-rw-r-- 1 dlin karsanlab 3.4K May 13 11:36 base.mgatk.log
-rw-rw-r-- 1 dlin karsanlab  475 May 13 11:35 bc1.parameters.txt
-rw-rw-r-- 1 dlin karsanlab 5.8K May 13 11:36 bc1.snakemake_tenx.log
-rw-rw-r-- 1 dlin karsanlab 8.7K May 12 19:21 bc1.snakemake_tenx.stats
drwxrwsr-x 2 dlin karsanlab 4.0K May 12 17:37 filterlogs
drwxrwsr-x 2 dlin karsanlab 4.0K May 12 17:37 rmdupslogs

bc1dmem/logs/filterlogs:
total 0
-rw-rw-r-- 1 dlin karsanlab 22 May 13 11:35 barcodes.1.filter.log
-rw-rw-r-- 1 dlin karsanlab 21 May 13 11:35 barcodes.2.filter.log

bc1dmem/logs/rmdupslogs:
total 8.0K
-rw-rw-r-- 1 dlin karsanlab 1.5K May 13 11:36 barcodes.1.rmdups.log
-rw-rw-r-- 1 dlin karsanlab 1.5K May 13 11:36 barcodes.2.rmdups.log

bc1dmem/qc:
total 8.0K
drwxrwsr-x 2 dlin karsanlab 4.0K May 13 11:36 depth
drwxrwsr-x 2 dlin karsanlab 4.0K May 12 17:37 quality

bc1dmem/qc/depth:
total 0
-rw-rw-r-- 1 dlin karsanlab 50 May 13 11:36 barcodes.1.depth.txt
-rw-rw-r-- 1 dlin karsanlab 26 May 13 11:36 barcodes.2.depth.txt

bc1dmem/qc/quality:
total 0

Describe the sequencing assay being analyzed

This is the test dataset and command provided by mgatk.

Clarify if the execution was successful on the test data provided in the repository

It does not work on the test data.

Additional context

Using python 3.7.3.

caleblareau commented 2 years ago

It looks like the overall run worked, since the A, C, G, and T text files were written to the output folder.

@vincent6liu can you run with the test data as shown here? It looks like it's failing at the step that you created, possibly because there are only 2 cells?
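
For reference, the traceback in the report ends in a `KeyError` raised by `fwd_base_df[missing_pos] = 0`, meaning the pandas in that Python 3.7 installation rejected an assignment to a list of column labels that do not exist yet. One way to fill missing positions without relying on that behavior is `DataFrame.reindex`. The sketch below is an assumption about what the step is trying to achieve, not the actual code in `variant_calling.py`; `fill_missing_positions` and the toy data are hypothetical.

```python
import pandas as pd


def fill_missing_positions(base_df, mito_length):
    """Return a frame with one column per position 1..mito_length.

    Positions absent from base_df are added as all-zero columns in a single
    operation, instead of assigning to a list of not-yet-existing labels.
    """
    all_positions = list(range(1, mito_length + 1))
    return base_df.reindex(columns=all_positions, fill_value=0)


# Toy stand-in for a cells-by-positions count table (hypothetical data).
toy = pd.DataFrame({2: [5, 0], 3: [1, 7]}, index=["cellA", "cellB"])
filled = fill_missing_positions(toy, mito_length=5)
print(filled)
```

Because `reindex` builds the full set of columns at once, it does not depend on how a given pandas release handles `df[list_of_new_labels] = value`.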

dy-lin commented 2 years ago

I tried refreshing my Python environment (creating a new conda environment and then installing mgatk with pip), and that seems to have helped. There are some warning messages, but it looks like the run completed.

This time using python 3.9.12.

Log files: base.mgatk.log

The snakemake file was too large to be uploaded, but here is the tail end:

/projects/karsanlab/dlin_dev/software/.conda/envs/mGATK3/lib/python3.9/site-packages/mgatk/bin/python/variant_calling.py:38: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  rev_base_df[missing_pos] = 0
/projects/karsanlab/dlin_dev/software/.conda/envs/mGATK3/lib/python3.9/site-packages/mgatk/bin/python/variant_calling.py:159: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  'mean_coverage', 'max_heteroplasmy']].astype(np.float)
/home/dlin/.conda/envs/mGATK3/lib/python3.9/site-packages/pandas/core/arraylike.py:397: RuntimeWarning: divide by zero encountered in log10
  result = getattr(ufunc, method)(*inputs, **kwargs)
[Tue May 24 10:58:05 2022]
Finished job 5.
5 of 6 steps (83%) done
Select jobs to execute...

[Tue May 24 10:58:05 2022]
localrule all:
    input: bc1dmem/final/bc1.depthTable.txt, bc1dmem/final/bc1.A.txt.gz, bc1dmem/final/bc1.C.txt.gz, bc1dmem/final/bc1.G.txt.gz, bc1dmem/final/bc1.T.txt.gz, bc1dmem/final/bc1.coverage.txt.gz, bc1dmem/final/bc1.variant_stats.tsv.gz, bc1dmem/final/bc1.cell_heteroplasmic_df.tsv.gz, bc1dmem/final/bc1.vmr_strand_plot.png
    jobid: 0
    reason: Input files updated by another job: bc1dmem/final/bc1.vmr_strand_plot.png, bc1dmem/final/bc1.coverage.txt.gz, bc1dmem/final/bc1.depthTable.txt, bc1dmem/final/bc1.T.txt.gz, bc1dmem/final/bc1.G.txt.gz, bc1dmem/final/bc1.C.txt.gz, bc1dmem/final/bc1.A.txt.gz, bc1dmem/final/bc1.variant_stats.tsv.gz, bc1dmem/final/bc1.cell_heteroplasmic_df.tsv.gz
    resources: tmpdir=/var/tmp

[Tue May 24 10:58:05 2022]
Finished job 0.
6 of 6 steps (100%) done
Complete log: .snakemake/log/2022-05-24T105549.715466.snakemake.log

bc1dmem/:
total 12K
drwxrwsr-x 2 dlin karsanlab 4.0K May 24 10:58 final
drwxrwsr-x 4 dlin karsanlab 4.0K May 12 18:02 logs
drwxrwsr-x 4 dlin karsanlab 4.0K May 12 17:49 qc

bc1dmem/final:
total 1.8M
-rw-rw-r-- 1 dlin karsanlab  96K May 24 10:56 bc1.A.txt.gz
-rw-rw-r-- 1 dlin karsanlab  519 May 24 10:58 bc1.cell_heteroplasmic_df.tsv.gz
-rw-rw-r-- 1 dlin karsanlab 194K May 24 10:56 bc1.coverage.txt.gz
-rw-rw-r-- 1 dlin karsanlab  96K May 24 10:56 bc1.C.txt.gz
-rw-rw-r-- 1 dlin karsanlab   76 May 24 10:56 bc1.depthTable.txt
-rw-rw-r-- 1 dlin karsanlab  50K May 24 10:56 bc1.G.txt.gz
-rw-rw-r-- 1 dlin karsanlab 537K May 24 10:58 bc1.rds
-rw-rw-r-- 1 dlin karsanlab 477K May 24 10:58 bc1.signac.rds
-rw-rw-r-- 1 dlin karsanlab  77K May 24 10:56 bc1.T.txt.gz
-rw-rw-r-- 1 dlin karsanlab  77K May 24 10:58 bc1.variant_stats.tsv.gz
-rw-rw-r-- 1 dlin karsanlab  27K May 24 10:58 bc1.vmr_strand_plot.png
-rw-rw-r-- 1 dlin karsanlab 119K May 24 10:55 chrM_refAllele.txt

bc1dmem/logs:
total 34M
-rw-rw-r-- 1 dlin karsanlab 4.1K May 24 10:58 base.mgatk.log
-rw-rw-r-- 1 dlin karsanlab  514 May 24 10:55 bc1.parameters.txt
-rw-rw-r-- 1 dlin karsanlab  34M May 24 10:58 bc1.snakemake_tenx.log
-rw-rw-r-- 1 dlin karsanlab 8.8K May 24 10:58 bc1.snakemake_tenx.stats
drwxrwsr-x 2 dlin karsanlab 4.0K May 12 17:37 filterlogs
drwxrwsr-x 2 dlin karsanlab 4.0K May 12 17:37 rmdupslogs

bc1dmem/logs/filterlogs:
total 0
-rw-rw-r-- 1 dlin karsanlab 22 May 24 10:55 barcodes.1.filter.log
-rw-rw-r-- 1 dlin karsanlab 21 May 24 10:55 barcodes.2.filter.log

bc1dmem/logs/rmdupslogs:
total 8.0K
-rw-rw-r-- 1 dlin karsanlab 1.5K May 24 10:55 barcodes.1.rmdups.log
-rw-rw-r-- 1 dlin karsanlab 1.5K May 24 10:55 barcodes.2.rmdups.log

bc1dmem/qc:
total 8.0K
drwxrwsr-x 2 dlin karsanlab 4.0K May 24 10:56 depth
drwxrwsr-x 2 dlin karsanlab 4.0K May 12 17:37 quality

bc1dmem/qc/depth:
total 0
-rw-rw-r-- 1 dlin karsanlab 50 May 24 10:56 barcodes.1.depth.txt
-rw-rw-r-- 1 dlin karsanlab 26 May 24 10:56 barcodes.2.depth.txt

bc1dmem/qc/quality:
total 0

dy-lin commented 2 years ago

@caleblareau do these warnings affect the final output files?

/projects/karsanlab/dlin_dev/software/.conda/envs/mGATK/lib/python3.9/site-packages/mgatk/bin/python/variant_calling.py:38: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  rev_base_df[missing_pos] = 0
/projects/karsanlab/dlin_dev/software/.conda/envs/mGATK/lib/python3.9/site-packages/mgatk/bin/python/variant_calling.py:159: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  'mean_coverage', 'max_heteroplasmy']].astype(np.float)
/home/dlin/.conda/envs/mGATK/lib/python3.9/site-packages/pandas/core/arraylike.py:397: RuntimeWarning: divide by zero encountered in log10

caleblareau commented 2 years ago

These shouldn't impact the final output files at all. Looks like you should be set! Glad you were able to debug this.
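
As a small illustration of why two of the quoted warnings are harmless, the sketch below reproduces them on made-up numbers. It is not mgatk's code: the column names are borrowed from the warning text and the values are hypothetical. The builtin `float` is what the deprecation message itself recommends in place of `np.float`, and `log10` of zero simply evaluates to `-inf` while emitting the `RuntimeWarning`.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical per-variant statistics; only the column names come from the log.
stats = pd.DataFrame(
    {"mean_coverage": [0, 10, 100], "max_heteroplasmy": [0.0, 0.5, 1.0]}
)

# astype(float) uses the builtin float, the replacement suggested for np.float.
as_float = stats[["mean_coverage", "max_heteroplasmy"]].astype(float)
values = as_float["mean_coverage"].to_numpy()

# log10(0) produces -inf and a RuntimeWarning rather than an exception.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy = np.log10(values)

# The same computation with the warning silenced gives identical numbers.
with np.errstate(divide="ignore"):
    quiet = np.log10(values)

print(noisy)
print([str(w.message) for w in caught])
```

Whether a `-inf` value matters downstream depends on how the result is used, for example as a plot coordinate.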