Open Zethson opened 3 years ago
Do the recent updates answer your question? I believe the small example dataset is present in tests and can be run along these lines:
maegtk bcall -i data/test_maester.bam -o test_maester -z
Currently doing other things. Will revisit this at a later point. Thank you for replying though!
On a similar note, adding the R package dependencies to the FAQ would be helpful. I was moving fast so didn't realize it also needed SummarizedExperiment
so it almost finished and then failed
For what it's worth, I didn't do a full container but used conda-forge to install the dependencies for only this project, and it was nice. You could even set up a maegatk conda(-forge) package and it would take care of installing all the binary dependencies for future users
added some notes on the R package install-- thanks for that note @JZL
I don't have any direct experience making a conda package but would welcome a pull request from anyone for this!
Hi,
These are just some scattered thoughts getting it to run. I can make a more formal PR with these changes in the future
To get this running for future people. You also need to install samtools
and run bwa index ***.fa
first, else it fails.
Additionally:
pysam
, bwa mem
, and samtools sort
support a threads argument which seemed to help speed it up a little. (Although I learned the hard way that if you do this within multi_file_manager
it causes a thread bomb an crashes the machine!)writePassingReads
where it checks if read_barcode
is in bc
and then gets its index, when bc
is a list it iteratively checks all the items for each call. Which is a lot of iterative checking for every read of big bams. I didn't profile the difference but I thought I might as well define bcDict = dict((x, i) for i, x in enumerate(bc))
and then use the dictionary to check if the read_barcode
is in bc
and to get the index in constant timeDo the recent updates answer your question? I believe the small example dataset is present in tests and can be run along these lines:
maegtk bcall -i data/test_maester.bam -o test_maester -z
I am having a similar issue when trying to run this small test dataset as described. Is there a specific snakemake version that maegatk has been tested with? Trying to isolate what is going wrong for me. The pipeline seems to run fine for the 'splitting' part, but then fails to submit the snakemake job and perform the remaining analyses:
Thu Mar 28 22:36:19 EDT 2024: maegatk v0.2.0
Thu Mar 28 22:36:19 EDT 2024: Found bam file: tests/data/test_maester.bam for genotyping.
Thu Mar 28 22:36:19 EDT 2024: Will determine barcodes with at least: 100 mitochondrial reads.
Thu Mar 28 22:36:19 EDT 2024: User specified mitochondrial genome matches .bam file
Thu Mar 28 22:36:23 EDT 2024: Finished determining/splitting barcodes for genotyping.
Thu Mar 28 22:36:23 EDT 2024: Genotyping samples with 2 threads
where the above is the end of the log message. If I look in "final/" this is all I see:
[final]$ ls
barcodeQuants.tsv chrM_refAllele.txt passingBarcodes.tsv
My environment for running maegatk looks like this and is based on your description in the readme.
name: "maegatk"
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- openjdk
- python >= 3.8
- snakemake == 6.15.5
- bwa == 0.7.17
- samtools >= 1.9
- freebayes
- r-base == 4.0.3 # dplyr, data.table, Matrix, GenomicRanges, SummarizedExperiment
- pip:
- maegatk == 0.2.0
- ruamel.yaml <= 0.18.0
Thanks!
@alecpnkw do you have any log files in logs/? My snakemake version is 7.18.2, but I don't think maegatk was ever tested with a particular version of it.
If nothing informative comes from logs/, try modifying snakecmd_scatter
and snakecmd_gather
commands in your cli.py by providing a full path to snakemake in case your system doesn't see it, like
snakecmd_scatter = '/full/path/to/snakemake'+snakeclust+' --snakefile ' + script_dir + '/bin/snake/Snakefile.maegatk.Scatter --cores '+ncores+' --config cfp="' + y_s + '" --stats '+snake_stats + snake_log_out
If that won't work either, try this: Following this issue, try upgrading ruamel.yaml to the current version, then make the modifications described below, then remove the output directory and rerun it from scratch:
Replace your entire cli.py with the code from https://github.com/caleblareau/maegatk/blob/master/maegatk/cli.py
Modify oneSample_maegatk.py:
change this
from ruamel import yaml
into this
from ruamel.yaml import YAML
then change this
config = yaml.load(stream, Loader=yaml.Loader)
into this
yaml = YAML(typ='rt')
config = yaml.load(stream)
Thanks, knowing that a later snakemake versions runs in helpful. I'd like to try to keep the source the same as v0.2.0 for deployment reasons but happy to test if I can't get anything else to work. Based on the issues you linked I tweaked my environment to change the ruamel.yaml version to 0.17 and now am getting better results (!). It looks like the final output is there (final/maegatk.rds
), and final/maegatk.depthTable.txt
looks like this:
TGGTAGTGAACC-1 57.35
GGGTTCGCCTCC-1 66.34
TTCCTACGCAAT-1 55.25
CCCGTCGTGGTA-1 53.17
CCACAAAACATG-1 48.12
GTACCCACAGCC-1 43.3
GGCGCTAATGAA-1 67.39
ACAATTAGCACT-1 51.39
CTAACCCGGAAT-1 42.79
GACCGTGCATTT-1 52.64
Still, I am not getting any logs inside logs/rmdupslogs
and getting dumped cores that I think correspond to the java jobs. Should I be seeing logs for rmdups in the test output?
It seems your test job has finished successfully. When I suggested checking the logs, I was referring to .log files, like maegatk.snakemake_scatter.log or maegatk.snakemake_gather.log which are usually helpful. I am not the original developer of maegatk so my knowledge on this part is rather limited and I cannot comment on the expected content of rmdupslogs/, but my rmdupslogs/ is empty too (at least for the test dataset).
Awesome, thanks. Is there someplace I can access a test final/maegatk.rds
file to make sure I'm generating all the intended assay slots and coldata?
@alecpnkw do you have any log files in logs/? My snakemake version is 7.18.2, but I don't think maegatk was ever tested with a particular version of it.
If nothing informative comes from logs/, try modifying
snakecmd_scatter
andsnakecmd_gather
commands in your cli.py by providing a full path to snakemake in case your system doesn't see it, likesnakecmd_scatter = '/full/path/to/snakemake'+snakeclust+' --snakefile ' + script_dir + '/bin/snake/Snakefile.maegatk.Scatter --cores '+ncores+' --config cfp="' + y_s + '" --stats '+snake_stats + snake_log_out
If that won't work either, try this: Following this issue, try upgrading ruamel.yaml to the current version, then make the modifications described below, then remove the output directory and rerun it from scratch:
- Replace your entire cli.py with the code from https://github.com/caleblareau/maegatk/blob/master/maegatk/cli.py
- Modify oneSample_maegatk.py: change this
from ruamel import yaml
into thisfrom ruamel.yaml import YAML
then change thisconfig = yaml.load(stream, Loader=yaml.Loader)
into thisyaml = YAML(typ='rt') config = yaml.load(stream)
Hi there @noranekonobokkusu. I came across similar problems using the test data as you did in issue #5 . First I uninstalled ruamel.yaml (current version 0.18.6) and then installed ruamel.yaml==0.17 and still get the same issues, so back to 0.18.6. I've tried snakemake 6.6.1 and 7.18.2. I tried specifying snakemake path in the updated cli.py file, and the modification to the oneSample_maegatk.py as suggested. Still I get the same error output:
Error in checkGrep(grep(".A.txt", files)) : Improper folder specification; file missing / extra file present. See documentation Calls: importMito -> checkGrep Execution halted
base.maegatk.log does not show errors, however I do not see the .rds file in /final. but in maegatk.snakemake_gather.log I get an error about "Error in rule make_final_sparse_matrices"and in maegatk.snakemake_scatter.log some Java and samtools error. I've tried version Java 1.7.0_261 and Java/1.8.0_144 (this is with 1.8.0_144). See attached log files.
base.maegatk.log maegatk.snakemake_gather (1).log maegatk.snakemake_scatter (1).log
What would your recommendation be to get this to run? I'd really like to use this tool!
Hi @Spirometra sorry for the late reply.
I can see bwa
is not available in your path (from .scatter.log):
sh: bwa: command not found
Try adding BWA to your environment and rerunning.
Hi,
would it be possible that you provide a container with a full environment to get the pipeline running, please? A small example dataset would also help a lot. Moreover, the pipeline parameters and flags etc are not document yet.
Thank you very much!