Container & documentation

Zethson commented 3 years ago

Hi,

would it be possible that you provide a container with a full environment to get the pipeline running, please? A small example dataset would also help a lot. Moreover, the pipeline parameters and flags etc are not document yet.

Thank you very much!

petervangalen commented 2 years ago

Do the recent updates answer your question? I believe the small example dataset is present in tests and can be run along these lines:

maegtk bcall -i data/test_maester.bam -o test_maester -z

Zethson commented 2 years ago

Currently doing other things. Will revisit this at a later point. Thank you for replying though!

JZL commented 2 years ago

On a similar note, adding the R package dependencies to the FAQ would be helpful. I was moving fast so didn't realize it also needed SummarizedExperiment so it almost finished and then failed

For what it's worth, I didn't do a full container but used conda-forge to install the dependencies for only this project, and it was nice. You could even set up a maegatk conda(-forge) package and it would take care of installing all the binary dependencies for future users

caleblareau commented 2 years ago

added some notes on the R package install-- thanks for that note @JZL

I don't have any direct experience making a conda package but would welcome a pull request from anyone for this!

JZL commented 2 years ago

Hi,

These are just some scattered thoughts getting it to run. I can make a more formal PR with these changes in the future

To get this running for future people. You also need to install samtools and run bwa index ***.fa first, else it fails.

Additionally:

pysam, bwa mem, and samtools sort support a threads argument which seemed to help speed it up a little. (Although I learned the hard way that if you do this within multi_file_manager it causes a thread bomb an crashes the machine!)
In https://github.com/caleblareau/maegatk/blob/6662c7c4188c2c254e9416450295553700473794/maegatk/bin/python/split_barcoded_bam.py#L39 within writePassingReads where it checks if read_barcode is in bc and then gets its index, when bc is a list it iteratively checks all the items for each call. Which is a lot of iterative checking for every read of big bams. I didn't profile the difference but I thought I might as well define bcDict = dict((x, i) for i, x in enumerate(bc)) and then use the dictionary to check if the read_barcode is in bc and to get the index in constant time

alecpnkw commented 7 months ago

Do the recent updates answer your question? I believe the small example dataset is present in tests and can be run along these lines:
maegtk bcall -i data/test_maester.bam -o test_maester -z

I am having a similar issue when trying to run this small test dataset as described. Is there a specific snakemake version that maegatk has been tested with? Trying to isolate what is going wrong for me. The pipeline seems to run fine for the 'splitting' part, but then fails to submit the snakemake job and perform the remaining analyses:

Thu Mar 28 22:36:19 EDT 2024: maegatk v0.2.0
Thu Mar 28 22:36:19 EDT 2024: Found bam file: tests/data/test_maester.bam for genotyping.
Thu Mar 28 22:36:19 EDT 2024: Will determine barcodes with at least: 100 mitochondrial reads.
Thu Mar 28 22:36:19 EDT 2024: User specified mitochondrial genome matches .bam file
Thu Mar 28 22:36:23 EDT 2024: Finished determining/splitting barcodes for genotyping.
Thu Mar 28 22:36:23 EDT 2024: Genotyping samples with 2 threads

where the above is the end of the log message. If I look in "final/" this is all I see:

[final]$ ls
barcodeQuants.tsv  chrM_refAllele.txt  passingBarcodes.tsv

My environment for running maegatk looks like this and is based on your description in the readme.

name: "maegatk"
channels: 
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - openjdk
  - python >= 3.8
  - snakemake == 6.15.5
  - bwa == 0.7.17
  - samtools >= 1.9
  - freebayes
  - r-base == 4.0.3 # dplyr, data.table, Matrix, GenomicRanges, SummarizedExperiment
  - pip:
    - maegatk == 0.2.0
    - ruamel.yaml <= 0.18.0

Thanks!

noranekonobokkusu commented 7 months ago

@alecpnkw do you have any log files in logs/? My snakemake version is 7.18.2, but I don't think maegatk was ever tested with a particular version of it.

If nothing informative comes from logs/, try modifying snakecmd_scatter and snakecmd_gather commands in your cli.py by providing a full path to snakemake in case your system doesn't see it, like

snakecmd_scatter = '/full/path/to/snakemake'+snakeclust+' --snakefile ' + script_dir + '/bin/snake/Snakefile.maegatk.Scatter --cores '+ncores+' --config cfp="'  + y_s + '" --stats '+snake_stats + snake_log_out

If that won't work either, try this: Following this issue, try upgrading ruamel.yaml to the current version, then make the modifications described below, then remove the output directory and rerun it from scratch:

Replace your entire cli.py with the code from https://github.com/caleblareau/maegatk/blob/master/maegatk/cli.py
Modify oneSample_maegatk.py: change this from ruamel import yaml into this from ruamel.yaml import YAML then change this config = yaml.load(stream, Loader=yaml.Loader) into this
```
yaml = YAML(typ='rt')
config = yaml.load(stream)
```

alecpnkw commented 7 months ago

Thanks, knowing that a later snakemake versions runs in helpful. I'd like to try to keep the source the same as v0.2.0 for deployment reasons but happy to test if I can't get anything else to work. Based on the issues you linked I tweaked my environment to change the ruamel.yaml version to 0.17 and now am getting better results (!). It looks like the final output is there (final/maegatk.rds), and final/maegatk.depthTable.txt looks like this:

TGGTAGTGAACC-1  57.35
GGGTTCGCCTCC-1  66.34
TTCCTACGCAAT-1  55.25
CCCGTCGTGGTA-1  53.17
CCACAAAACATG-1  48.12
GTACCCACAGCC-1  43.3
GGCGCTAATGAA-1  67.39
ACAATTAGCACT-1  51.39
CTAACCCGGAAT-1  42.79
GACCGTGCATTT-1  52.64

Still, I am not getting any logs inside logs/rmdupslogs and getting dumped cores that I think correspond to the java jobs. Should I be seeing logs for rmdups in the test output?

noranekonobokkusu commented 7 months ago

It seems your test job has finished successfully. When I suggested checking the logs, I was referring to .log files, like maegatk.snakemake_scatter.log or maegatk.snakemake_gather.log which are usually helpful. I am not the original developer of maegatk so my knowledge on this part is rather limited and I cannot comment on the expected content of rmdupslogs/, but my rmdupslogs/ is empty too (at least for the test dataset).

alecpnkw commented 7 months ago

Awesome, thanks. Is there someplace I can access a test final/maegatk.rds file to make sure I'm generating all the intended assay slots and coldata?

Spirometra commented 7 months ago

@alecpnkw do you have any log files in logs/? My snakemake version is 7.18.2, but I don't think maegatk was ever tested with a particular version of it.

If nothing informative comes from logs/, try modifying snakecmd_scatter and snakecmd_gather commands in your cli.py by providing a full path to snakemake in case your system doesn't see it, like
snakecmd_scatter = '/full/path/to/snakemake'+snakeclust+' --snakefile ' + script_dir + '/bin/snake/Snakefile.maegatk.Scatter --cores '+ncores+' --config cfp="'  + y_s + '" --stats '+snake_stats + snake_log_out
If that won't work either, try this: Following this issue, try upgrading ruamel.yaml to the current version, then make the modifications described below, then remove the output directory and rerun it from scratch:

Replace your entire cli.py with the code from https://github.com/caleblareau/maegatk/blob/master/maegatk/cli.py

Modify oneSample_maegatk.py: change this from ruamel import yaml into this from ruamel.yaml import YAML then change this config = yaml.load(stream, Loader=yaml.Loader) into this
yaml = YAML(typ='rt')
config = yaml.load(stream)

Hi there @noranekonobokkusu. I came across similar problems using the test data as you did in issue #5 . First I uninstalled ruamel.yaml (current version 0.18.6) and then installed ruamel.yaml==0.17 and still get the same issues, so back to 0.18.6. I've tried snakemake 6.6.1 and 7.18.2. I tried specifying snakemake path in the updated cli.py file, and the modification to the oneSample_maegatk.py as suggested. Still I get the same error output:

Error in checkGrep(grep(".A.txt", files)) : Improper folder specification; file missing / extra file present. See documentation Calls: importMito -> checkGrep Execution halted

base.maegatk.log does not show errors, however I do not see the .rds file in /final. but in maegatk.snakemake_gather.log I get an error about "Error in rule make_final_sparse_matrices"and in maegatk.snakemake_scatter.log some Java and samtools error. I've tried version Java 1.7.0_261 and Java/1.8.0_144 (this is with 1.8.0_144). See attached log files.

base.maegatk.log maegatk.snakemake_gather (1).log maegatk.snakemake_scatter (1).log

What would your recommendation be to get this to run? I'd really like to use this tool!

noranekonobokkusu commented 6 months ago

Hi @Spirometra sorry for the late reply. I can see bwa is not available in your path (from .scatter.log): sh: bwa: command not found Try adding BWA to your environment and rerunning.

caleblareau / maegatk

Container & documentation #3