MikkelSchubert / paleomix

Pipelines and tools for the processing of ancient and modern HTS data.
https://paleomix.readthedocs.io/en/stable/
MIT License

Paleomix expecting gatk.jar to be in a specific folder? #28

gtrichard closed 4 years ago

gtrichard commented 4 years ago

One of my users tried to launch paleomix after a successful pip install and got the following error:

paleomix bam_pipeline run makefile.yaml
Reading makefiles ...
- Validating prefixes ...
Building BAM pipeline .
Running BAM pipeline ...
- Checking file dependencies ...
Errors detected during graph construction (max 20 shown):
Required file does not exist, and is not created by a node:
Filename: /home/genouest/ecobio/mollivier/install/jar_root/GenomeAnalysisTK.jar
Dependent node(s): <GATK Indel Realigner (aligning): 2 files in 'SAMPLE/BDD_METAZOA/Sample_1' -> 'SAMPLE.BDD_METAZOA.realigned.bam'>
<GATK Indel Realigner (training): 2 files in 'SAMPLE/BDD_METAZOA/Sample_1' -> 'SAMPLE/BDD_METAZOA.intervals'>
Required file does not exist, and is not created by a node:
Filename: /home/genouest/ecobio/mollivier/install/jar_root/picard.jar
Dependent node(s): <DepthHistogram: 2 files in 'SAMPLE/BDD_METAZOA/Sample_1' -> 'SAMPLE.BDD_METAZOA.depths'>
<MarkDuplicates: 3 files in 'SAMPLE/BDD_METAZOA/Sample_1/SL383339/Lane_1'>
<SequenceDictionary: 'prefixes/BDD_METAZOA.fasta'>
<Validate BAM: 'SAMPLE.BDD_METAZOA.realigned.bam'>
<Validate BAM: 'SAMPLE/BDD_METAZOA/Sample_1/SL383339.rmdup.collapsed.bam'>
and 11 more nodes ... 

I read paleomix/nodes/gatk.py and don't understand why the search for gatk.jar isn't replaced by a simple gatk call, which would be managed by a conda environment installation.

Otherwise, could you provide a step-by-step guide on how to install the various dependencies in the folders that paleomix expects?

Thank you.

gtrichard commented 4 years ago

There's documentation about it: https://paleomix.readthedocs.io/en/latest/bam_pipeline/requirements.html

But which versions of GATK / Picard are needed? This should be taken care of by an installation script to make paleomix easier to use.
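
In the meantime, the two files named in the error can be checked for before launching the pipeline. Below is a minimal sketch, not part of paleomix: the jar names are copied from the error message, the default folder is only inferred from the path shown there (`install/jar_root` under the user's home directory), and `missing_jars` is a hypothetical helper name.

```python
from pathlib import Path

# Jar names copied from the error message. The default folder is an assumption
# inferred from the path in that message (<home>/install/jar_root).
REQUIRED_JARS = ("GenomeAnalysisTK.jar", "picard.jar")
DEFAULT_JAR_ROOT = Path.home() / "install" / "jar_root"


def missing_jars(jar_root=DEFAULT_JAR_ROOT, required=REQUIRED_JARS):
    """Return the names of required jar files that are absent from jar_root."""
    root = Path(jar_root).expanduser()
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_jars()
    if missing:
        print(f"Missing from {DEFAULT_JAR_ROOT}: {', '.join(missing)}")
    else:
        print("All required jars found.")
```

An empty list means both jars are in place; anything else lists what still has to be copied into the folder.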

MikkelSchubert commented 4 years ago

> I read paleomix/nodes/gatk.py and don't understand why the search for gatk.jar isn't replaced by a simple gatk call, which would be managed by a conda environment installation.

The answer to that is simple: I don't use conda and I don't expect the end-user to use it either.

> But which versions of GATK / Picard are needed? This should be taken care of by an installation script to make paleomix easier to use.

For GATK you need a version prior to 4.0, since the Indel Realigner tool used by paleomix was removed in v4. Because of this removal, and because the Broad Institute makes it painful to locate old versions of GATK, use of GATK is being deprecated in the next major version of paleomix. For Picard, any recent version will do.
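
The "prior to 4.0" rule can be expressed as a small check. This is a rough sketch only: `gatk_has_indel_realigner` is a hypothetical helper, it assumes the version string starts with the major version number (optionally prefixed with `v`), and it does not cover how to obtain that string from a given jar.

```python
import re


def gatk_has_indel_realigner(version: str) -> bool:
    """Return True if the version string denotes a GATK release before 4.0.

    Assumes the string begins with the major version number, optionally
    preceded by "v" (for example "3.8-1" or "v4.1.9.0").
    """
    match = re.match(r"v?(\d+)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognised GATK version string: {version!r}")
    return int(match.group(1)) < 4
```

Only the major version is compared, since the requirement stated above is simply "before v4".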

Having an install script is a good idea, so I'll look at making the next major version of paleomix download the jar(s) automatically.

gtrichard commented 4 years ago

Great, thanks.