DiltheyLab / HLA-LA

Fast HLA type inference from whole-genome data
GNU General Public License v3.0

Decouple graphs directory with HLA-LA installation #96

Open skchronicles opened 1 year ago

skchronicles commented 1 year ago

Hello there,

I hope all is going well on your side, and that you are having a wonderful day!

I just built a docker image for HLA-LA. For anyone interested, the Dockerfile is located here: https://github.com/OpenOmics/genome-seek/blob/main/docker/genome-seek/hla/Dockerfile

And it can be pulled from here, using the command below:

docker pull skchronicles/genome-seek_hla:v0.1.0

After testing the image, I noticed one peculiarity caused by where I installed the graph reference files:

[Fri Apr 28 16:40:23 2023]
Job 0: Running HLA*LA on '/data/dev/genome-seek/test_subsampled_docker/BAM/HG004.sorted.bam' input file
Reason: Missing output files: /data/dev/genome-seek/test_subsampled_docker/HLA/HG004/sample/hla/R1_bestguess_G.txt

HLA-LA.pl \
    --BAM /data/dev/genome-seek/test_subsampled_docker/BAM/HG004.sorted.bam \
    --graph /data/OpenOmics/references/genome-seek/HLA-LA/graphs/PRG_MHC_GRCh38_withIMGT \
    --sampleID sample \
    --maxThreads 8 \
    --workingDir /data/dev/genome-seek/test_subsampled_docker/HLA/HG004

Activating singularity image /data/dev/genome-seek/test_subsampled_docker/.snakemake/singularity/ba473d7fec36d9d46ceceb85188e9863.simg
HLA-LA.pl

Identified paths:
    samtools_bin: /usr/local/bin/samtools
    bwa_bin: /usr/bin/bwa
    java_bin: /usr/bin/java
    picard_sam2fastq_bin: /opt2/picard/2.27.5/picard.jar
    General working directory: /data/dev/genome-seek/test_subsampled_docker/HLA/HG004
    Sample-specific working directory: /data/dev/genome-seek/test_subsampled_docker/HLA/HG004/sample

Graph directory /opt2/hla-la/1.0.3/HLA-LA/src/../graphs//data/OpenOmics/references/genome-seek/HLA-LA/graphs/PRG_MHC_GRCh38_withIMGT not found - valid graph names are subdirectories of the graphs directory in the HLA-LA root at /opt2/hla-la/1.0.3/HLA-LA/src/HLA-LA.pl line 247.

I cannot bundle/install the graph reference files within the docker image due to their size (~29GB). It looks like the graphs directory must exist in a specific location relative to the HLA-LA installation: the value passed to --graph is appended to the graphs directory in the HLA-LA root, so an absolute path is not accepted. I am wondering whether it would be possible to decouple the graph directory (i.e. your reference files) from the HLA-LA installation. This extra flexibility would be great for docker/singularity users, as it would allow using the tool without any complicated binding. It would also give sysadmins more flexibility regarding how and where to install the tool.
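The path in the error message can be reproduced with a small shell sketch. The two paths are copied from the log above; the concatenation rule is inferred from the error message only, not from reading HLA-LA.pl, and the variable names are illustrative.

```shell
# Paths copied from the log output; variable names are hypothetical.
hla_la_src="/opt2/hla-la/1.0.3/HLA-LA/src"
graph_arg="/data/OpenOmics/references/genome-seek/HLA-LA/graphs/PRG_MHC_GRCh38_withIMGT"

# Assumption (inferred from the error): the --graph value is appended to
# <src>/../graphs/ as-is, even when it is an absolute path.
resolved="${hla_la_src}/../graphs/${graph_arg}"

# Prints the same non-existent path that appears in the error message.
echo "$resolved"
```

The double slash after `graphs` in the printed path is the point where the absolute --graph value was joined to the installation's graphs directory.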

Right now, I am running your tool within a Snakemake pipeline. Snakemake abstracts away the commands needed to run a given tool inside a docker container, as long as you provide a bind list up front. This keeps the commands clean and interoperable with other software management tools, so any given step in the pipeline can run using environment modules, conda, or docker/singularity. I could call singularity directly within my pipeline and bind the host graphs path to the container's graphs path (relative to where HLA-LA is installed); however, the command would then only be compatible with docker/singularity, and it would no longer work with environment modules or conda.
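For reference, the direct singularity call mentioned above might look roughly like the sketch below. It only builds and prints the command rather than running it. The paths come from the log output; the container-side graphs path is an assumption derived from the install location in the error message, and the extra bind for the input/output directory is also an assumption about what the container would need.

```shell
# Paths taken from the log output; variable names are hypothetical.
host_graphs="/data/OpenOmics/references/genome-seek/HLA-LA/graphs"
container_graphs="/opt2/hla-la/1.0.3/HLA-LA/graphs"   # assumed: <HLA-LA root>/graphs
work_root="/data/dev/genome-seek/test_subsampled_docker"
image="${work_root}/.snakemake/singularity/ba473d7fec36d9d46ceceb85188e9863.simg"

# Bind the host graphs onto the location HLA-LA expects, so --graph can be
# given as a plain graph name. The second bind exposes the BAM and output dirs.
cmd="singularity exec \
--bind ${host_graphs}:${container_graphs} \
--bind ${work_root} \
${image} \
HLA-LA.pl \
--BAM ${work_root}/BAM/HG004.sorted.bam \
--graph PRG_MHC_GRCh38_withIMGT \
--sampleID sample \
--maxThreads 8 \
--workingDir ${work_root}/HLA/HG004"

# Print the command for inspection instead of executing it.
echo "$cmd"
```

This works only where singularity is available, which is exactly the portability problem described above.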

If it would not be too much trouble, could you please update the --graph option so that it works with reference files installed in another location? There is no immediate rush. I am just hoping this can be added in the next release or whenever you have some free time.

Please let me know what you think.

Best Regards, @skchronicles

AlexanderDilthey commented 1 year ago

Hi @skchronicles,

Thank you for your note!

HLA-LA.pl has an (undocumented) customGraphDir command-line parameter; it seems that this may be what you need?

Best wishes

Alex
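A sketch of how the original command might be adapted to the suggestion above. The parameter is undocumented, so its exact semantics are an assumption here: the sketch supposes that --customGraphDir takes the directory containing the graph subdirectories and that --graph remains the graph name. The command is only built and printed; check it against HLA-LA.pl before relying on it.

```shell
# Paths taken from the original report; variable names are hypothetical.
custom_graph_dir="/data/OpenOmics/references/genome-seek/HLA-LA/graphs"
graph_name="PRG_MHC_GRCh38_withIMGT"
work_root="/data/dev/genome-seek/test_subsampled_docker"

# Assumption: --customGraphDir replaces the built-in graphs directory and
# --graph is resolved as a subdirectory of it.
cmd="HLA-LA.pl \
--BAM ${work_root}/BAM/HG004.sorted.bam \
--graph ${graph_name} \
--customGraphDir ${custom_graph_dir} \
--sampleID sample \
--maxThreads 8 \
--workingDir ${work_root}/HLA/HG004"

# Print the command for inspection instead of executing it.
echo "$cmd"
```

If the parameter behaves as assumed, the same command line would work unchanged under environment modules, conda, or docker/singularity, since no container-specific bind onto the installation path is required.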