DiltheyLab / HLA-LA

Fast HLA type inference from whole-genome data
GNU General Public License v3.0

Getting started with MinION FASTQ files #77

Closed fmobegi closed 11 months ago

fmobegi commented 2 years ago

I am getting started with HLA-LA for both HLA and KIR typing, starting from sequenced samples, but I can't get it working. With FASTQ files from a MinION, I am finding it difficult to determine exactly what is required to run the tool on my samples. The problem seems to be which reference genome to use. I have tried several downloads of GRCh38 as well as graphs/PRG_MHC_GRCh38_withIMGT/mapping_PRGonly/referenceGenome.fa and graphs/PRG_MHC_GRCh38_withIMGT/extendedReferenceGenome/extendedReferenceGenome.fa.

Here is my workflow:

FASTQ input >> (minimap2) MAP to HS reference >> (samtools) sort+index >> HLA-LA

These are the reference genomes ($REFERENCE) I have tried:

grep -c \> *.fna
GRCh38.encode.fna:   194
GRCh38.p14.fna:  709
hs38DH.fna:  456 < https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz>

graphs/PRG_MHC_GRCh38_withIMGT/mapping_PRGonly/referenceGenome.fa
graphs/PRG_MHC_GRCh38_withIMGT/extendedReferenceGenome/extendedReferenceGenome.fa

The commands and issues for just a single sample:

minimap2 -ax map-ont -t 8 $REFERENCE ./fastq_files/WT47_barcode75.fastq.gz | samtools sort -o WT47_barcode75.sorted.bam -T WT47_barcode75.tmp

samtools index WT47_barcode75.sorted.bam

/data/eval/tools/HLA-LA/src/HLA-LA.pl --graph PRG_MHC_GRCh38_withIMGT  --sampleID WT47_barcode75  --BAM WT47_barcode75.sorted.bam 
HLA-LA.pl

Identified paths:
        samtools_bin: /data/eval/tools/usr/local/bin/samtools
        bwa_bin: /data/eval/tools/bin/bwa
        java_bin: /usr/bin/java
        picard_sam2fastq_bin: /data/eval/tools/bin/picard.jar
        General working directory: /data/eval/tools/HLA-LA/working
        Sample-specific working directory: /data/eval/tools/HLA-LA/working/P129901065_barcode06

Have found no compatible reference specifications in /data/eval/tools/HLA-LA/src/../graphs/PRG_MHC_GRCh38_withIMGT/knownReferences - create a file for this BAM file and try again. at /data/eval/tools/HLA-LA/src/HLA-LA.pl line 364.

Specifically, here are the issues that need clarification:

  1. Please specify which reference in the graphs or from online repositories is appropriate to use for this tool.
  2. If someone needs to update the HLA or KIR references, what is the correct way to do so? The available files [Update graphs.txt] and [Update KIR data.txt] contain hard-coded paths to directories that are not in this repository and refer to tools that are not included here.

Thanks for your help.

Regards

Fredrick Mobegi @mobeginomics [twitter]

afadda1 commented 1 year ago

I'm also interested in running nanopore files through this tool. However, I would think that aligning the reads with standard long-read methods against the complete IMGT FASTA files should give accurate allele results, since the problem of short reads aligning to multiple locations does not arise here; that is the whole idea behind long-read sequencing for HLA typing.

AlexanderDilthey commented 11 months ago

Hi @fmobegi,

I would recommend just using the standard 1000 Genomes reference, e.g. from here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

... because we have a pre-existing reference specification file for the 1000G reference.
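
As a rough sketch, the workflow from the original post can be rerun with only the reference swapped for the 1000 Genomes file. The helper name `print_remap_commands` and the assumption that the FASTA has already been downloaded to the working directory are illustrative and not part of HLA-LA. The minimap2, samtools, and HLA-LA.pl options are copied unchanged from the original post. The function only prints the commands, so they can be reviewed before anything is run.

```shell
# Hypothetical helper (not part of HLA-LA): prints the remapping commands
# for one sample. Arguments: sample ID, path to the reference FASTA.
print_remap_commands() {
    sample="$1"
    ref="$2"
    # Map the nanopore reads and coordinate-sort the alignments.
    echo "minimap2 -ax map-ont -t 8 ${ref} ./fastq_files/${sample}.fastq.gz | samtools sort -o ${sample}.sorted.bam -T ${sample}.tmp"
    # Index the sorted BAM.
    echo "samtools index ${sample}.sorted.bam"
    # Run HLA-LA on the BAM, with the same options as in the original post.
    echo "HLA-LA.pl --graph PRG_MHC_GRCh38_withIMGT --sampleID ${sample} --BAM ${sample}.sorted.bam"
}

# Assumes the reference was downloaded to the current directory.
print_remap_commands WT47_barcode75 GRCh38_full_analysis_set_plus_decoy_hla.fa
```

Each printed line can be pasted into a shell once the paths have been checked against the local setup.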

If using another reference is important to you, send me the output from samtools idxstats on one of the BAMs based on that reference, and I will add a reference specification file.
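
For anyone preparing that output, a minimal sketch follows. `samtools idxstats` prints one tab-separated row per contig (name, length, mapped reads, unmapped reads) and needs an indexed BAM. The helper `contig_table` is a hypothetical convenience for eyeballing contig names and lengths. It is not part of samtools or HLA-LA, and the full, unfiltered idxstats output is what was requested.

```shell
# Hypothetical helper: reduce `samtools idxstats` output to contig name and length.
# idxstats columns (tab-separated): name, length, mapped reads, unmapped reads.
contig_table() {
    # Skip the final "*" row, which only counts unmapped reads without a contig.
    awk -F'\t' '$1 != "*" { print $1 "\t" $2 }'
}

# Usage (requires samtools and the indexed BAM from the original post):
#   samtools idxstats WT47_barcode75.sorted.bam > WT47_barcode75.idxstats.txt
#   contig_table < WT47_barcode75.idxstats.txt
```

Comparing the contig names and lengths across BAMs made with different references is a quick way to see how those references differ.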

KIR typing is not supported by HLA*LA.

@afadda1 When it comes to long reads, the issue is not so much accurate mapping as sequencing errors and determining the set of alleles that are present.

Best wishes

Alex