HISAT-genotype tutorial

davetang commented 4 years ago

Hello,

I'm trying to follow the HISAT-genotype tutorial but I'm having problems on the hisatgenotype step.

hisatgenotype --base hla --locus-list A -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz
1: Extracting reads from NA12892.extracted.1.fq.gz
Cloning into 'hisatgenotype_db'...
remote: Enumerating objects: 750, done.
remote: Total 750 (delta 0), reused 0 (delta 0), pack-reused 750
Receiving objects: 100% (750/750), 31.27 MiB | 9.07 MiB/s, done.
Resolving deltas: 100% (214/214), done.
No grch38 file found

# proceeds to download and extract grch38.tar.gz

/data/hisat-genotype/hisat2/hisat2-inspect:24: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Traceback (most recent call last):
  File "/data/hisat-genotype/hisat2/hisat2-inspect", line 73, in <module>
    main()
  File "/data/hisat-genotype/hisat2/hisat2-inspect", line 69, in main
    os.execv(inspect_bin_spec, arguments)
FileNotFoundError: [Errno 2] No such file or directory
Could not build fai index genome.fa.fai
No hla_backbone.fa file found
Building hla Database
No genome.fa.fai file found

# proceeds to download and extract grch38.tar.gz again
# then fails

Do you know what's wrong?

I'm running everything inside a Docker container and installed samtools using Conda. The Dockerfile used to create my image is below:

FROM ubuntu:18.04

RUN apt-get clean all && \
   apt-get update && \
   apt-get upgrade -y && \
   apt-get install -y \
      build-essential \
      git \
      wget \
      vim \
      libhdf5-dev \
      libcurl4-gnutls-dev \
      libssl-dev \
      libxml2-dev \
      libpng-dev \
      zlib1g-dev \
   && apt-get clean all && \
   apt-get purge && \
   rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

# Miniconda and dependencies
RUN cd /tmp/ && \
        wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
        bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3 && \
        /root/miniconda3/condabin/conda install -y python=3.7
ENV PATH=$PATH:/root/miniconda3/bin

chbe-helix commented 4 years ago

Hi Dave,

What happens if you let the script run for 30 minutes with the "no grch38 file found" message? It may be trying to download it and not giving a message that it is doing so. If this is the case then it is a messaging issue and I'd suggest letting it run until it completes. Let me know if this is the case.

Thanks, Chris

davetang commented 4 years ago

Hi Chris,

thanks for the reply! I saved the log this time:

hisatgenotype --base hla --locus-list A -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz 2> log

Here's the full log sans the wget output after running it for over an hour (it just hangs there afterwards):

cat log | perl -nle 'next if /\d+[hms]$/; print' 
    1: Extracting reads from NA12892.extracted.1.fq.gz
Cloning into 'hisatgenotype_db'...
Checking out files: 100% (273/273), done.
--2020-05-30 01:45:30--  ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz
           => ‘grch38.tar.gz’
Resolving ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)... 128.220.174.63
Connecting to ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)|128.220.174.63|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/infphilo/hisat2/data ... done.
==> SIZE grch38.tar.gz ... 4210306865
==> PASV ... done.    ==> RETR grch38.tar.gz ... done.
Length: 4210306865 (3.9G) (unauthoritative)

2020-05-30 02:11:37 (2.57 MB/s) - ‘grch38.tar.gz’ saved [4210306865]

/data/hisat-genotype/hisat2/hisat2-inspect:24: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Traceback (most recent call last):
  File "/data/hisat-genotype/hisat2/hisat2-inspect", line 73, in <module>
    main()
  File "/data/hisat-genotype/hisat2/hisat2-inspect", line 69, in main
    os.execv(inspect_bin_spec, arguments)        
FileNotFoundError: [Errno 2] No such file or directory
Could not build fai index genome.fa.fai
--2020-05-30 02:14:14--  ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz
           => ‘grch38.tar.gz’
Resolving ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)... 128.220.174.63
Connecting to ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)|128.220.174.63|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/infphilo/hisat2/data ... done.
==> SIZE grch38.tar.gz ... 4210306865
==> PASV ... done.    ==> RETR grch38.tar.gz ... done.
Length: 4210306865 (3.9G) (unauthoritative)

2020-05-30 02:38:19 (2.78 MB/s) - ‘grch38.tar.gz’ saved [4210306865]

/data/hisat-genotype/hisat2/hisat2-inspect:24: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
Traceback (most recent call last):
  File "/data/hisat-genotype/hisat2/hisat2-inspect", line 73, in <module>
    main()
  File "/data/hisat-genotype/hisat2/hisat2-inspect", line 69, in main
    os.execv(inspect_bin_spec, arguments)        
FileNotFoundError: [Errno 2] No such file or directory
Could not build fai index genome.fa.fai
No grch38 file found
No hla_backbone.fa file found
Building hla Database
No genome.fa.fai file found
No hla.graph.1.ht2 file found
No hla.graph.1.ht2 file found
Error: indexing HLA failed! Perhaps, you forgot to build hisat2 executables?

As I mentioned before I'm running this inside a Docker container and mounted hisat-genotype to /data/hisat-genotype. Inside the hisat2 directory are the compiled executables.

ls -1 hisat2/hisat2*
hisat2/hisat2
hisat2/hisat2-align-s
hisat2/hisat2-build
hisat2/hisat2-build-new
hisat2/hisat2-build-s
hisat2/hisat2-inspect
hisat2/hisat2.cpp
hisat2/hisat2.sln
hisat2/hisat2_build.cpp
hisat2/hisat2_build_main.cpp
hisat2/hisat2_extract_exons.py
hisat2/hisat2_extract_snps_haplotypes_UCSC.py
hisat2/hisat2_extract_snps_haplotypes_VCF.py
hisat2/hisat2_extract_splice_sites.py
hisat2/hisat2_inspect.cpp
hisat2/hisat2_main.cpp
hisat2/hisat2_read_statistics.py
hisat2/hisat2_repeat.cpp
hisat2/hisat2_repeat_main.cpp
hisat2/hisat2_simulate_reads.py

# other output not shown

The setup.sh script added the path to ~/.bashrc (which I sourced), so the executables can be found.

which hisat2-build
/data/hisat-genotype/hisat2/hisat2-build

which hisat2-inspect
/data/hisat-genotype/hisat2/hisat2-inspect

Inside hla-analysis are some output files but the files are empty except for hla.version.

ls -1 hla-analysis/hla*
hla-analysis/hla.allele
hla-analysis/hla.haplotype
hla-analysis/hla.index.snp
hla-analysis/hla.link
hla-analysis/hla.locus
hla-analysis/hla.partial
hla-analysis/hla.snp
hla-analysis/hla.snp.freq
hla-analysis/hla.version
hla-analysis/hla_backbone.fa
hla-analysis/hla_sequences.fa

cat hla-analysis/hla.version 
Database hla derived from HISATgenotype DB version: NONE

Inside hisatgenotype_out are the extracted reads.

ls -1 hisatgenotype_out/
NA12892.extracted.1.fq.gz-hla-extracted-1.fq.gz
NA12892.extracted.1.fq.gz-hla-extracted-2.fq.gz

Thanks, Dave

davetang commented 4 years ago

I've been checking out the Python files and found out that the error below:

/data/hisat-genotype/hisat2/hisat2-inspect:24: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp

came from the download_genome_and_index function in hisatgenotype_typing_common.py, which is caused by trying to run hisat2-inspect grch38/genome > genome.fa. This leads to a missing genome.fa file and also the genome.fa.fai, which is probably why grch38.tar.gz is downloaded twice because hisatgenotype is looking for these files.

I can manually create genome.fa by following the shell script in ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz and placing genome.fa and genome.fa.fai in hla-analysis and this stops hisatgenotype from downloading the genome again but I'm still stuck at the step where the script is looking for the hla index files.

hisatgenotype --base hla --locus-list A -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz
        Files found: Omitted extracting reads from NA12892.extracted.1.fq.gz
No hla.graph.1.ht2 file found
No hla.graph.1.ht2 file found
Error: indexing HLA failed! Perhaps, you forgot to build hisat2 executables?

chbe-helix commented 4 years ago

Hi Dave,

That's very useful information. It sounds like it may be useful to go through a manual install. While I am putting together a set of steps to try, could you try building hisat2 binaries using the following command and let me know what happens?

cd hisat2
make hisat2-align hisat2-build hisat2-inspect

Thanks, Chris

davetang commented 4 years ago

Hi Chris,

cd hisat2/
make hisat2-align hisat2-build hisat2-inspect
make: *** No rule to make target 'hisat2-align'.  Stop.

Cheers, Dave

davetang commented 4 years ago

Hi Chris,

I got it working for now; below is my workflow.

Download required files.

mkdir /data/dtang/hisat-genotype-tutorial/ && cd /data/dtang/hisat-genotype-tutorial/
wget -c -N ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat-genotype/data/genotype_genome_20180128.tar.gz
wget -c -N ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat-genotype/data/hla/ILMN.tar.gz
wget -c -N  ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz
cd /data/dtang

Clone repository and run setup.sh.

git clone https://github.com/DaehwanKimLab/hisat-genotype.git hisat-genotype-test
cd hisat-genotype-test
sed -i 's/\r$//;s/sumbodule/submodule/' setup.sh
bash setup.sh

Compile all.

cd hisat2
make clean
make
cd ..

Prepare hla-analysis directory.

mkdir hla-analysis && cd hla-analysis
tar xvzf /data/dtang/hisat-genotype-tutorial/genotype_genome_20180128.tar.gz
tar xvzf /data/dtang/hisat-genotype-tutorial/ILMN.tar.gz
tar xvzf /data/dtang/hisat-genotype-tutorial/grch38.tar.gz
../hisat2/hisat2-inspect grch38/genome > genome.fa
samtools faidx genome.fa

Run with test data.

hisatgenotype --base hla --locus-list A -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz

Same typing results as https://daehwankimlab.github.io/hisat-genotype/tutorials/.

cat hisatgenotype_out/assembly_graph-hla-NA12892_extracted_1_fq_gz-hla-extracted-1_fq.report 
# VERSIONS:
# HISAT2 - 2.2.0

# HISAT-genotype - 1.3.0

# Database - Database hla derived from HISATgenotype DB version: NONE
# COMMAND:
/data/dtang/hisat-genotype-test/hisatgenotype --base hla --locus-list A -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz
        A

                hisat2 graph
                        1496 reads and 769 pairs are aligned
                                1 A*02:01:01:01 (count: 419)
                                2 A*02:251 (count: 407)
                                3 A*02:610:02 (count: 405)
                                4 A*02:524:02 (count: 403)
                                5 A*02:650 (count: 403)
                                6 A*02:01:123 (count: 402)
                                7 A*02:647 (count: 401)
                                8 A*02:01:126 (count: 400)
                                9 A*02:562 (count: 400)
                                10 A*02:645 (count: 400)

                                1 ranked A*02:01:01:01 (abundance: 51.95%)
                                2 ranked A*11:01:01:01 (abundance: 48.05%)

chbe-helix commented 4 years ago

Hi David,

That's great! I'll work to stabilize the setup script using your outputs and update the website to reflect the current state. Thanks for bringing this to my attention and working to resolve it.

Thanks, Chris

DaehwanKimLab / hisat-genotype

HISAT-genotype tutorial #11