Closed davetang closed 4 years ago
Hi Dave,
What happens if you let the script run for 30 minutes with the "no grch38 file found" message? It may be trying to download it and not giving a message that it is doing so. If this is the case then it is a messaging issue and I'd suggest letting it run until it completes. Let me know if this is the case.
Thanks, Chris
Hi Chris,
thanks for the reply! I saved the log this time:
hisatgenotype --base hla --locus-list A -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz 2> log
Here's the full log sans the wget output after running it for over an hour (it just hangs there afterwards):
cat log | perl -nle 'next if /\d+[hms]$/; print'
1: Extracting reads from NA12892.extracted.1.fq.gz
Cloning into 'hisatgenotype_db'...
Checking out files: 100% (273/273), done.
--2020-05-30 01:45:30-- ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz
=> ‘grch38.tar.gz’
Resolving ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)... 128.220.174.63
Connecting to ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)|128.220.174.63|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/infphilo/hisat2/data ... done.
==> SIZE grch38.tar.gz ... 4210306865
==> PASV ... done. ==> RETR grch38.tar.gz ... done.
Length: 4210306865 (3.9G) (unauthoritative)
2020-05-30 02:11:37 (2.57 MB/s) - ‘grch38.tar.gz’ saved [4210306865]
/data/hisat-genotype/hisat2/hisat2-inspect:24: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
Traceback (most recent call last):
File "/data/hisat-genotype/hisat2/hisat2-inspect", line 73, in <module>
main()
File "/data/hisat-genotype/hisat2/hisat2-inspect", line 69, in main
os.execv(inspect_bin_spec, arguments)
FileNotFoundError: [Errno 2] No such file or directory
Could not build fai index genome.fa.fai
--2020-05-30 02:14:14-- ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz
=> ‘grch38.tar.gz’
Resolving ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)... 128.220.174.63
Connecting to ftp.ccb.jhu.edu (ftp.ccb.jhu.edu)|128.220.174.63|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/infphilo/hisat2/data ... done.
==> SIZE grch38.tar.gz ... 4210306865
==> PASV ... done. ==> RETR grch38.tar.gz ... done.
Length: 4210306865 (3.9G) (unauthoritative)
2020-05-30 02:38:19 (2.78 MB/s) - ‘grch38.tar.gz’ saved [4210306865]
/data/hisat-genotype/hisat2/hisat2-inspect:24: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
Traceback (most recent call last):
File "/data/hisat-genotype/hisat2/hisat2-inspect", line 73, in <module>
main()
File "/data/hisat-genotype/hisat2/hisat2-inspect", line 69, in main
os.execv(inspect_bin_spec, arguments)
FileNotFoundError: [Errno 2] No such file or directory
Could not build fai index genome.fa.fai
No grch38 file found
No hla_backbone.fa file found
Building hla Database
No genome.fa.fai file found
No hla.graph.1.ht2 file found
No hla.graph.1.ht2 file found
Error: indexing HLA failed! Perhaps, you forgot to build hisat2 executables?
As I mentioned before I'm running this inside a Docker container and mounted hisat-genotype
to /data/hisat-genotype
. Inside the hisat2
directory are the compiled executables.
ls -1 hisat2/hisat2*
hisat2/hisat2
hisat2/hisat2-align-s
hisat2/hisat2-build
hisat2/hisat2-build-new
hisat2/hisat2-build-s
hisat2/hisat2-inspect
hisat2/hisat2.cpp
hisat2/hisat2.sln
hisat2/hisat2_build.cpp
hisat2/hisat2_build_main.cpp
hisat2/hisat2_extract_exons.py
hisat2/hisat2_extract_snps_haplotypes_UCSC.py
hisat2/hisat2_extract_snps_haplotypes_VCF.py
hisat2/hisat2_extract_splice_sites.py
hisat2/hisat2_inspect.cpp
hisat2/hisat2_main.cpp
hisat2/hisat2_read_statistics.py
hisat2/hisat2_repeat.cpp
hisat2/hisat2_repeat_main.cpp
hisat2/hisat2_simulate_reads.py
# other output not shown
The setup.sh
script added the path to ~/.bashrc
(which I sourced), so the executables can be found.
which hisat2-build
/data/hisat-genotype/hisat2/hisat2-build
which hisat2-inspect
/data/hisat-genotype/hisat2/hisat2-inspect
Inside hla-analysis
are some output files but the files are empty except for hla.version
.
ls -1 hla-analysis/hla*
hla-analysis/hla.allele
hla-analysis/hla.haplotype
hla-analysis/hla.index.snp
hla-analysis/hla.link
hla-analysis/hla.locus
hla-analysis/hla.partial
hla-analysis/hla.snp
hla-analysis/hla.snp.freq
hla-analysis/hla.version
hla-analysis/hla_backbone.fa
hla-analysis/hla_sequences.fa
cat hla-analysis/hla.version
Database hla derived from HISATgenotype DB version: NONE
Inside hisatgenotype_out
are the extracted reads.
ls -1 hisatgenotype_out/
NA12892.extracted.1.fq.gz-hla-extracted-1.fq.gz
NA12892.extracted.1.fq.gz-hla-extracted-2.fq.gz
Thanks, Dave
I've been checking out the Python files and found out that the error below:
/data/hisat-genotype/hisat2/hisat2-inspect:24: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
came from the download_genome_and_index
function in hisatgenotype_typing_common.py
, which is caused by trying to run hisat2-inspect grch38/genome > genome.fa
. This leads to a missing genome.fa
file and also the genome.fa.fai
, which is probably why grch38.tar.gz
is downloaded twice because hisatgenotype
is looking for these files.
I can manually create genome.fa by following the shell script in ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz
and placing genome.fa
and genome.fa.fai
in hla-analysis
and this stops hisatgenotype
from downloading the genome again but I'm still stuck at the step where the script is looking for the hla index files.
hisatgenotype --base hla --locus-list A -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz
Files found: Omitted extracting reads from NA12892.extracted.1.fq.gz
No hla.graph.1.ht2 file found
No hla.graph.1.ht2 file found
Error: indexing HLA failed! Perhaps, you forgot to build hisat2 executables?
Hi Dave,
That's very useful information. It sounds like it may be useful to go through a manual install. While I am putting together a set of steps to try, could you try building hisat2 binaries using the following command and let me know what happens?
cd hisat2
make hisat2-align hisat2-build hisat2-inspect
Thanks, Chris
Hi Chris,
cd hisat2/
make hisat2-align hisat2-build hisat2-inspect
make: *** No rule to make target 'hisat2-align'. Stop.
Cheers, Dave
Hi Chris,
I got it working for now; below is my workflow.
Download required files.
mkdir /data/dtang/hisat-genotype-tutorial/ && cd /data/dtang/hisat-genotype-tutorial/
wget -c -N ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat-genotype/data/genotype_genome_20180128.tar.gz
wget -c -N ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat-genotype/data/hla/ILMN.tar.gz
wget -c -N ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38.tar.gz
cd /data/dtang
Clone repository and run setup.sh
.
git clone https://github.com/DaehwanKimLab/hisat-genotype.git hisat-genotype-test
cd hisat-genotype-test
sed -i 's/\r$//;s/sumbodule/submodule/' setup.sh
bash setup.sh
Compile all.
cd hisat2
make clean
make
cd ..
Prepare hla-analysis
directory.
mkdir hla-analysis && cd hla-analysis
tar xvzf /data/dtang/hisat-genotype-tutorial/genotype_genome_20180128.tar.gz
tar xvzf /data/dtang/hisat-genotype-tutorial/ILMN.tar.gz
tar xvzf /data/dtang/hisat-genotype-tutorial/grch38.tar.gz
../hisat2/hisat2-inspect grch38/genome > genome.fa
samtools faidx genome.fa
Run with test data.
hisatgenotype --base hla --locus-list A -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz
Same typing results as https://daehwankimlab.github.io/hisat-genotype/tutorials/.
cat hisatgenotype_out/assembly_graph-hla-NA12892_extracted_1_fq_gz-hla-extracted-1_fq.report
# VERSIONS:
# HISAT2 - 2.2.0
# HISAT-genotype - 1.3.0
# Database - Database hla derived from HISATgenotype DB version: NONE
# COMMAND:
/data/dtang/hisat-genotype-test/hisatgenotype --base hla --locus-list A -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz
A
hisat2 graph
1496 reads and 769 pairs are aligned
1 A*02:01:01:01 (count: 419)
2 A*02:251 (count: 407)
3 A*02:610:02 (count: 405)
4 A*02:524:02 (count: 403)
5 A*02:650 (count: 403)
6 A*02:01:123 (count: 402)
7 A*02:647 (count: 401)
8 A*02:01:126 (count: 400)
9 A*02:562 (count: 400)
10 A*02:645 (count: 400)
1 ranked A*02:01:01:01 (abundance: 51.95%)
2 ranked A*11:01:01:01 (abundance: 48.05%)
Hi David,
That's great! I'll work to stabilize the setup script using your outputs and update the website to reflect the current state. Thanks for bringing this to my attention and working to resolve it.
Thanks, Chris
Hello,
I'm trying to follow the HISAT-genotype tutorial but I'm having problems on the
hisatgenotype
step.Do you know what's wrong?
I'm running everything inside a Docker container and installed
samtools
using Conda. The Dockerfile used to create my image is below: