DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0
473 stars 116 forks source link

hisatgenotype_v1.1.2_beta-devel build-genome gives some empty files #204

Closed mariafiruleva closed 5 years ago

mariafiruleva commented 5 years ago

Hi @chbe-helix , Earlier, you recommended me to use the hisatgenotype_v1.1.2_beta-devel branch. When I run: hisatgenotype_toolkit build-genome --base hla -v -p 16 some files which I receive are empty (e.g., hla.snp, hla.partial, hla.allele). If I run this code block after previous command:

hisatgenotype_toolkit extract-vars --base hla --inter-gap 30 --intra-gap 50 --min-var-freq 0.1
hisat2-build -p 16 --snp hla.index.snp --haplotype hla.haplotype hla.fa hla

I reveice fullfilled files.

In my initial dir, I have ILMN, grch38 and hisatgenotype_db subdirs (3370 IMGT version), genome.fa and genome.fai files.

chbe-helix commented 5 years ago

Hi Maria,

The build-genomes (originally hisatgenotype_build_genomes.py) is used to build the full genotype_genome and will require >200Gb of RAM to complete. The --base option in build-genome is actually the name of the genome file to build.

What this script does is: 1) build all gene databases found in hisatgenotype_db using extract-var 2) build a genotype genome named the same as --base from SNP info and the gene databases

By naming the genotype genome hla you are overwriting the original hla files built resulting in the script trying to read from empty files it is now trying to write to. I hope this helps with your issue.

mariafiruleva commented 5 years ago

Thanks, chbe-helix. I will be really grateful if you provide a full instruction for running hisatgenotype_v1.1.2_beta-devel pipeline for HLA typing.

chbe-helix commented 5 years ago

Hi Maria,

Sure thing! Here is the new HISATgenotype website (specifically a link to the tutorial): https://daehwankimlab.github.io/hisat-genotype/tutorials/

For you, since you are using a different HLA database than we provide, you will just need to add these steps prior to running the hisatgenotype wrapper:

mkdir hisatgenotype_db
git clone -b 3350 https://github.com/ANHIG/IMGTHLA.git hisatgenotype_db/HLA
hisatgenotype_toolkit extract-vars --base hla
mariafiruleva commented 5 years ago

Ok, I run:

mkdir hisatgenotype_db
git clone -b 3350 https://github.com/ANHIG/IMGTHLA.git hisatgenotype_db/HLA
hisatgenotype_toolkit extract-vars --base hla

As a result, I've all empty files in my initial directory:

  51662809 genome.fa
       194 genome.fa.fai
wc: grch38: Is a directory
         0 grch38
wc: hisatgenotype_db: Is a directory
         0 hisatgenotype_db
         0 hla.allele
         0 hla_backbone.fa
         0 hla.haplotype
         0 hla.index.snp
         0 hla.link
         0 hla.locus
         0 hla.partial
         0 hla_sequences.fa
         0 hla.snp
         0 hla.snp.freq
  51663003 total

After that, I run:

hisatgenotype --base hla --locus-list A -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz --threads 16

and received an error:

hisatgenotype --base hla --locus-list A -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz --threads 16
Error: genotype_genome related files do not exist as follows:
    genotype_genome.fa
    genotype_genome.locus
    genotype_genome.snp
    genotype_genome.haplotype
    genotype_genome.link
    genotype_genome.coord
    genotype_genome.clnsig
    genotype_genome.1.ht2
    genotype_genome.2.ht2
    genotype_genome.3.ht2
    genotype_genome.4.ht2
    genotype_genome.5.ht2
    genotype_genome.6.ht2
    genotype_genome.7.ht2
    genotype_genome.8.ht2
chbe-helix commented 5 years ago

Hi Maria,

I found what I believe to be the root of your problem and a bug in my code that is confounding everything. Here is the code I ran to get HISAT-genotype running:

git clone https://github.com/DaehwanKimLab/hisat-genotype hisat-genotype
cd hisat-genotype/
git checkout hisatgenotype_v1.1.2_beta-devel 
make hisat2-align-s hisat2-build-s hisat2-inspect-s

This next step needs to have the PATH/TO/DIR/ that contains hisat-genotype and will be specific to your configuration

export PATH=hisat-genotype:$PATH
export PYTHONPATH=hisat-genotype/hisatgenotype_modules:$PYTHONPATH

The addition to this step is the downloading of the genotype_genome. This is the small bug in my script as this should be automatic. For now we will have to manually download it.

mkdir hla-analysis
cd hla-analysis/

wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat-genotype/data/hla/ILMN.tar.gz
tar xvzf ILMN.tar.gz 

wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat-genotype/data/genotype_genome_20180128.tar.gz
tar xvzf genotype_genome_20180128.tar.gz

mkdir hisatgenotype_db/
git clone -b 3350 https://github.com/ANHIG/IMGTHLA.git hisatgenotype_db/HLA

Here is what I believe to be the crux of your troubles. The DQB data in this database version has many inconsistencies between the files and possibly some missing sequences causing a state assertion error forcing the script to stop prematurely. At this time there is nothing I can do to fix it as it is a database problem. The best bet is to remove it from your analysis.

mv hisatgenotype_db/HLA/msf/DQB* ./
mv hisatgenotype_db/HLA/fasta/DQB* ./

Now you should be able to run hisatgenotype as desribed.

hisatgenotype_toolkit extract-vars --base HLA
hisatgenotype --base hla --locus-list A -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz 
mariafiruleva commented 5 years ago

Thanks again. ) It work's fine for me. As I understood, for HLA typing there is no need to download and extract _genotypegenome? If I run hisat2 without these files with some changes in your code, everything is ok.

Would you please test 3350 db with old branch of hisat2? I've not seen this error early, and also there is no warning in case of code in my first post of this issue.

chbe-helix commented 5 years ago

Hi Maria,

If you plan on "extracting" the HLA reads prior to typing (this is recommended in v1.0.1 and embedded in the pipeline of v1.1.2) then yes you'll need the genotype_genome. Currently there is no way to omit this step in v1.1.2. There will be an option to bypass read extraction in a later release.

I tested hisatgenotype v1.0.1 with the 3350 db and found the same problem. You are welcome to use v1.0.1 though I would still recommend removing DQB gene data and extracting the reads.

You are absolutely correct on the error messages. v1.0.1 has messages that pop-up that you can read and debug a little easier. v1.1.2 should also have these messages but, due to a bug in my multithreading code, these are not displayed. I am working on a fix.