DaehwanKimLab / hisat-genotype

GNU General Public License v3.0

Updating hisatgenotype_db #12

Closed davetang closed 4 years ago

davetang commented 4 years ago

Hello again,

The files in https://github.com/infphilo/hisatgenotype_db are quite old. I tried to replace the HLA files with the latest version, but I got the following error when building the index:

Error: Failed Variant Sorting!!
1284-I-GGAGACGCTGCAGC_1284-I-GAAGGAGACGCTGCAGC

Are there plans on updating the database or instructions on how to do so?

Thanks, Dave

chbe-helix commented 4 years ago

Hi Dave,

Yes, there are plans to update the database. The error comes from a sanity check that runs when the variants are sorted. I left it active in the code temporarily to help users debug potentially catastrophic errors. What you have there looks like a simple change in the way the sorting is done in that version of the database.
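To make the idea of a sort-order sanity check concrete, here is a minimal hypothetical sketch. It is not the check HISAT-genotype actually runs. It assumes variant IDs have the form position-type-data (inferred only from the error message above) and that variants are ordered by position, then type, then data compared as plain strings. The function names are made up for illustration.

```python
from typing import List, Optional, Tuple


def parse_variant_id(var_id: str) -> Tuple[int, str, str]:
    """Split an ID such as '1284-I-GGAGACGCTGCAGC' into (position, type, data).

    The position-type-data layout is an assumption read off the error message.
    """
    pos, var_type, data = var_id.split("-", 2)
    return int(pos), var_type, data


def first_unsorted_pair(var_ids: List[str]) -> Optional[Tuple[str, str]]:
    """Return the first adjacent pair that is out of order, or None if sorted.

    Ordering used here (an assumption): position, then type, then data,
    with the data field compared as a plain string.
    """
    keys = [parse_variant_id(v) for v in var_ids]
    for i in range(len(keys) - 1):
        if keys[i] > keys[i + 1]:
            return var_ids[i], var_ids[i + 1]
    return None


# The two IDs from the error message, split on the underscore that joins them.
pair = "1284-I-GGAGACGCTGCAGC_1284-I-GAAGGAGACGCTGCAGC".split("_")
print(first_unsorted_pair(pair))
```

Under this assumed ordering the pair from the error message is flagged, because the two insertions share a position and type and differ only in their sequence. That does not establish which ordering the tool itself expects.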

Long story short: at the top of the hisatgenotype_typing_core.py, hisatgenotype_typing_common.py, and hisatgenotype_typing_process.py files in the hisatgenotype_modules folder, change SANITY_CHECK = True to SANITY_CHECK = False. You should then be able to rerun the code, and the index should build properly.

This will be set to False by default in the next release.
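If editing three files by hand is inconvenient, the change can be scripted. The following is a minimal sketch, assuming each file contains a line that starts with SANITY_CHECK = True at the left margin (the exact spacing in the real files is an assumption). The helper name disable_sanity_check is made up for illustration and is not part of HISAT-genotype.

```python
import re
from pathlib import Path

# File names as listed in the comment above.
MODULE_FILES = (
    "hisatgenotype_typing_core.py",
    "hisatgenotype_typing_common.py",
    "hisatgenotype_typing_process.py",
)

# Matches a top-level "SANITY_CHECK = True" assignment (assumed layout).
_FLAG = re.compile(r"^(SANITY_CHECK\s*=\s*)True\b", flags=re.MULTILINE)


def disable_sanity_check(modules_dir):
    """Rewrite SANITY_CHECK = True to False in each listed file.

    Returns the names of the files that were actually modified.
    Files that are missing or already set to False are left untouched.
    """
    changed = []
    for name in MODULE_FILES:
        path = Path(modules_dir) / name
        if not path.is_file():
            continue
        text = path.read_text()
        new_text, count = _FLAG.subn(r"\g<1>False", text)
        if count:
            path.write_text(new_text)
            changed.append(name)
    return changed
```

Calling disable_sanity_check with the path to your hisatgenotype_modules folder returns the list of files it changed. An empty list means nothing matched and the flag should be checked by hand.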

Thanks! Chris

davetang commented 4 years ago

Hi Chris,

Thank you! Turning off the sanity checking worked. Here's the new output using the newer database:

# VERSIONS:
# HISAT2 - 2.2.0

# HISAT-genotype - 1.3.0

# Database - Database hla derived from HISATgenotype DB version: NONE
# COMMAND:
/data/dtang/hisat-genotype/hisatgenotype --threads 4 --base hla --locus-list A -1 ILMN/NA12892.extracted.1.fq.gz -2 ILMN/NA12892.extracted.2.fq.gz
        A

                hisat2 graph
                        1502 reads and 771 pairs are aligned
                                1 A*02:01:01:01 (count: 418)
                                2 A*02:01:01:31 (count: 418)
                                3 A*02:01:01:16 (count: 407)
                                4 A*02:01:01:22 (count: 407)
                                5 A*02:658 (count: 405)
                                6 A*02:904 (count: 405)
                                7 A*02:01:127 (count: 404)
                                8 A*02:01:31 (count: 404)
                                9 A*02:20:01 (count: 404)
                                10 A*02:20:02 (count: 404)

                                1 ranked A*11:01:01:01 (abundance: 43.14%)
                                2 ranked A*02:01:01:59 (abundance: 21.18%)
                                3 ranked A*02:01:01:01 (abundance: 14.27%)
                                4 ranked A*02:01:01:31 (abundance: 14.27%)
                                5 ranked A*02:01:01:22 (abundance: 7.13%)

Just for your reference, I used files cloned from https://github.com/ANHIG/IMGTHLA. The newer hla.dat file is over 100 MB in size, so Git LFS is required to download it. I guess when you update https://github.com/DaehwanKimLab/hisatgenotype_db you could use Git LFS for sharing this file, or simply zip up hla.dat and have your script unzip it after cloning.
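As a rough illustration of the second option, here is a minimal sketch that compresses a file before it is committed and decompresses it after cloning. It uses gzip from the Python standard library rather than the zip format. The function names and the hla.dat.gz naming are assumptions for illustration, not something hisatgenotype_db currently does.

```python
import gzip
import shutil
from pathlib import Path


def compress_file(src, dst=None):
    """Gzip src into dst (default: src name plus '.gz') and return the new path."""
    src = Path(src)
    dst = Path(dst) if dst else src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, so large files are fine
    return dst


def decompress_file(src, dst=None):
    """Gunzip src into dst (default: src without its '.gz' suffix) and return the new path."""
    src = Path(src)
    dst = Path(dst) if dst else src.with_suffix("")
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

A database maintainer would run compress_file("hla.dat") once before committing, and a setup script would call decompress_file("hla.dat.gz") after cloning. Whether the compressed file stays under the hosting size limit would need to be checked for each release.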

Thanks again! Dave

chbe-helix commented 4 years ago

Hi Dave,

Fabulous! Glad it is working. I'll make a note to update the website with instructions for updating the database manually. Thanks!

Thanks, Chris

shiwanyin commented 2 years ago

Hi Dave,

could you tell me how to update hisatgenotype_db? I have tried many times but failed. This is what I did:

I updated the DB by downloading the fasta, msf, and hla.dat files from https://github.com/ANHIG/IMGTHLA. I then used the DB with the basic command "hisatgenotype --base hla --locus-list A,....... -1 ...... -2 ......."

davetang commented 2 years ago

@shiwanyin I can no longer update to the newest IMGTHLA database (3.46.0) using HISAT-genotype v1.3.3. Actually, even though I could update the database previously, I switched back to using the IMGTHLA database provided by hisatgenotype_db because I was getting incorrect typing results with my manually updated database.