DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

human herpesvirus 2 missing from database #782

Closed rgiannico closed 6 months ago

rgiannico commented 6 months ago

The Human herpesvirus 2 sequence is absent from the viral Kraken database (and all the databases derived from it). It's strange because:

Is there a specific reason why it is missing? Here is some simple code for reproducibility:

# get krakendb viral taxa
$ wget https://genome-idx.s3.amazonaws.com/kraken/viral_20231009/library_report.tsv
$ grep "herpesvirus 2" library_report.tsv | cut -f 2 | sort > krakendb.txt

# get RefSeq viral taxa
$ wget https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.1.genomic.fna.gz
$ zgrep "^>" viral.1.1.genomic.fna.gz | grep "herpesvirus 2" | sort > viralgenomic.txt

# find differences
$ diff -y krakendb.txt viralgenomic.txt
>NC_001350.1 Saimiriine herpesvirus 2 complete genome           >NC_001350.1 Saimiriine herpesvirus 2 complete genome
>NC_001650.2 Equid herpesvirus 2 strain 86/67, complete genom   >NC_001650.2 Equid herpesvirus 2 strain 86/67, complete genom
                                                              > >NC_001798.2 Human herpesvirus 2 strain HG52, complete genome
>NC_002229.3 Gallid herpesvirus 2, complete genome              >NC_002229.3 Gallid herpesvirus 2, complete genome
>NC_003521.1 Panine herpesvirus 2 strain Heberling, complete    >NC_003521.1 Panine herpesvirus 2 strain Heberling, complete
>NC_006560.1 Cercopithecine herpesvirus 2, complete genome      >NC_006560.1 Cercopithecine herpesvirus 2, complete genome
>NC_007646.1 Ovine herpesvirus 2 strain BJ1035, complete geno   >NC_007646.1 Ovine herpesvirus 2 strain BJ1035, complete geno
>NC_007653.1 Papiine herpesvirus 2, complete genome             >NC_007653.1 Papiine herpesvirus 2, complete genome
>NC_008210.1 Ranid herpesvirus 2 strain ATCC VR-568, complete   >NC_008210.1 Ranid herpesvirus 2 strain ATCC VR-568, complete
>NC_019495.1 Cyprinid herpesvirus 2 strain ST-J1, complete ge   >NC_019495.1 Cyprinid herpesvirus 2 strain ST-J1, complete ge
>NC_020231.1 Caviid herpesvirus 2 strain 21222, complete geno   >NC_020231.1 Caviid herpesvirus 2 strain 21222, complete geno
>NC_024382.1 Alcelaphine herpesvirus 2 isolate topi-AlHV-2, c   >NC_024382.1 Alcelaphine herpesvirus 2 isolate topi-AlHV-2, c
>NC_036579.1 Ictalurid herpesvirus 2 strain 760/94, complete    >NC_036579.1 Ictalurid herpesvirus 2 strain 760/94, complete
>NC_038265.1 Porcine lymphotropic herpesvirus 2 isolate 568 l   >NC_038265.1 Porcine lymphotropic herpesvirus 2 isolate 568 l
>NC_038860.1 Pongine herpesvirus 2 (Orangutan herpesvirus) gB   >NC_038860.1 Pongine herpesvirus 2 (Orangutan herpesvirus) gB
>NC_043042.1 Acipenserid herpesvirus 2 strain SRWSHV, partial   >NC_043042.1 Acipenserid herpesvirus 2 strain SRWSHV, partial
>NC_043044.1 Salmonid herpesvirus 2 isolate NeVTA ORF68-like    >NC_043044.1 Salmonid herpesvirus 2 isolate NeVTA ORF68-like
>NC_043059.1 Caprine herpesvirus 2 glycoprotein B (gB) and DN   >NC_043059.1 Caprine herpesvirus 2 glycoprotein B (gB) and DN
>NC_043062.1 Phocid herpesvirus 2 DNA-dependent DNA polymeras   >NC_043062.1 Phocid herpesvirus 2 DNA-dependent DNA polymeras
>NC_043063.1 Iguanid herpesvirus 2 DNA-dependent DNA polymera   >NC_043063.1 Iguanid herpesvirus 2 DNA-dependent DNA polymera
>NC_075563.1 Cervid alphaherpesvirus 2 strain Norway, complet   >NC_075563.1 Cervid alphaherpesvirus 2 strain Norway, complet
>NC_075802.1 Salmonid herpesvirus 2 isolate NeVTA DNA polymer   >NC_075802.1 Salmonid herpesvirus 2 isolate NeVTA DNA polymer
>NC_076512.1 Bovine alphaherpesvirus 2 strain C1Z FZR, comple   >NC_076512.1 Bovine alphaherpesvirus 2 strain C1Z FZR, comple
>NC_076513.1 Macropodid alphaherpesvirus 2 strain V3077/08, c   >NC_076513.1 Macropodid alphaherpesvirus 2 strain V3077/08, c
>NC_076966.1 Cacatuid alphaherpesvirus 2 isolate CaHV2/Melbou   >NC_076966.1 Cacatuid alphaherpesvirus 2 isolate CaHV2/Melbou

jenniferlu717 commented 6 months ago

The Human alphaherpesvirus 2 assembly is listed as a "scaffold" level assembly in the NCBI assembly_summary.txt file. By default, Kraken only uses complete genomes, so this assembly is skipped when the viral library is built.
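
To check the assembly level yourself, something like the following might help. It is only a sketch: the helper name `assembly_levels` is made up for illustration, and it assumes the usual assembly_summary.txt layout where column 8 is the organism name and column 12 is the assembly level. The download URL in the comments is the RefSeq viral summary file; adjust it if you use a different source.

```shell
# Hypothetical helper: print organism name and assembly level for rows of an
# NCBI assembly_summary.txt file whose organism name contains a given string.
# Assumption: tab-separated file, column 8 = organism_name,
# column 12 = assembly_level, header lines start with '#'.
assembly_levels() {
    # $1 = substring to look for in the organism name
    # $2 = path to assembly_summary.txt
    awk -F '\t' -v pat="$1" '
        /^#/ { next }                          # skip header/comment lines
        index($8, pat) { print $8 "\t" $12 }   # name and assembly level
    ' "$2"
}

# Example usage (requires network access, so left commented out here):
# wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt
# assembly_levels "herpesvirus 2" assembly_summary.txt
```

Rows whose second column is not "Complete Genome" are the ones that would be left out under the default behaviour described above.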