EBIvariation / vcf-validator

Validation suite for Variant Call Format (VCF) files, implemented using C++11
Apache License 2.0

Example for building .fai? #196

Closed CholoTook closed 4 years ago

CholoTook commented 4 years ago

Hi, I'm building a .fai using the following steps, but something must be wrong:

wget ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz
gunzip -c GRCh38_latest_genomic.fna.gz > GRCh38_latest_genomic.fna
../samtools-1.9/htslib-1.9/bgzip -c GRCh38_latest_genomic.fna > GRCh38_latest_genomic.fna.bgz
../samtools-1.9/samtools faidx GRCh38_latest_genomic.fna.bgz

Which gives:

GRCh38_latest_genomic.fna.gz
GRCh38_latest_genomic.fna
GRCh38_latest_genomic.fna.bgz
GRCh38_latest_genomic.fna.bgz.fai
GRCh38_latest_genomic.fna.bgz.gzi

However:

time ./vcf_assembly_checker_linux -i cardiomics-1000.vcf -f GRCh38_latest_genomic.fna.bgz -a GRCh38.p13_assembly_report.txt
[info] Reading from input VCF file...
[info] Reading from input FASTA file...
[info] Reading from input FASTA index file...
[info] Number of matches: 3/970
[info] Percentage of matches: 0.309278%

real    0m0.240s
user    0m0.237s
sys     0m0.003s

vs:

time ./vcf_assembly_checker_linux -i cardiomics-1000.vcf -f GRCh38_latest_genomic.fna -a GRCh38.p13_assembly_report.txt
[info] Reading from input VCF file...
[info] Reading from input FASTA file...
[info] Reading from input FASTA index file...
[info] Creating index from input FASTA file...
[info] Number of matches: 750/970
[info] Percentage of matches: 77.3196%

real    0m29.822s
user    0m27.487s
sys     0m2.334s

CholoTook commented 4 years ago

Is the issue applying GRCh38.p13_assembly_report.txt after having created the index?

CholoTook commented 4 years ago

BTW, just for completeness, I got that file from here:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_assembly_report.txt -O GRCh38.p13_assembly_report.txt

jmmut commented 4 years ago

The program should have complained about being passed a compressed FASTA (the ".fna.bgz" here); I don't think the code can work with that. We can decompress VCFs, but the index makes this complicated for FASTAs, and I think the library we use for reading the FASTA doesn't support compressed files.

Long story short, I'm afraid you have to use the decompressed fasta. Preparing the fasta index will make it run faster:

../samtools-1.9/samtools faidx GRCh38_latest_genomic.fna
./vcf_assembly_checker_linux -i cardiomics-1000.vcf -f GRCh38_latest_genomic.fna -a GRCh38.p13_assembly_report.txt
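
For reference, the `.fai` that `samtools faidx` writes is a tab-separated table with five columns per sequence: name, length, byte offset of the first base, bases per line, and bytes per line (including the line ending). A minimal Python sketch of that layout is below. `build_fai` is a hypothetical helper written only for illustration; it is not part of samtools or vcf-validator, and unlike samtools it does not verify that every line within a sequence has the same length.

```python
def build_fai(lines):
    """Build .fai text from an iterable of raw FASTA lines (bytes).

    `lines` can be a file opened in binary mode or a list of bytes objects.
    Hypothetical helper: a sketch of the index layout, not a samtools replacement.
    """
    records = []      # one [name, length, offset, linebases, linewidth] per sequence
    current = None    # record of the sequence currently being read
    position = 0      # byte offset of the start of the current line
    for line in lines:
        stripped = line.rstrip(b"\r\n")
        if line.startswith(b">"):
            fields = stripped[1:].split()
            # The sequence name is the header text up to the first whitespace.
            name = fields[0].decode("ascii") if fields else ""
            # The offset points at the first base, just past the header line.
            current = [name, 0, position + len(line), 0, 0]
            records.append(current)
        elif current is not None and stripped:
            if current[3] == 0:
                current[3] = len(stripped)  # bases per line
                current[4] = len(line)      # bytes per line, newline included
            current[1] += len(stripped)     # running sequence length
        position += len(line)
    return "".join("\t".join(str(value) for value in rec) + "\n" for rec in records)
```

Called on an uncompressed FASTA opened with `open(path, "rb")`, this should give the same five columns as `samtools faidx` for a well-formed file; in practice samtools remains the tool to use.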

We have a pending task to improve the messages about wrong parameters; I have updated it to include this constraint on compressed FASTAs.
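
A check of this kind can be sketched by looking at the first two bytes of the file: gzip and bgzip output both begin with the magic bytes 0x1f 0x8b. The offsets in a `.fai` refer to positions in the uncompressed text, so a reader that seeks directly into a bgzip-compressed file would land on unrelated bytes, which may be what produced the 3/970 matches above. The Python below is only an illustration; `starts_with_gzip_magic` and `require_uncompressed_fasta` are hypothetical names, and the validator's actual check (in C++) may look quite different.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip or bgzip (BGZF) file


def starts_with_gzip_magic(first_bytes):
    """Return True if the given leading bytes look like gzip/bgzip data."""
    return first_bytes[:2] == GZIP_MAGIC


def require_uncompressed_fasta(path):
    """Fail early instead of silently misreading a compressed FASTA.

    Hypothetical helper: raises ValueError when the file at `path`
    starts with the gzip magic bytes.
    """
    with open(path, "rb") as handle:
        if starts_with_gzip_magic(handle.read(2)):
            raise ValueError(
                f"{path} looks gzip/bgzip-compressed; "
                "please pass the decompressed FASTA instead"
            )
```

With such a check in place, passing `GRCh38_latest_genomic.fna.bgz` would stop with an explicit error rather than reporting a misleadingly low match rate.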

jmmut commented 4 years ago

I'm sorry about these annoying bugs. I know you didn't want to deal with experimental stuff, but these are new features and there are no previous stable versions of them in this repo.

CholoTook commented 4 years ago

No probs, thanks for the clear explanations.

tcezard commented 4 years ago

This was fixed in EVA-1731: closing