Generating protein DB from custom VCF fails

enriquea commented 3 years ago

Hi @ypriverol and @husensofteng ,

I followed the example in the documentation (https://pgatk.readthedocs.io/en/latest/pypgatk.html#variants-vcf-to-protein-sequences) and tried to generate a protein DB from a custom VEP-annotated VCF. The command exit without error, however, the output is empty.

Here is what I'm running:

python pypgatk_cli.py vcf-to-proteindb --vcf vcf_vep/veped_snp_indel.sorted.vcf \
                                    --input_fasta ensembl_files/Homo_sapiens.GRCh38.cdna.all.fa \
                                    --gene_annotations_gtf ensembl_files/Homo_sapiens.GRCh38.104.gtf \
                                    --annotation_field_name "" \
                                    --output_proteindb var_peptides.fa

Terminal output:

2021-10-12 14:24:44,956 - INFO - Committing changes: 3146000 features
2021-10-12 14:24:46,059 - INFO - Populating features table and first-order relations: 3146131 features
2021-10-12 14:24:46,063 - INFO - Creating relations(parent) index
2021-10-12 14:24:50,761 - INFO - Creating relations(child) index
2021-10-12 14:24:53,919 - INFO - Creating features(featuretype) index
2021-10-12 14:24:58,641 - INFO - Creating features (seqid, start, end) index
2021-10-12 14:25:01,295 - INFO - Creating features (seqid, start, end, strand) index
2021-10-12 14:25:04,036 - INFO - Running ANALYSE features

But there are no records in the output fasta file. Any idea?

Thanks

enriquea commented 3 years ago

Here are some lines of the input VCF

header (truncated):

##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=PASS,Description="Site contains at least one allele that passes filters">
##FILTER=<ID=VQSRTrancheINDEL99.50to99.60,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -1.4424 <= x < -1.1563">
##FILTER=<ID=VQSRTrancheINDEL99.60to99.70,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -1.8923 <= x < -1.4424">
##FILTER=<ID=VQSRTrancheINDEL99.70to99.80,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -2.5164 <= x < -1.8923">
##FILTER=<ID=VQSRTrancheINDEL99.80to99.90,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -4.0724 <= x < -2.5164">
##FILTER=<ID=VQSRTrancheINDEL99.90to100.00+,Description="Truth sensitivity tranche level for INDEL model at VQS Lod < -28502.0108">
##FILTER=<ID=VQSRTrancheINDEL99.90to100.00,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -28502.0108 <= x < -4.0724">
##FILTER=<ID=VQSRTrancheSNP99.70to99.80,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -3.8651 <= x < -3.0707">
##FILTER=<ID=VQSRTrancheSNP99.80to99.90,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -6.9748 <= x < -3.8651">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -38342.3807">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -38342.3807 <= x < -6.9748">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">

variants (truncated):

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1    65797   rs759525926     T       C       2670.65 VQSRTrancheSNP99.80to99.90      AC=14;AF=0.004871;AN=2874;AS_BaseQRankSum=0;AS_FS=0;AS_InbreedingCoeff=-0.053;AS_MQ=27;AS_MQRankSum=0;AS_QD=28.41;AS_ReadPosRankSum=0;AS_SOR=0.105;BaseQRankSum=0;DB;DP=6789;ExcessHet=0.0009;FS=0;InbreedingCoeff=0.3435;MLEAC=559;MLEAF=0.195;MQ=27;MQRankSum=0;QD=28.41;ReadPosRankSum=0.46;SOR=0.142;VQSLOD=-4.084e+00;culprit=MQ;CSQ=C|upstream_gene_variant|MODIFIER|OR4F5|ENSG00000186092|Transcript|ENST00000335137|protein_coding||||||||||rs759525926|1|3258|1||SNV|HGNC|HGNC:14825|YES|||P1|CCDS30547.1|ENSP00000334393|Q8NH21||UPI0000041BC1|||||||||||||||||||||||||||||||||||8.458|0.700325|||||,C|downstream_gene_variant|MODIFIER|OR4G11P|ENSG00000240361|Transcript|ENST00000492842|transcribed_unprocessed_pseudogene||||||||||rs759525926|1|1910|1||SNV|HGNC|HGNC:31276||||||||||||||||||||||||||||||||||||||||||||8.458|0.700325|||||,C|intron_variant|MODIFIER|OR4F5|ENSG00000186092|Transcript|ENST00000641515|protein_coding||2/2|ENST00000641515.2:c.9+224T>C|||||||rs759525926|1||1||SNV|HGNC|HGNC:14825||||||ENSP00000493376||A0A2U3U0J3|UPI000D1938F0|||||||||||||||||||||||||||||||||||8.458|0.700325|||||,C|downstream_gene_variant|MODIFIER|OR4G11P|ENSG00000240361|Transcript|ENST00000642116|processed_transcript||||||||||rs759525926|1|1681|1||SNV|HGNC|HGNC:31276|YES|||||||||||||||||||||||||||||||||||||||||||8.458|0.700325|||||
chr1    65851   .       C       T       522.75  PASS    AC=5;AF=0.0009335;AN=5356;AS_BaseQRankSum=0;AS_FS=0;AS_InbreedingCoeff=0.0361;AS_MQ=26.59;AS_MQRankSum=-1.2;AS_QD=23.77;AS_ReadPosRankSum=0;AS_SOR=1.085;BaseQRankSum=0;DP=28101;ExcessHet=0;FS=0;InbreedingCoeff=0.362;MLEAC=65;MLEAF=0.012;MQ=26.76;MQRankSum=-1.15;QD=23.76;ReadPosRankSum=0.086;SOR=1.085;VQSLOD=-2.748e+00;culprit=MQ;CSQ=T|upstream_gene_variant|MODIFIER|OR4F5|ENSG00000186092|Transcript|ENST00000335137|protein_coding||||||||||rs1476411183|1|3204|1||SNV|HGNC|HGNC:14825|YES|||P1|CCDS30547.1|ENSP00000334393|Q8NH21||UPI0000041BC1|||||||||||||||||||||||||||||||||||3.808|0.258882|||||,T|downstream_gene_variant|MODIFIER|OR4G11P|ENSG00000240361|Transcript|ENST00000492842|transcribed_unprocessed_pseudogene||||||||||rs1476411183|1|1964|1||SNV|HGNC|HGNC:31276||||||||||||||||||||||||||||||||||||||||||||3.808|0.258882|||||,T|intron_variant|MODIFIER|OR4F5|ENSG00000186092|Transcript|ENST00000641515|protein_coding||2/2|ENST00000641515.2:c.9+278C>T|||||||rs1476411183|1||1||SNV|HGNC|HGNC:14825||||||ENSP00000493376||A0A2U3U0J3|UPI000D1938F0|||||||||||||||||||||||||||||||||||3.808|0.258882|||||,T|downstream_gene_variant|MODIFIER|OR4G11P|ENSG00000240361|Transcript|ENST00000642116|processed_transcript||||||||||rs1476411183|1|1735|1||SNV|HGNC|HGNC:31276|YES|||||||||||||||||||||||||||||||||||||||||||3.808|0.258882|||||

ypriverol commented 3 years ago

This VCF is now giving errors.

husensofteng commented 3 years ago

@enriquea please also specify the transcript ID index parameter as shown below. it seems in your VCF the transcript is found at index 6 rather than the third index which is the default value.

python pypgatk_cli.py vcf-to-proteindb --vcf vcf_vep/veped_snp_indel.sorted.vcf \
                                    --input_fasta ensembl_files/Homo_sapiens.GRCh38.cdna.all.fa \
                                    --gene_annotations_gtf ensembl_files/Homo_sapiens.GRCh38.104.gtf \
                                    --annotation_field_name "CSQ" \
                                    --transcript_index 6 \
                                    --output_proteindb var_peptides.fa

husensofteng commented 3 years ago

I have now pushed commit 2b59d7d that helps solve this issue to read the transcript and consequence indices from the INFO field in the VCF header. So there is no need to specify the transcript_index and consequence_index parameters anymore.

bigbio / py-pgatk

Generating protein DB from custom VCF fails #48