Closed enriquea closed 3 years ago
Here are some lines of the input VCF
header (truncated):
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=PASS,Description="Site contains at least one allele that passes filters">
##FILTER=<ID=VQSRTrancheINDEL99.50to99.60,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -1.4424 <= x < -1.1563">
##FILTER=<ID=VQSRTrancheINDEL99.60to99.70,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -1.8923 <= x < -1.4424">
##FILTER=<ID=VQSRTrancheINDEL99.70to99.80,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -2.5164 <= x < -1.8923">
##FILTER=<ID=VQSRTrancheINDEL99.80to99.90,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -4.0724 <= x < -2.5164">
##FILTER=<ID=VQSRTrancheINDEL99.90to100.00+,Description="Truth sensitivity tranche level for INDEL model at VQS Lod < -28502.0108">
##FILTER=<ID=VQSRTrancheINDEL99.90to100.00,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -28502.0108 <= x < -4.0724">
##FILTER=<ID=VQSRTrancheSNP99.70to99.80,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -3.8651 <= x < -3.0707">
##FILTER=<ID=VQSRTrancheSNP99.80to99.90,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -6.9748 <= x < -3.8651">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -38342.3807">
##FILTER=<ID=VQSRTrancheSNP99.90to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -38342.3807 <= x < -6.9748">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
variants (truncated):
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 65797 rs759525926 T C 2670.65 VQSRTrancheSNP99.80to99.90 AC=14;AF=0.004871;AN=2874;AS_BaseQRankSum=0;AS_FS=0;AS_InbreedingCoeff=-0.053;AS_MQ=27;AS_MQRankSum=0;AS_QD=28.41;AS_ReadPosRankSum=0;AS_SOR=0.105;BaseQRankSum=0;DB;DP=6789;ExcessHet=0.0009;FS=0;InbreedingCoeff=0.3435;MLEAC=559;MLEAF=0.195;MQ=27;MQRankSum=0;QD=28.41;ReadPosRankSum=0.46;SOR=0.142;VQSLOD=-4.084e+00;culprit=MQ;CSQ=C|upstream_gene_variant|MODIFIER|OR4F5|ENSG00000186092|Transcript|ENST00000335137|protein_coding||||||||||rs759525926|1|3258|1||SNV|HGNC|HGNC:14825|YES|||P1|CCDS30547.1|ENSP00000334393|Q8NH21||UPI0000041BC1|||||||||||||||||||||||||||||||||||8.458|0.700325|||||,C|downstream_gene_variant|MODIFIER|OR4G11P|ENSG00000240361|Transcript|ENST00000492842|transcribed_unprocessed_pseudogene||||||||||rs759525926|1|1910|1||SNV|HGNC|HGNC:31276||||||||||||||||||||||||||||||||||||||||||||8.458|0.700325|||||,C|intron_variant|MODIFIER|OR4F5|ENSG00000186092|Transcript|ENST00000641515|protein_coding||2/2|ENST00000641515.2:c.9+224T>C|||||||rs759525926|1||1||SNV|HGNC|HGNC:14825||||||ENSP00000493376||A0A2U3U0J3|UPI000D1938F0|||||||||||||||||||||||||||||||||||8.458|0.700325|||||,C|downstream_gene_variant|MODIFIER|OR4G11P|ENSG00000240361|Transcript|ENST00000642116|processed_transcript||||||||||rs759525926|1|1681|1||SNV|HGNC|HGNC:31276|YES|||||||||||||||||||||||||||||||||||||||||||8.458|0.700325|||||
chr1 65851 . C T 522.75 PASS AC=5;AF=0.0009335;AN=5356;AS_BaseQRankSum=0;AS_FS=0;AS_InbreedingCoeff=0.0361;AS_MQ=26.59;AS_MQRankSum=-1.2;AS_QD=23.77;AS_ReadPosRankSum=0;AS_SOR=1.085;BaseQRankSum=0;DP=28101;ExcessHet=0;FS=0;InbreedingCoeff=0.362;MLEAC=65;MLEAF=0.012;MQ=26.76;MQRankSum=-1.15;QD=23.76;ReadPosRankSum=0.086;SOR=1.085;VQSLOD=-2.748e+00;culprit=MQ;CSQ=T|upstream_gene_variant|MODIFIER|OR4F5|ENSG00000186092|Transcript|ENST00000335137|protein_coding||||||||||rs1476411183|1|3204|1||SNV|HGNC|HGNC:14825|YES|||P1|CCDS30547.1|ENSP00000334393|Q8NH21||UPI0000041BC1|||||||||||||||||||||||||||||||||||3.808|0.258882|||||,T|downstream_gene_variant|MODIFIER|OR4G11P|ENSG00000240361|Transcript|ENST00000492842|transcribed_unprocessed_pseudogene||||||||||rs1476411183|1|1964|1||SNV|HGNC|HGNC:31276||||||||||||||||||||||||||||||||||||||||||||3.808|0.258882|||||,T|intron_variant|MODIFIER|OR4F5|ENSG00000186092|Transcript|ENST00000641515|protein_coding||2/2|ENST00000641515.2:c.9+278C>T|||||||rs1476411183|1||1||SNV|HGNC|HGNC:14825||||||ENSP00000493376||A0A2U3U0J3|UPI000D1938F0|||||||||||||||||||||||||||||||||||3.808|0.258882|||||,T|downstream_gene_variant|MODIFIER|OR4G11P|ENSG00000240361|Transcript|ENST00000642116|processed_transcript||||||||||rs1476411183|1|1735|1||SNV|HGNC|HGNC:31276|YES|||||||||||||||||||||||||||||||||||||||||||3.808|0.258882|||||
This VCF is now giving errors.
@enriquea please also specify the transcript ID index parameter as shown below. it seems in your VCF the transcript is found at index 6 rather than the third index which is the default value.
python pypgatk_cli.py vcf-to-proteindb --vcf vcf_vep/veped_snp_indel.sorted.vcf \
--input_fasta ensembl_files/Homo_sapiens.GRCh38.cdna.all.fa \
--gene_annotations_gtf ensembl_files/Homo_sapiens.GRCh38.104.gtf \
--annotation_field_name "CSQ" \
--transcript_index 6 \
--output_proteindb var_peptides.fa
I have now pushed commit 2b59d7d that helps solve this issue to read the transcript and consequence indices from the INFO field in the VCF header. So there is no need to specify the transcript_index and consequence_index parameters anymore.
Hi @ypriverol and @husensofteng ,
I followed the example in the documentation (https://pgatk.readthedocs.io/en/latest/pypgatk.html#variants-vcf-to-protein-sequences) and tried to generate a protein DB from a custom VEP-annotated VCF. The command exit without error, however, the output is empty.
Here is what I'm running:
Terminal output:
But there are no records in the output fasta file. Any idea?
Thanks