Closed enriquea closed 3 years ago
Hi @enriquea, could you post your command?
Hi @enriquea, could you post your command?
Hi, sure!
python pypgatk/pypgatk_cli.py vcf-to-proteindb \
--config_file pypgatk/config/ensembl_config.yaml \
--vcf "$vcf_i" \
--input_fasta "$database_dir"/transcripts_GRCm39.fa \
--gene_annotations_gtf "$database_dir"/Mus_musculus.GRCm39.104.gtf \
--output_proteindb out.fa \
--var_prefix var \
--transcript_index 6 \
--annotation_field_name 'CSQ'
I think it is related to ambiguous nucleotide letters X/N in the sequence file that is used for the translation. From the file names in your command, I can see that you are using a newer reference assembly than the VCF file (GRCm39 vs GRCm38/mm10), may be that has something causing the issue.
I think it is related to ambiguous nucleotide letters X/N in the sequence file that is used for the translation. From the file names in your command, I can see that you are using a newer reference assembly than the VCF file (GRCm39 vs GRCm38/mm10), may be that has something causing the issue.
I've lifted over the data (GRCm38 -> GRCm39) to keep it consistent with a previous processed VCF.
Below the header (ignore please the QUAL field)
##fileformat=VCFv4.2
##hailversion=0.2.34-914bd8a10ca2
##contig=<ID=1,length=195154279,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=10,length=130530862,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=11,length=121973369,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=12,length=120092757,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=13,length=120883175,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=14,length=125139656,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=15,length=104073951,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=16,length=98008968,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=17,length=95294699,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=18,length=90720763,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=19,length=61420004,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=2,length=181755017,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=3,length=159745316,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=4,length=156860686,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=5,length=151758149,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=6,length=149588044,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=7,length=144995196,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=8,length=130127694,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=9,length=124359700,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456210.1,length=169725,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456211.1,length=241735,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456212.1,length=153618,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456219.1,length=175968,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456221.1,length=206961,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456233.2,length=559103,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456239.1,length=40056,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456354.1,length=195993,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456359.1,length=22974,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456360.1,length=31704,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456366.1,length=47073,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456367.1,length=42057,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456368.1,length=20208,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456370.1,length=26764,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456372.1,length=28664,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456378.1,length=31602,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456379.1,length=72385,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456381.1,length=25871,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456382.1,length=23158,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456383.1,length=38659,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456385.1,length=35240,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456387.1,length=24685,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456389.1,length=28772,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456390.1,length=24668,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456392.1,length=23629,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456394.1,length=24323,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456396.1,length=21240,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584295.1,length=1976,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584296.1,length=199368,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584297.1,length=205776,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584298.1,length=184189,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584299.1,length=953012,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584300.1,length=182347,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584301.1,length=259875,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584302.1,length=155838,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584303.1,length=158099,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584304.1,length=114452,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=MT,length=16299,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=MU069434.1,length=8412,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=MU069435.1,length=31129,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=X,length=169476592,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=Y,length=91455967,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##liftOverProgram=CrossMap(https://sourceforge.net/projects/crossmap/)
##liftOverFile=/net/isilonP/public/ro/ensweb-data/latest/tools/www/e104/assembly_converter/mus_musculus/GRCm38_to_GRCm39.chain.gz
##new_reference_genome=/net/isilonP/public/ro/ensweb-data/latest/tools/www/e104/assembly_converter/mus_musculus/Mus_musculus.GRCm39.dna.toplevel.fa
##liftOverTime=October21,2021
##VEP="v104" time="2021-10-21 11:10:43" cache="/net/isilonP/public/ro/ensweb-data/latest/tools/www/e104/vep/cache/mus_musculus/104_GRCm39" db="mus_musculus_core_104_39@hh-mysql-ens-species-web-1" assembly="GRCm39" dbSNP="150" gencode="GENCODE M27" regbuild="1.0" sift="sift"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|SIFT|CLIN_SIG|SOMATIC|PHENO|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 4416823 mut12762 T A -10.00 PASS CSQ=A|stop_gained|HIGH|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||4415|4288|1430|R/*|Aga/Tga|||-1||MGI||||1||||||||||,A|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1 4417755 mut12756 A T -10.00 PASS CSQ=T|stop_gained|HIGH|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||3483|3356|1119|L/*|tTg/tAg|||-1||MGI||||1||||||||||,T|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1 4417849 mut12760 T A -10.00 PASS CSQ=A|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||3389|3262|1088|N/Y|Aat/Tat|||-1||MGI||||1||deleterious(0)||||||||,A|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1 4418162 mut12761 T A -10.00 PASS CSQ=A|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||3076|2949|983|E/D|gaA/gaT|||-1||MGI||||1||tolerated(0.09)||||||||,A|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1 4418734 mut12755 A C -10.00 PASS CSQ=C|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||2504|2377|793|C/G|Tgt/Ggt|||-1||MGI||||1||tolerated(0.26)||||||||,C|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1 4418964 mut12757 A T -10.00 PASS CSQ=T|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||2274|2147|716|L/H|cTt/cAt|||-1||MGI||||1||deleterious(0.01)||||||||,T|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1 4419948 mut12759 G A -10.00 PASS CSQ=A|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||1290|1163|388|S/L|tCg/tTg|||-1||MGI||||1||tolerated(0.16)||||||||,A|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1 4423018 mut12758 T C -10.00 PASS CSQ=C|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|2/4||||158|31|11|M/V|Atg/Gtg|||-1||MGI||||1||tolerated(0.96)||||||||,C|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding|2/30||||115|61|21|M/V|Atg/Gtg|||-1||MGI||||5|P1|tolerated_low_confidence(1)||||||||
Actually, the problem seems to be with the VCF that contains '-' instead of valid nucleotides in ref/allele columns for some of the variants. Could you please filter them out and try again, e.g.:
awk 'BEGIN{FS=OFS="\t"}{if($4!="-" && $5!="-") print}' vars.vcf > vars_filtered.vcf
Actually, the problem seems to be with the VCF that contains '-' instead of valid nucleotides in ref/allele columns for some of the variants. Could you please filter them out and try again, e.g.:
awk 'BEGIN{FS=OFS="\t"}{if($4!="-" && $5!="-") print}' vars.vcf > vars_filtered.vcf
Thanks, I'll try and report back...
Actually, the problem seems to be with the VCF that contains '-' instead of valid nucleotides in ref/allele columns for some of the variants. Could you please filter them out and try again, e.g.:
awk 'BEGIN{FS=OFS="\t"}{if($4!="-" && $5!="-") print}' vars.vcf > vars_filtered.vcf
Thanks, I'll try and report back...
Ok, this actually made the trick. It removes some frameshift variants but everything runs perfect now!
Could be a good idea maybe to report these problematic variants rather than throwing an error?
Thanks for your help!
I have now implemented bb6700d to only allow variants that have proper nucleotide letters (ACGT) in the REF and ALT columns to be processed.
Hi @ypriverol and @husensofteng ,
I'm trying to generate a protein fasta database from the VCF at
https://github.com/ypriverol/pgdb-manuscript/blob/main/Nassar_et_al.vcf
.After annotating the VCF using the VEP tool, I ran the
vcf-to-proteindb
cli tool and I'm getting the following error:Any idea related to this issue?
Thanks
PD: If I run the tool restricted to
missense_variant
andcoding_sequence_variant
I get the database fine. So, I'm guessing the error is related to a particular consequence/transcript, but not 100% sure.