bigbio / py-pgatk

Python tools for proteogenomics analysis toolkit
Apache License 2.0
10 stars 11 forks source link

Error: Codon '-CT' is invalid #52

Closed enriquea closed 2 years ago

enriquea commented 2 years ago

Hi @ypriverol and @husensofteng ,

I'm trying to generate a protein fasta database from the VCF at https://github.com/ypriverol/pgdb-manuscript/blob/main/Nassar_et_al.vcf.

After annotating the VCF using the VEP tool, I ran the vcf-to-proteindb cli tool and I'm getting the following error:

Traceback (most recent call last):
  File "/Users/enrique/opt/anaconda3/lib/python3.7/site-packages/Bio/Seq.py", line 2621, in _translate_str
    amino_acids.append(forward_table[codon])
  File "/Users/enrique/opt/anaconda3/lib/python3.7/site-packages/Bio/Data/CodonTable.py", line 402, in __getitem__
    raise KeyError(codon)  # stop codon
KeyError: '-CT'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pypgatk/pypgatk_cli.py", line 49, in <module>
    main()
  File "pypgatk/pypgatk_cli.py", line 45, in main
    cli()
  File "/Users/enrique/opt/anaconda3/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/enrique/opt/anaconda3/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/enrique/opt/anaconda3/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/enrique/opt/anaconda3/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/enrique/opt/anaconda3/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/enrique/opt/anaconda3/lib/python3.7/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/enrique/opt/anaconda3/envs/py-pgatk/lib/python3.7/site-packages/pypgatk-0.0.19-py3.7.egg/pypgatk/commands/vcf_to_proteindb.py", line 71, in vcf_to_proteindb
    ensembl_data_service.vcf_to_proteindb(vcf, input_fasta, gene_annotations_gtf)
  File "/Users/enrique/opt/anaconda3/envs/py-pgatk/lib/python3.7/site-packages/pypgatk-0.0.19-py3.7.egg/pypgatk/ensembl/ensembl.py", line 650, in vcf_to_proteindb
    num_orfs)
  File "/Users/enrique/opt/anaconda3/envs/py-pgatk/lib/python3.7/site-packages/pypgatk-0.0.19-py3.7.egg/pypgatk/ensembl/ensembl.py", line 336, in get_orfs_vcf
    alt_orfs.append(alt_seq[n::].translate(translation_table))
  File "/Users/enrique/opt/anaconda3/lib/python3.7/site-packages/Bio/Seq.py", line 1185, in translate
    cds, gap=gap)
  File "/Users/enrique/opt/anaconda3/lib/python3.7/site-packages/Bio/Seq.py", line 2638, in _translate_str
    "Codon '{0}' is invalid".format(codon))
Bio.Data.CodonTable.TranslationError: Codon '-CT' is invalid

Any idea related to this issue?

Thanks

PD: If I run the tool restricted to missense_variant and coding_sequence_variant I get the database fine. So, I'm guessing the error is related to a particular consequence/transcript, but not 100% sure.

husensofteng commented 2 years ago

Hi @enriquea, could you post your command?

enriquea commented 2 years ago

Hi @enriquea, could you post your command?

Hi, sure!

python pypgatk/pypgatk_cli.py vcf-to-proteindb \
                          --config_file pypgatk/config/ensembl_config.yaml \
                          --vcf  "$vcf_i" \
                          --input_fasta   "$database_dir"/transcripts_GRCm39.fa \
                          --gene_annotations_gtf  "$database_dir"/Mus_musculus.GRCm39.104.gtf \
                          --output_proteindb out.fa \
                          --var_prefix var \
                          --transcript_index 6 \
                          --annotation_field_name 'CSQ'
husensofteng commented 2 years ago

I think it is related to ambiguous nucleotide letters X/N in the sequence file that is used for the translation. From the file names in your command, I can see that you are using a newer reference assembly than the VCF file (GRCm39 vs GRCm38/mm10), may be that has something causing the issue.

enriquea commented 2 years ago

I think it is related to ambiguous nucleotide letters X/N in the sequence file that is used for the translation. From the file names in your command, I can see that you are using a newer reference assembly than the VCF file (GRCm39 vs GRCm38/mm10), may be that has something causing the issue.

I've lifted over the data (GRCm38 -> GRCm39) to keep it consistent with a previous processed VCF.

Below the header (ignore please the QUAL field)

##fileformat=VCFv4.2
##hailversion=0.2.34-914bd8a10ca2
##contig=<ID=1,length=195154279,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=10,length=130530862,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=11,length=121973369,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=12,length=120092757,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=13,length=120883175,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=14,length=125139656,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=15,length=104073951,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=16,length=98008968,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=17,length=95294699,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=18,length=90720763,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=19,length=61420004,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=2,length=181755017,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=3,length=159745316,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=4,length=156860686,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=5,length=151758149,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=6,length=149588044,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=7,length=144995196,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=8,length=130127694,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=9,length=124359700,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456210.1,length=169725,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456211.1,length=241735,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456212.1,length=153618,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456219.1,length=175968,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456221.1,length=206961,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456233.2,length=559103,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456239.1,length=40056,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456354.1,length=195993,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456359.1,length=22974,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456360.1,length=31704,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456366.1,length=47073,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456367.1,length=42057,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456368.1,length=20208,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456370.1,length=26764,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456372.1,length=28664,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456378.1,length=31602,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456379.1,length=72385,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456381.1,length=25871,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456382.1,length=23158,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456383.1,length=38659,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456385.1,length=35240,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456387.1,length=24685,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456389.1,length=28772,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456390.1,length=24668,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456392.1,length=23629,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456394.1,length=24323,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=GL456396.1,length=21240,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584295.1,length=1976,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584296.1,length=199368,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584297.1,length=205776,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584298.1,length=184189,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584299.1,length=953012,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584300.1,length=182347,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584301.1,length=259875,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584302.1,length=155838,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584303.1,length=158099,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=JH584304.1,length=114452,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=MT,length=16299,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=MU069434.1,length=8412,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=MU069435.1,length=31129,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=X,length=169476592,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##contig=<ID=Y,length=91455967,assembly=Mus_musculus.GRCm39.dna.toplevel.fa>
##liftOverProgram=CrossMap(https://sourceforge.net/projects/crossmap/)
##liftOverFile=/net/isilonP/public/ro/ensweb-data/latest/tools/www/e104/assembly_converter/mus_musculus/GRCm38_to_GRCm39.chain.gz
##new_reference_genome=/net/isilonP/public/ro/ensweb-data/latest/tools/www/e104/assembly_converter/mus_musculus/Mus_musculus.GRCm39.dna.toplevel.fa
##liftOverTime=October21,2021
##VEP="v104" time="2021-10-21 11:10:43" cache="/net/isilonP/public/ro/ensweb-data/latest/tools/www/e104/vep/cache/mus_musculus/104_GRCm39" db="mus_musculus_core_104_39@hh-mysql-ens-species-web-1" assembly="GRCm39" dbSNP="150" gencode="GENCODE M27" regbuild="1.0" sift="sift"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|SIFT|CLIN_SIG|SOMATIC|PHENO|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   4416823 mut12762    T   A   -10.00  PASS    CSQ=A|stop_gained|HIGH|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||4415|4288|1430|R/*|Aga/Tga|||-1||MGI||||1||||||||||,A|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1   4417755 mut12756    A   T   -10.00  PASS    CSQ=T|stop_gained|HIGH|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||3483|3356|1119|L/*|tTg/tAg|||-1||MGI||||1||||||||||,T|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1   4417849 mut12760    T   A   -10.00  PASS    CSQ=A|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||3389|3262|1088|N/Y|Aat/Tat|||-1||MGI||||1||deleterious(0)||||||||,A|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1   4418162 mut12761    T   A   -10.00  PASS    CSQ=A|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||3076|2949|983|E/D|gaA/gaT|||-1||MGI||||1||tolerated(0.09)||||||||,A|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1   4418734 mut12755    A   C   -10.00  PASS    CSQ=C|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||2504|2377|793|C/G|Tgt/Ggt|||-1||MGI||||1||tolerated(0.26)||||||||,C|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1   4418964 mut12757    A   T   -10.00  PASS    CSQ=T|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||2274|2147|716|L/H|cTt/cAt|||-1||MGI||||1||deleterious(0.01)||||||||,T|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1   4419948 mut12759    G   A   -10.00  PASS    CSQ=A|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|4/4||||1290|1163|388|S/L|tCg/tTg|||-1||MGI||||1||tolerated(0.16)||||||||,A|intron_variant|MODIFIER|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding||3/29||||||||||-1||MGI||||5|P1|||||||||
1   4423018 mut12758    T   C   -10.00  PASS    CSQ=C|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000027032|protein_coding|2/4||||158|31|11|M/V|Atg/Gtg|||-1||MGI||||1||tolerated(0.96)||||||||,C|missense_variant|MODERATE|Rp1|ENSMUSG00000025900|Transcript|ENSMUST00000208660|protein_coding|2/30||||115|61|21|M/V|Atg/Gtg|||-1||MGI||||5|P1|tolerated_low_confidence(1)||||||||
husensofteng commented 2 years ago

Actually, the problem seems to be with the VCF that contains '-' instead of valid nucleotides in ref/allele columns for some of the variants. Could you please filter them out and try again, e.g.: awk 'BEGIN{FS=OFS="\t"}{if($4!="-" && $5!="-") print}' vars.vcf > vars_filtered.vcf

enriquea commented 2 years ago

Actually, the problem seems to be with the VCF that contains '-' instead of valid nucleotides in ref/allele columns for some of the variants. Could you please filter them out and try again, e.g.: awk 'BEGIN{FS=OFS="\t"}{if($4!="-" && $5!="-") print}' vars.vcf > vars_filtered.vcf

Thanks, I'll try and report back...

enriquea commented 2 years ago

Actually, the problem seems to be with the VCF that contains '-' instead of valid nucleotides in ref/allele columns for some of the variants. Could you please filter them out and try again, e.g.: awk 'BEGIN{FS=OFS="\t"}{if($4!="-" && $5!="-") print}' vars.vcf > vars_filtered.vcf

Thanks, I'll try and report back...

Ok, this actually made the trick. It removes some frameshift variants but everything runs perfect now!

Could be a good idea maybe to report these problematic variants rather than throwing an error?

Thanks for your help!

husensofteng commented 2 years ago

I have now implemented bb6700d to only allow variants that have proper nucleotide letters (ACGT) in the REF and ALT columns to be processed.