Orfipy outputs ORFs that contain the amino acid X, which breaks nucleotide conversion

taylorreiter commented 2 months ago

Description of the bug

The python script extract_plmutils_nucleotide_sequences.py extracts and validates nucleotide sequences for peptides predicted by plm-utils.

With the recent update to plm-utils, I'm running into a new bug where the input sequences have "N"s in them to demarcate nucleotides that we're not sure about their sequence. orfipy & ESM accept these and they go through plmutils. However, when they get to the peptide -> nucleotide extraction step, the sequences don't match because we haven't recorded that unknown codons (anything that contains N) should make unknown amino acids (X).

I re-ran this script on a ~30 of TSA transcriptomes. I downloaded these transcriptomes and then predicted open reading frames using transdecoder. For 9 of them, peptigate doesn't finish because of this bug.

I'm wondering what the best path forward is. I can think of two patches:

filter out anything with an N from the start. This might remove some interesting candidates where the peptide wouldn't have an N in it anyway
filter out any peptide with an X after plmutils prediction is done but before nucleotide prediction is done. This could still eliminate some interesting peptides, but from a follow up perspective I don't know what someone would do with a peptide sequence with a bunch of XXXs. I guess you could BLAST it and try and find something reasonable to back fill the XXXs with.
Allow all Xs through. Adapt the nucleotide extraction code to have anything with an N nucleotide match and X amino acid.

Command used and terminal output

$ python scripts/extract_plmutils_nucleotide_sequences.py \
--nucleotide_fasta_file outputs/acc/sORF/filtering/contigs_with_no_predicted_orf_and_no_uniref50_blast_hit.fna \
--protein_peptides_fasta_file outputs/acc/sORF/plmutils/plmutils_peptides.faa \
--nucleotide_peptides_output_file outputs/acc/sORF/plmutils/plmutils_peptides.ffn

Traceback (most recent call last):
  File "/home/ubuntu/peptigate/scripts/extract_plmutils_nucleotide_sequences.py", line 101, in <module>
    main()
  File "/home/ubuntu/peptigate/scripts/extract_plmutils_nucleotide_sequences.py", line 93, in main
    extract_nucleotide_peptide_sequences(
  File "/home/ubuntu/peptigate/scripts/extract_plmutils_nucleotide_sequences.py", line 59, in extract_nucleotide_peptide_sequences
    raise ValueError(
ValueError: Warning: Translation mismatch for acc03.1. Check the reading frame and positions.

Relevant files

Peptide sequence:

>acc03.1 acc03.1_ORF.4 [150-417](-) type:complete length:267 frame:-1 start:GTG stop:TGA
VLYLLLWRMDELRMGTLVGVDKYGNKYYEDNRFFFGRNRWVEYADYYYFDYDGSQXXXNG
MAGYITRLMCRQPRLIFLSTSGLHHIQRT

Nucleotide sequence:

>acc03.1 TSA: transcribed RNA sequence
GTTTTATATTTACAGGCTCTCTATTCTCACAATACAGCAGTGGGGTCAACTATGCGACCA
CTCACTTCCTCGCTGTCTTGGGTGGCACCCATGGCTGAATCTTGGGTGGGACAGTGCTGT
ATGGCATGTACGCGTCCTTGGTGCCGCTCATGTTCTCTGTATGTGGTGCAAGCCACTTGT
ACTGAGGAAGATCAGCCTTGGTTGGCGGCACATCAGTCGTGTAATGTAACCAGCCATGCC
ATTCGNNNNNACTTGGCTTCCGTCGTAATCAAAGTAGTAGTAGTCAGCGTACTCAACCCA
CCGGTTGCGTCCGAAGAAGAATCTGTTGTCCTCGTAGTACTTGTTTCCATACTTGTCGAC
GCCTACCAGCGTCCCCATTCGCAACTCGTCCATTCTCCACAACAGCAAGTAAAGCACACG
CCTGGGGCTTCAAGAATTGCCTGGAAAAATGCAGCCAGTACACTACGCAAAGGCACAACA
AATTGT

System information

Conda env:

channels:
   - conda-forge
   - bioconda
   - defaults
dependencies:
   - biopython=1.83

taylorreiter commented 2 months ago

Woops I meant to include this link:

Orfipy permissive of unknown nucleotides: https://github.com/urmi-21/orfipy/blob/1c6aa488bb16a2d210717722657021100fb2a042/orfipy/orfipy_core.pyx#L559

keithchev commented 2 months ago

My vote would be for option (3) which I think should only require modifying utils.verify_translation.

But I have to say I'm a bit confused why this error is happening in the first place, because from brief experimentation, it looks like Bio.Seq.Seq.translate does translate Ns to Xs e.g. this:

Seq('ATGNNN').translate()

outputs

Seq('MX')

Are we sure that the error raised when verify_translation returns false is due to Ns and not to something else? (e.g. some alternative codon that orfipy uses but that Seq.translate doesn't use).

taylorreiter commented 2 months ago

I shall reinvestigate and report back! In the meantime thank you for voting on option 3. If this ends up being the culprit, I'll pursue a fix with that path.

taylorreiter commented 2 months ago

Update: It is caused by the N's. Using our example sequence, the nucleotides get translated into:

VLYLLLWRMDELRMGTLVGVDKYGNKYYEDNRFFFGRNRWVEYADYYYFDYDGSQVXXNGMAGYITRLMCRQPRLIFLSTSGLHHIQRT

The peptide sequence is:

VLYLLLWRMDELRMGTLVGVDKYGNKYYEDNRFFFGRNRWVEYADYYYFDYDGSQXXXNGMAGYITRLMCRQPRLIFLSTSGLHHIQRT

You can see the only difference is the nucleotide sequence has VXX while the amino acid one has XXX. i think BioPython is being slightly clever here, where there is an N in the third space of the translation table but no matter what that N stands for, you'll get a V. In contrast, orfipy is being conservative; if there is any N, it translates to an X.

I'm going to brainstorm the best solution here. I think it might actually be to change verify translation so that if there is a mismatch between an X and and N, that's ok. Going to think more on this before implementing a solution.

Arcadia-Science / peptigate