Closed taylorreiter closed 2 months ago
Woops I meant to include this link:
My vote would be for option (3), which I think should only require modifying utils.verify_translation.
But I have to say I'm a bit confused why this error is happening in the first place, because from brief experimentation, it looks like Bio.Seq.Seq.translate does translate Ns to Xs, e.g. this:
Seq('ATGNNN').translate()
outputs
Seq('MX')
Are we sure that the error raised when verify_translation returns false is due to Ns and not to something else (e.g. some alternative codon that orfipy uses but that Seq.translate doesn't use)?
I shall reinvestigate and report back! In the meantime thank you for voting on option 3. If this ends up being the culprit, I'll pursue a fix with that path.
Update: It is caused by the N's. Using our example sequence, the nucleotides get translated into:
VLYLLLWRMDELRMGTLVGVDKYGNKYYEDNRFFFGRNRWVEYADYYYFDYDGSQVXXNGMAGYITRLMCRQPRLIFLSTSGLHHIQRT
The peptide sequence is:
VLYLLLWRMDELRMGTLVGVDKYGNKYYEDNRFFFGRNRWVEYADYYYFDYDGSQXXXNGMAGYITRLMCRQPRLIFLSTSGLHHIQRT
You can see the only difference is that the translated nucleotide sequence has VXX while the peptide sequence has XXX. I think BioPython is being slightly clever here: when there is an N in the third position of a codon but, no matter what that N stands for, you'd get a V, it translates the codon to V. In contrast, orfipy is being conservative; if there is any N in a codon, it translates it to an X.
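To make the difference concrete, here is a minimal pure-Python sketch of the two strategies. It only illustrates the idea and is not the actual code of either tool: the function names are made up, the table is the standard genetic code, and only N is handled as an ambiguity symbol.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon_resolving(codon):
    """Hypothetical 'clever' strategy: expand each N to all four bases and
    return the amino acid if every expansion agrees, otherwise X."""
    options = [BASES if base == "N" else base for base in codon.upper()]
    amino_acids = {CODON_TABLE["".join(c)] for c in product(*options)}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


def translate_codon_conservative(codon):
    """Hypothetical 'conservative' strategy: any N in the codon gives X."""
    codon = codon.upper()
    return "X" if "N" in codon else CODON_TABLE[codon]


# GTA, GTC, GTG and GTT all encode valine, so the two strategies disagree on GTN.
print(translate_codon_resolving("GTN"), translate_codon_conservative("GTN"))
```

A codon like GTN is exactly where the two disagree, which would produce a V versus X mismatch like the one above.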
I'm going to brainstorm the best solution here. I think it might actually be to change verify_translation so that if there is a mismatch between an X and an N, that's ok. Going to think more on this before implementing a solution.
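A rough sketch of what such a tolerant comparison could look like. The function name and signature are hypothetical and are not the real utils.verify_translation; it assumes the nucleotide sequence is already in frame and that the translated string was produced from it.

```python
def translations_match_allowing_unknowns(nucleotides, translated, peptide):
    """Hypothetical check: True if the peptide matches the translation,
    tolerating positions where the peptide has X and the codon contains N."""
    if len(translated) != len(peptide):
        return False
    for i, (translated_aa, peptide_aa) in enumerate(zip(translated, peptide)):
        if translated_aa == peptide_aa:
            continue
        codon = nucleotides[3 * i : 3 * i + 3].upper()
        # Accept the mismatch only when it can be explained by an ambiguous codon.
        if peptide_aa == "X" and "N" in codon:
            continue
        return False
    return True


# GTN is translated to V by one tool and to X by the other; that is tolerated.
print(translations_match_allowing_unknowns("ATGGTNNNN", "MVX", "MXX"))
# A mismatch that is not explained by an N is still rejected.
print(translations_match_allowing_unknowns("ATGGTNNNN", "MVX", "MAX"))
```

Requiring both an X in the peptide and an N in the codon keeps the check strict for every other kind of disagreement.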
Description of the bug
The python script extract_plmutils_nucleotide_sequences.py extracts and validates nucleotide sequences for peptides predicted by plm-utils.
With the recent update to plm-utils, I'm running into a new bug where the input sequences have "N"s in them to mark nucleotides whose identity we're not sure about. orfipy & ESM accept these and they go through plmutils. However, when they get to the peptide -> nucleotide extraction step, the sequences don't match because we haven't recorded that unknown codons (anything that contains N) should make unknown amino acids (X).
I re-ran this script on ~30 TSA transcriptomes. I downloaded these transcriptomes and then predicted open reading frames using transdecoder. For 9 of them, peptigate doesn't finish because of this bug.
I'm wondering what the best path forward is. I can think of two patches:
Command used and terminal output
Relevant files
Peptide sequence:
Nucleotide sequence:
System information
Conda env: