CapraLab / pdbmap


Handle Selenomethionine #19

Closed ChrisMoth closed 2 years ago

ChrisMoth commented 3 years ago

Selenomethionine, code MSE, letter U, is outside the core IUPAC 20 amino acids in Biopython.

However, we sometimes see U both in UniParc transcript sequences and in ENSEMBL transcripts. Here is an example:

$ transcript_to_AAseq.pl ENST00000611653 | grep U
MCASRDDWRCARSMHEFSAKDIDGHMVNLDKYRGFVCIVTNVASQUGKTEVNYTQLVDLHARYAECGLRILAFPCNQFGKQEPGSNEEIKEFAAGYNVKFDMFSKICVNGDDAHPLWKWMKIQPKGKGILGNAIKWNFTKFLIDKNGCVVKRYGPMEEPLVIEKDLPHYF

The fix is to allow U in the transcript sequences, and match to MSE on the structural side.
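A minimal sketch of what that fix could look like, in plain Python with no Biopython dependency. The names `ALLOWED_LETTERS`, `THREE_TO_ONE`, `is_valid_transcript`, and `residue_matches` are hypothetical and are not taken from the pdbmap code. The MSE-to-U pairing follows the proposal above and is an assumption of this sketch; IUPAC assigns U to selenocysteine (PDB residue SEC), so SEC is listed as well.

```python
# The twenty standard amino-acid letters, plus U as proposed in the issue.
STANDARD_LETTERS = "ACDEFGHIKLMNPQRSTVWY"
ALLOWED_LETTERS = set(STANDARD_LETTERS) | {"U"}

# Hypothetical lookup from structure residue names to transcript letters.
THREE_TO_ONE = {
    "ALA": "A", "CYS": "C", "ASP": "D", "GLU": "E", "PHE": "F",
    "GLY": "G", "HIS": "H", "ILE": "I", "LYS": "K", "LEU": "L",
    "MET": "M", "ASN": "N", "PRO": "P", "GLN": "Q", "ARG": "R",
    "SER": "S", "THR": "T", "VAL": "V", "TRP": "W", "TYR": "Y",
    "MSE": "U",  # assumption: pairing proposed in this issue
    "SEC": "U",  # selenocysteine, the residue IUPAC assigns to U
}


def is_valid_transcript(seq):
    """Return True if every letter of seq is a standard letter or U."""
    return set(seq.upper()) <= ALLOWED_LETTERS


def residue_matches(transcript_letter, resname):
    """Return True if the structure residue name maps to the transcript letter."""
    return THREE_TO_ONE.get(resname.upper()) == transcript_letter.upper()
```

With a table like this, a fragment such as `VASQUGKT` from the transcript above passes validation, and a U position aligns with a structure residue named MSE or SEC instead of being rejected.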

ChrisMoth commented 3 years ago

Additionally, the pipeline must do a better job of naming H_ hetero atoms generally. This is accomplished by looking more deeply into the mmCIF dictionary than before, down to the mon_nstd flag (the _chem_comp.mon_nstd_flag item).
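One way such a lookup might work is sketched below. It is not the pipeline's actual code. It assumes the mmCIF file has already been read into a dict whose keys map to lists of strings (the shape Biopython's `MMCIF2Dict` produces), and it mimics the Biopython convention of `" "` for standard residues, `"W"` for water, and `"H_"` plus the residue name for other hetero residues. The function names `nonstandard_monomers` and `hetero_field` are hypothetical.

```python
def nonstandard_monomers(cif_dict):
    """Collect component ids whose _chem_comp.mon_nstd_flag marks them nonstandard."""
    ids = cif_dict.get("_chem_comp.id", [])
    flags = cif_dict.get("_chem_comp.mon_nstd_flag", [])
    # 'n' / 'no' mean nonstandard monomer; 'y' / 'yes' mean standard;
    # '.' appears for components such as water that are not monomers.
    return {cid for cid, flag in zip(ids, flags) if flag.lower() in ("n", "no")}


def hetero_field(resname, group_pdb):
    """Return a Biopython-style hetero field for one residue."""
    if group_pdb != "HETATM":
        return " "
    if resname in ("HOH", "WAT"):
        return "W"
    return "H_" + resname


# Illustrative data only, shaped like a parsed _chem_comp category.
sample = {
    "_chem_comp.id": ["ALA", "MSE", "HOH"],
    "_chem_comp.mon_nstd_flag": ["y", "n", "."],
}
nonstandard = nonstandard_monomers(sample)
```

Here `nonstandard` contains only `MSE`, and `hetero_field("MSE", "HETATM")` yields `H_MSE`, so a modified residue can be told apart from an unrelated ligand that also arrives as HETATM.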