Villen-Lab / pyAscore

A Python package for fast post-translational modification localization, powered by Cython.
https://pyascore.readthedocs.io/
MIT License

Update PSM extraction from mzIdentML files #30

AnthonyOfSeattle closed 2 years ago

AnthonyOfSeattle commented 2 years ago

This PR addresses the issue raised in #29. The root problem is that mzIdentML files encode the origin spectrum for each PSM as a string. For Comet and MS-GF+, this string includes an entry of the form "scan=#", but it can also include a lot of other metadata. I had originally assumed that the scan number was the only metadata present, which I now realize was incorrect. The updated regex in the mzIdentML parser handles the case where there is an explicit "scan" entry, and when that entry is not present it falls back to taking the first number in the string.
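To make the two-step lookup concrete, here is a minimal sketch of the idea. It is not the code from the parser: the function name `extract_scan`, both patterns, and the example strings are assumptions chosen for illustration.

```python
import re

# Hypothetical patterns; the actual regex used in the parser may differ.
SCAN_PATTERN = re.compile(r"scan=(\d+)")
FIRST_NUMBER_PATTERN = re.compile(r"\d+")


def extract_scan(spectrum_id):
    """Return the scan number encoded in a spectrum ID string.

    Prefer an explicit "scan=#" entry; otherwise fall back to the
    first number that appears anywhere in the string.
    """
    match = SCAN_PATTERN.search(spectrum_id)
    if match:
        return int(match.group(1))

    match = FIRST_NUMBER_PATTERN.search(spectrum_id)
    if match:
        return int(match.group(0))

    raise ValueError(f"No scan number found in {spectrum_id!r}")
```

The order of the two checks matters. For a string such as "controllerType=0 controllerNumber=1 scan=1234", taking the first number alone would give 0, while checking for the explicit "scan" entry first gives 1234.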

While fixing this initial bug, I also found a second one: some mzIdentML files do not include the modified residue in the entry for peptide modifications. The parser searched for this key and skipped modification extraction if even one modification was missing it. With this PR, when the modified residue is not present, the residue is taken from the unmodified peptide sequence instead.
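The fallback can be sketched roughly as follows. This is an illustration and not the parser's actual code: the function name `modified_residue` and the dictionary keys `"residues"` and `"location"` are assumptions, and the sketch assumes locations are 1-based with 0 and length + 1 marking the N- and C-terminus, which it maps to the first and last residue.

```python
def modified_residue(peptide_sequence, modification):
    """Return the residue that a modification entry refers to.

    `modification` is a hypothetical dict with a 1-based "location" and
    an optional "residues" value naming the modified residue.
    """
    residues = modification.get("residues")
    if residues:
        # The file states the modified residue explicitly, so use it.
        return residues[0]

    # Otherwise look the residue up in the unmodified peptide sequence.
    # Terminal locations (0 and len + 1) are clamped onto the sequence.
    position = min(max(modification["location"], 1), len(peptide_sequence))
    return peptide_sequence[position - 1]


# The entry has no "residues" key, so the residue comes from the sequence.
print(modified_residue("PEPTIDEK", {"location": 4}))  # T
```

Resolving each entry independently means one entry without the key no longer prevents the other modifications from being extracted.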

For both of these issues, I updated the mzIdentML in the test/example_input folder to include currently known edge cases. This should serve as a sufficient test.