Bugfix/multiple static mod set

jspaezp commented 2 years ago

This PR Addresses issues:

https://github.com/bittremieux/spectrum_utils/issues/25 - Adds a warning when setting static modifications twice
https://github.com/bittremieux/spectrum_utils/issues/18 - Completes the implementation of proforma annotations in the package, as defined by passing all tests for compliance with the 2.0 standard (except for one mentioned later)

Currently unsolved:

How should synonyms be handled for glycans (NeuAc is not in the keys, but Neu5Ac is and they are the same....)? (I dont really work with glycans so I dont know how prevalent this issue is)

 assert (proforma.parse('{Glycan:Hex}{Glycan:NeuAc}EMEVNESPEK') ==
         ('EMEVNESPEK', {}))

The parser does not return information on uncertainty, therefore its not possible to get the annotation from the parsed sequence (in some cases), and makes the calculation of precursor masses wrong for any case that has unknown position or labile modifications.
Should a code formatter be implemented for the future of the project (I generaly use black but it makes a lot of changes in this project and I don't want to screw up your "blame")

Let me know what you think. Kindest wishes Sebastian

mwang87 commented 2 years ago

I'm sure @bittremieux will comment and review, but I just wanted to say @jspaezp, glad you're putting these fixes in, really makes the grammar and parser more complete.

bittremieux commented 2 years ago

Great stuff @jspaezp, much appreciated. I'm going to hold off on merging atm though because I've been working on some ProForma fixes as well to fully support the standard (including ambiguous modification locations, etc.). That addresses most of these issues as well, but the backend code has changed quite a bit. It might be easiest to cherry-pick the relevant changes afterwards and include them in the code.

Output from the ProForma parser now uses classes to encode all the information and would look something like this:

{Glycan:HexNAcHex2}[iTRAQ4plex]-EM[UNIMOD:35]EVNES[Obs:+79.966|Phospho|Sulfo]PEK[U:iTRAQ4plex]-[Methyl]

->

[Proteoform(sequence='EMEVNESPEK', modifications=[Modification(mass=527.185019356, position='labile', source=[Glycan(composition=[Monosaccharide(monosaccharide='HexNAc', count=1), Monosaccharide(monosaccharide='Hex', count=2)])], label=None), Modification(mass=144.102063, position='N-term', source=[CvEntry(controlled_vocabulary='UNIMOD', accession='UNIMOD:214', name='iTRAQ4plex')], label=None), Modification(mass=15.994915, position=1, source=[CvEntry(controlled_vocabulary='UNIMOD', accession='UNIMOD:35', name='Oxidation')], label=None), Modification(mass=79.966, position=6, source=[Mass(mass=79.966, controlled_vocabulary=None), CvEntry(controlled_vocabulary='UNIMOD', accession='UNIMOD:21', name='Phospho'), CvEntry(controlled_vocabulary='UNIMOD', accession='UNIMOD:40', name='Sulfo')], label=None), Modification(mass=144.102063, position=9, source=[CvEntry(controlled_vocabulary='UNIMOD', accession='UNIMOD:214', name='iTRAQ4plex')], label=None), Modification(mass=14.01565, position='C-term', source=[CvEntry(controlled_vocabulary='UNIMOD', accession='UNIMOD:34', name='Methyl')], label=None)], charge=None)]

It's a bit more verbose to handle those classes in downstream applications instead of the previous simple modification dict, but it includes all information specified in the ProForma string. That's an important aspect of supporting the full specification imo. And most users probably won't work with that output directly, but rather spectrum_utils.spectrum will consume it.

What do you think about it? What is your use case and does that work with this functionality?

Sorry that these changes aren't in an updated version yet. I've been pretty busy/lazy. 😳

Another question that I've noticed in your code changes is that you would making resolving modification terms in the various controlled vocabularies the default behavior. I did it the other way around, because this adds a delay the first time when the files are retrieved from their online sources and some secret cache directory is used. This might be unexpected behavior. Instead, if the user needs to explicitly specify to match against the CVs, then at least they should be aware of side effects (longer runtime + cache used). Your reasoning is that users probably want to do this most of the time, and hence, to make that the default?

I totally agree with using a code formatter. For a long time I've just been using pep8, but I've started to adopt black in the past year as well. I'll add a black autoformatter GitHub Action to ensure a consistent code style.

jspaezp commented 2 years ago

Hey @bittremieux ! Glad to hear some work was getting on that front, regarding the verbosity of the class based representation vs the list/dict one, in my experience a lot can be alleviated by just implementing a nice repr.

That's an important aspect of supporting the full specification imo. And most users probably won't work with that output directly, but rather spectrum_utils.spectrum will consume it. I can see how I could use it directly as library code in a project I am working on but whatever implementation/output the parser has will work (might need a an hour or so more reading the code but nothing crazy). I do agree that having full support for retaining information is critical (It would also make a lot easier to have a parsed > sequence conversion).

What do you think about it? What is your use case and does that work with this functionality? long story short ...ML and I am parsing/annotating a lot of spectra, and this library already handled that really well. Nontheless I currently parse the sequences on my end and I figured I might as well help making a good parser better rather than making my bad parser "ok"

Sorry that these changes aren't in an updated version yet. I've been pretty busy/lazy. flushed LMK if you want to upload them to a devel branch an we could work on it.

Another question that I've noticed in your code changes is that you would making resolving modification terms in the various controlled vocabularies the default behavior. I did it the other way around, because this adds a delay the first time when the files are retrieved from their online sources and some secret cache directory is used. This might be unexpected behavior. Instead, if the user needs to explicitly specify to match against the CVs, then at least they should be aware of side effects (longer runtime + cache used). Your reasoning is that users probably want to do this most of the time, and hence, to make that the default? For the most part I based my design choices on more of a "test driven development", since a lot of the tests for the intended behavior of the parser were already written, so instead of just adding the "resolve" argument to half of the lines in your test files I changed de default in the function. If you are going for a direct object representation, I thing the ideal behavior would be to make those a property, so they are "lazily" evaluated when needed.

I totally agree with using a code formatter. For a long time I've just been using pep8, but I've started to adopt black in the past year as well. I'll add a black autoformatter GitHub Action to ensure a consistent code style. Nice, would you mind adding the local checks you would like to be made on the code on a contributions.md ? (is there one? I didn't really see it, sorry about that)

on the other hand, might I suggest using parametrized tests for this particular cases ? I think it would make development/debugging a lot easier without sacrificing much readability (https://docs.pytest.org/en/6.2.x/example/parametrize.html).

thanks for the quick reply and look forward to helping with anything

bittremieux commented 2 years ago

long story short ...ML and I am parsing/annotating a lot of spectra, and this library already handled that really well. Nontheless I currently parse the sequences on my end and I figured I might as well help making a good parser better rather than making my bad parser "ok"

Haha, that was exactly my use case as well. I used to write variations on the same code for every project I worked on, and then decided to extract that functionality in a package for easier re-use. I'm glad it's been useful for other people as well.

If you are going for a direct object representation, I thing the ideal behavior would be to make those a property, so they are "lazily" evaluated when needed.

I hadn't thought about that yet, but that's a great suggestion! I'm working on adding lazy evaluation of the CV terms now.

Nice, would you mind adding the local checks you would like to be made on the code on a contributions.md ? (is there one? I didn't really see it, sorry about that)

Yes, I'll have to add that info. For a long time this was just a solo project, and I haven't always been using best practices for open source collaboration. I should properly address that.

on the other hand, might I suggest using parametrized tests for this particular cases ? I think it would make development/debugging a lot easier without sacrificing much readability

Hmm, I checked the documentation on parametrized tests. It's nice, but I don't immediately see how it would really facilitate the testing code. If you look at the updated tests, it is indeed very repetitive and a lot of lines. But even with parametrized tests that would still be the case I think?

bittremieux commented 2 years ago

The following functionality has now been added through parallel development:

[X] Full ProForma support (barring a few really minor edge cases). #18
[X] All glycan synonyms are considered.
[X] Unlocalized modifications (labile, unknown position) are returned as well, so e.g. precursor mass can be correctly computed.
[X] Terms are retrieved from their controlled vocabularies in a lazy fashion. Great suggestion!
[X] A platform-dependent cache directory is used based on the appdirs package.

This is currently still undocumented (#28), but it is already fully unit tested. Documentation will be added shortly.

Parsing a ProForma string using a formal grammar is a very elegant and extremely robust solution, in contrast to when using combinations of regexes. So imo as soon as it is properly documented, this can be considered the reference implementation.

Contributing guidelines and automatic code formatting will be added separately. (#27)

I think I will fully base peptide annotation and mass calculation on ProForma. So the explicit functionality to set static modifications will be removed and instead it can be specified in the peptide strings. This will make #25 obsolete.

I'm going to close this PR meanwhile, as most changes are now outdated compared to the updated ProForma functionality.

bittremieux / spectrum_utils

Bugfix/multiple static mod set #26