HUPO-PSI / psi-ms-CV

HUPO-PSI mass spectrometry CV
Other
28 stars 37 forks source link

allow negative score in MS:1002665? ('regular expression for interaction scores derived from cross-linking') #142

Closed colin-combe closed 2 years ago

colin-combe commented 2 years ago

Describe the question or discussion

Hi, would it be possible to just change MS:1002665 -- 'regular expression for interaction scores derived from cross-linking' -- so that it allows a negative score?

These appear in our results sometimes and i was wondering if the validation issues it causes could be solved by just changing the reg ex to allow them, best wishes, Colin

mobiusklein commented 2 years ago

We can make that change. Which tool are you using that parses that regular expression? I noticed that it uses POSIX character classes which are not compatible with ECMAScript regular expressions used in other places in the specification.

The current pattern: ([:digit:]+[.][a|b]:([:digit:]+|null):[:digit:]+[.][:digit:]+([Ee][+-][0-9]+)*:(true|false]\{1\})) The updated pattern: ([:digit:]+[.][a|b]:([:digit:]+|null):-?[:digit:]+[.][:digit:]+([Ee][+-][0-9]+)*:(true|false]\{1\}))

colin-combe commented 2 years ago

Thanks

Which tool are you using that parses that regular expression?

er, I'm not sure, the one the validator at https://github.com/HUPO-PSI/mzIdentML/tree/master/validator uses

(@lutzfischer)

mobiusklein commented 2 years ago

Thank you for following up. I apologize for the delay.

We can update the rule in the CV, but it turns out that the mzIdentML validator hard-codes its own copy of the regular expression (using non POSIX syntax). I can update the pattern in both places to permit the number to be negative.

Do you use a pre-built version of the validator downloaded from somewhere, or do you compile it locally? I'm increasingly... less confident in my ability to make Java projects from disparate IDEs "just build".

lutzfischer commented 2 years ago

I would maybe include an explicit + as well ([:digit:]+[.][a|b]:([:digit:]+|null):[\-+]?[:digit:]+[.][:digit:]+([Ee][+-][0-9]+)*:(true|false]\{1\})).

Also could somebody enlighten me about what is the meaning of the part ]\{1\}? The closing ] is not matching any opening one.

Possibly changed a bit to be more general (still with the odd part at the end):

([0-9]+[.][a|b]:([0-9]+|null):[\-+]?[0-9]+([.][0-9]+)?([Ee][+-][0-9]+)*:(true|false]\{1\})).

lutzfischer commented 2 years ago

Also just looked up posix regular expressions - if it is mend to be that, then the [:digit:] would need to be [[:digit:]] so not sure what flavour of regular expressions that is.

colin-combe commented 2 years ago

I apologize for the delay.

no worries Joshua, we meant to get back to you, as @lutzfischer just has

Do you use a pre-built version of the validator downloaded from somewhere, or do you compile it locally? I'm increasingly... less confident in my ability to make Java projects from disparate IDEs "just build".

We use the pre-built version in the mzIdentML repo - I will see how I get on rebuilding the validator and report any problems in the mzIdentML repo

mobiusklein commented 2 years ago

I've previously opened an issue on the CV about the regular expression dialect being incorrect. This is part of why I asked which software @colin-combe was using since it was unclear if that regular expression was being read from the CV anywhere, as opposed to someone re-writing it by hand in one of the more commonly supported dialects.

@lutzfischer that stray closing brace does not match anything appears to be a typo. The \{1\} also doesn't seem to be part of any real instances of the pattern.

The pattern must match the following construction from page ~22 of the mzIdentML spec doc: image

So ([0-9]+[.][a|b]:([0-9]+|null):([\-+]?[0-9]+([.][0-9]+)?([Ee][+-][0-9]+)*):(true|false)) is the actual required pattern and it matches all the examples from the spec doc and a few extras I threw in to exercise other branches:

100.b:null:0.001:true
106.b:146:-0.0294:true
106.a:436:0.0294:true
106.a:436:0.0294:false
106.a:436:+4e-5:true

I think we may want to mark some groups as non-capturing like the exponent group but that type of annotation is much more dialect specific. In HUPO-PSI/mzIdentML#64 I found some limited evidence that we're matching against the PCRE regex implementation, though there is some evidence that we are also using the ECMAScript dialect.

lutzfischer commented 2 years ago

I think the last * should be ? ([0-9]+[.][a|b]:([0-9]+|null):([\-+]?[0-9]+([.][0-9]+)?([Ee][+-][0-9]+)?):(true|false))

Also if we care about capturing I guess we should catch some other groups as well: ([0-9]+)[.]([a|b]):([0-9]+|null):([\-+]?[0-9]+([.][0-9]+)?([Ee][+-][0-9]+)?):(true|false)

group 1: ID group 2: site group 3: position group 4: score and (I think) group 7: pass

As you say, we could make 5 and 6 non capturing - but that would pin it more towards a some flavours of RE. ([0-9]+)[.]([a|b]):([0-9]+|null):([\-+]?[0-9]+(?:[.][0-9]+)?(?:[Ee][+-][0-9]+)?):(true|false)

chambm commented 2 years ago

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Or 3 or 4 in this case. :) The 0-9 instead of :digit: makes it more readable as well as compatible. Do our spec docs not specify which regex dialect we're requiring compatibility with (but still try to use lowest-common-denominator kind of syntax)?

mobiusklein commented 2 years ago

From the mzIdentML spec doc:

7.9 Enzyme definition The SHOULD contain a specification of which enzyme (if any) was applied in the search. The element has optional sub-elements for specifying the using a CV term and the cleavage site, using a regular expression. Regular expressions should be encoded following the notation of Perl Compatible Regular Expressions (PCRE regex, http://www.pcre.org, matching the syntax and semantics of Perl version 5). The PSI-MS CV contains terms for the most common enzymes with pre-defined regular expressions (Table 6). If the enzyme used is present in the PSI-MS CV, the term MUST be encoded under

unless the rule given in the CV does not match that used by the software or if the enzyme used is not present in the CV, in which case the regular expression used MUST be given in the element . If the element is used, the regular expression MAY also be provided additionally. For a no enzyme search, (i.e. one where there may be a cleavage at any residue), the CV term MS:1001091 ‘NoEnzyme’ MUST be specified, and the missedCleavages and semiSpecific attributes SHOULD NOT be specified. If two or more enzymes are used, multiple elements SHOULD be provided rather than trying to build a regular expression covering all cleavage sites. If the software uses a name for an enzyme other than the one specified in the CV, a user param term MAY also be given. The following guidelines SHOULD be followed when generating regular expressions in an instance document for enzymes not present in the CV: 1) use the PCRE supplied negation syntax for look-ahead and look-behind assertions and 2) use the most compact representation possible for a regex. The start of a match specifies the cleavage point. For example the enzyme trypsin, which cleaves following a K or R residue unless the next residue is P, has the regular expression: (?<=[KR])(?!P) The ?<= is a "zero-width positive look-behind assertion", and [] means one of this character set. So, this rule is to look behind for a K or R. ?! is a zero-width positive look-ahead assertion, and ?!P means any character that is not P. An example of an “N-term” enzyme is Asp-N which cleaves before D or B. This can be described using the PCRE: (?=[BD])

This is the only place the regex dialect appears to be discussed.

mobiusklein commented 2 years ago

([0-9]+)[.]([a|b]):([0-9]+|null):([\-+]?[0-9]+(?:[.][0-9]+)?(?:[Ee][+-][0-9]+)?):(true|false) looks right, and nicely chunks up my test cases and is compatible with PCRE, ECMAScript, and SRE which means Perl, PHP, Java, C#, C++, and Python's standard library regex implementations are all happy. Sadly, while named capture groups might make this nicer to consume, the syntax for those does diverge across the three regex implementations there (though PCRE supports all the syntaxes that the other two use only one of).

Obviously, the capture groups change indices so this breaks all the things that rely on those, but I believe we can safely say that nobody was directly reading the pattern from the CV because the pattern in the CV didn't match the example texts.

I'll make this change to the CV and then we'll see how terrible it will be to get the validator to build.

edeutsch commented 2 years ago

thanks, everyone!

colin-combe commented 2 years ago

what about the other places in the obo file that contain "[:digit:]" in regular expressions? Should all of these be fixed?

re. the Validator - https://github.com/HUPO-PSI/mzidentml-validator/issues/10 (I've seen something like that before and, whilst it's probably possible to find the source code and compile the libraries needed locally, it would be nice if it just built out of the box.)