ContentMine / ami

question: how can I make regex greedy? #72

Open egonw opened 7 years ago

egonw commented 7 years ago

I have a regular expression file:

<compoundRegex title="jrc">
    <regex fields="jrc">NM[-]?\d\d\d[K]?</regex>
</compoundRegex>

This finds NM-100, etc. However, it also finds (ctj JSON output):

    {
      "pre": " OECD reference material ",
      "name0": "jrc",
      "value0": "NM-300",
      "post": "K; nano-AgB 5–50 nm) on t",
      "pmc": "PMC3841577"
    },

Here, I would like it to "find" NM-300K, for which the regex needs to be greedy. Is there any way of instructing this in the above config file?

petermr commented 7 years ago

What are the ranges of values?

egonw commented 7 years ago

You mean the length of the pre/post text? Then: 25, 25

See full project notebook: https://gist.github.com/egonw/2779c0628da0b24b7a113bdc9e0c1a65

Possible values: NM-300K, NM-100, NM101, etc.

petermr commented 7 years ago

Not quite sure what your greedy problem is. You can use range quantifiers, e.g. [\d]{5,6} or [\d]{5,}
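To illustrate the range quantifiers mentioned here, the following is a minimal sketch using plain `java.util.regex`, outside of ami. The class name `RangeQuantifierDemo`, the helper `firstMatch`, and the sample strings are made up for illustration. Both `{5,6}` and `{5,}` are greedy, so each takes as many digits as its upper bound allows.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RangeQuantifierDemo {
    // Hypothetical helper: returns the first match of regex in text, or null if none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // {5,6}: at least 5 and at most 6 digits; greedy, so it takes 6 here.
        System.out.println(firstMatch("[\\d]{5,6}", "id 1234567")); // 123456
        // {5,}: at least 5 digits, no upper bound; takes all 7.
        System.out.println(firstMatch("[\\d]{5,}", "id 1234567"));  // 1234567
        // Fewer than 5 digits: no match at all.
        System.out.println(firstMatch("[\\d]{5,6}", "id 1234"));    // null
    }
}
```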

egonw commented 7 years ago

It matches enough digits, but indeed only if I hard-code the greediness. Getting it to match the optional K in NM-300K, however, I have not gotten to work. But no worries! I'll have a look at the ami2-regex code soon. (Feel free to assign the issue to me, something I cannot do myself...)

For now, I'm happy with my progress. I will still try to sit down with @larsgw to see if we can get his cards to work on ami2-regex output (he made some initial patches yesterday), and I was able to do something nice with the data already (see https://egonw.github.io/cmnanotox/network.html):

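[image: https://user-images.githubusercontent.com/26721/28774888-652e145a-75bd-11e7-8450-0651588387c6.png]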

petermr commented 7 years ago

I would try https://www.regexbuddy.com/ to learn what works. AMI implements Java 7 regexes, which should certainly be powerful enough for most things.

dan2097 commented 7 years ago

Java regex's default behaviour is greedy, so this case should work. One gotcha is that when trying alternatives, Java's regex engine goes with the first alternative that matches, e.g. AB(\d|\d\d) will prefer to match only one digit. This issue doesn't obviously seem to apply to the example you gave though... If you use:

NM[-]?\d\d\d[K]?(?=(\W|$))

Do you observe the same issue? (this is a zero width positive lookahead for either a "non-word" character or the end of the document)
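The three points above (greedy optional K, left-to-right alternation, and the lookahead) can be checked in isolation with a minimal sketch like the following. It uses plain `java.util.regex` only and says nothing about how ami reads or preprocesses the regex file; the class name `GreedyDemo`, the helper `firstMatch`, and the sample text are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Hypothetical helper: returns the first match of regex in text, or null if none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "OECD reference material NM-300K; nano-Ag";

        // [K]? is greedy: the K is consumed when it is present.
        System.out.println(firstMatch("NM[-]?\\d\\d\\d[K]?", text));   // NM-300K

        // Alternatives are tried left to right; the first that succeeds wins.
        System.out.println(firstMatch("AB(\\d|\\d\\d)", "AB12"));      // AB1
        System.out.println(firstMatch("AB(\\d\\d|\\d)", "AB12"));      // AB12

        // The zero-width lookahead requires a non-word character or the end
        // of input after the match, without including it in the match.
        System.out.println(firstMatch("NM[-]?\\d\\d\\d[K]?(?=(\\W|$))", text)); // NM-300K
    }
}
```

With the lookahead in place, a string such as "NM-300Kx" yields no match at all instead of a truncated "NM-300", because neither "NM-300K" nor "NM-300" is followed by a non-word character.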

egonw commented 7 years ago

Thanks, I will try this (not sure when, as teaching season starts again next week!)

egonw commented 7 years ago

@dan2097, If I use the following:

<regex fields="jrc" weight="1.0">NM[-]?\d\d\d[K]?(?=(\W|$))"</regex>

Then it does not find anything at all anymore...

My full workflow can be found at https://github.com/egonw/cmnanotox

petermr commented 7 years ago

NM-300K

It may be useful to escape the minus sign. I would use: NM-\d{3}K?

dan2097 commented 7 years ago

@egonw In your comment there appears to be a quotation mark (") at the end of the regex, just before the closing </regex> tag. Is that actually there? If so, that would be why it doesn't match ;-)

[-] is a bit unconventional, as - is also used for character ranges, e.g. [1-9], but the special meaning of the hyphen only applies when it sits between two characters, so [-] is fine. NM-?\d{3}K?(?=(\W|$)) is shorter, though.
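As a rough check of both remarks, here is a minimal sketch in plain `java.util.regex` (not ami itself). The class name `StrayQuoteDemo`, the helper `found`, and the sample text are hypothetical. A literal `"` at the end of the pattern must match a `"` in the text, so ordinary prose no longer matches, and `[-]?` behaves the same as `-?`.

```java
import java.util.regex.Pattern;

public class StrayQuoteDemo {
    // Hypothetical helper: true if regex matches anywhere in text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "reference material NM-300K; nano-Ag";

        // With the stray quotation mark the pattern demands a literal "
        // right after the identifier, so normal text does not match.
        System.out.println(found("NM[-]?\\d\\d\\d[K]?(?=(\\W|$))\"", text));        // false
        // It only matches when a quotation mark really follows.
        System.out.println(found("NM[-]?\\d\\d\\d[K]?(?=(\\W|$))\"", "NM-300\""));  // true

        // Without the quote, the long and the short spelling both match.
        System.out.println(found("NM[-]?\\d\\d\\d[K]?(?=(\\W|$))", text));          // true
        System.out.println(found("NM-?\\d{3}K?(?=(\\W|$))", text));                 // true
    }
}
```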

larsgw commented 7 years ago

Possibly related issues I found while trying things: #73, #74

egonw commented 7 years ago

@dan2097, I tried with the following file (without hits):

<compoundRegex title="jrc">
  <regex fields="jrc" weight="2.0">NM-?\d{3}K?(?=(\W|$))</regex>
</compoundRegex>

Meanwhile, I decided to just move ahead and create a dedicated plugin, which gives me more control: https://github.com/egonw/ami/commit/18edaa5c293eddf7aae466d140926ee1a913ed14