Open egonw opened 7 years ago
What are the ranges of values?
On Sun, Jul 30, 2017 at 1:23 PM, Egon Willighagen notifications@github.com wrote:
I have a regular expression file:
NM[-]?\d\d\d[K]?
This finds NM-100, etc. However, it also finds (ctj JSON output):
{ "pre": " OECD reference material ", "name0": "jrc", "value0": "NM-300", "post": "K; nano-AgB 5–50 nm) on t", "pmc": "PMC3841577" },
Here, I would like it to "find" NM-300K, for which the regex needs to be greedy... Is there any way of instructing this in the above config file?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ContentMine/ami/issues/72, or mute the thread https://github.com/notifications/unsubscribe-auth/AAsxS2y964cxeaZmpaDc-_73PelI9S3dks5sTHXBgaJpZM4Onnvm .
-- Peter Murray-Rust Reader Emeritus University of Cambridge +44-1223-763069 and ContentMine Ltd
You mean the length of the pre/post text? Then: 25, 25
See full project notebook: https://gist.github.com/egonw/2779c0628da0b24b7a113bdc9e0c1a65
Possible values: NM-300K, NM-100, NM101, etc.
Not quite sure what your greedy problem is. You can use ranges, e.g. [\d]{5,6} or [\d]{5,}
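For reference, a minimal sketch of how range quantifiers behave in plain java.util.regex, outside AMI. The class name, the helper firstMatch and the sample sentence are made up for illustration; how ami2-regex loads and applies the pattern is not shown in this thread, so this only checks the regex itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of AMI.
class QuantifierRangeDemo {
    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "OECD reference material NM-300K; nano-Ag";
        // {3} is shorthand for \d\d\d; the optional K? is greedy and takes the K when present.
        System.out.println(firstMatch("NM-?\\d{3}K?", text));  // prints NM-300K
        // {5,6} demands five or six digits, so this sample yields no match.
        System.out.println(firstMatch("NM-?\\d{5,6}", text));  // prints null
    }
}
```

If the bare pattern matches NM-300K here but AMI still reports NM-300, the cause is probably somewhere other than the quantifiers.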
It matches enough digits, but indeed, only if I make the greediness hard coded... but getting it to match the optional K in NM-300K I have not gotten to work. But no worries! I'll have a look at the ami2-regex code soon. (Feel free to assign the issue to me, something I cannot do myself...)
For now, I'm happy with my progress. I will still try to sit down with @larsgw to see if we can get his cards to work on ami2-regex output (he made some initial patches yesterday), and I was able to do something nice with the data already (see https://egonw.github.io/cmnanotox/network.html):
[image: image] https://user-images.githubusercontent.com/26721/28774888-652e145a-75bd-11e7-8450-0651588387c6.png
I would try https://www.regexbuddy.com/ to learn what works. AMI implements Java7 regexes which should certainly be powerful enough for most things.
Java regex quantifiers are greedy by default, so this case should work. One gotcha is that when trying alternatives, Java's regex engine goes with the first one that matches, e.g. AB(\d|\d\d) will prefer to match only one digit. This issue doesn't obviously seem to apply to the example you gave though... If you use:
Do you observe the same issue? (this is a zero width positive lookahead for either a "non-word" character or the end of the document)
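The two points above (ordered alternation and the zero-width lookahead) can be checked in isolation with a small sketch. It uses plain java.util.regex rather than AMI itself, and the class name, helper and sample strings are invented for illustration. The lookahead is written as (?=(\W|$)), the form quoted elsewhere in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of AMI.
class AlternationLookaheadDemo {
    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Alternatives are tried left to right; the first that lets the match succeed wins.
        System.out.println(firstMatch("AB(\\d|\\d\\d)", "AB12"));  // prints AB1
        System.out.println(firstMatch("AB(\\d\\d|\\d)", "AB12"));  // prints AB12

        String withLookahead = "NM[-]?\\d\\d\\d[K]?(?=(\\W|$))";
        // The lookahead requires a non-word character or the end of input after the match.
        System.out.println(firstMatch(withLookahead, "material NM-300K; nano-Ag"));  // prints NM-300K
        // A fourth digit is a word character, so the lookahead rejects this candidate.
        System.out.println(firstMatch(withLookahead, "batch NM-3000 here"));  // prints null
    }
}
```

The lookahead does not consume the character it inspects, so the reported match still ends at the K.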
Thanks, I will try this (not sure when, as teaching season starts again next week!).
@dan2097, If I use the following:
<regex fields="jrc" weight="1.0">NM[-]?\d\d\d[K]?(?=(\W|$))"</regex>
Then it does not find anything at all anymore...
My full workflow can be found at https://github.com/egonw/cmnanotox
NM-300K
It may be useful to escape the minus sign. I would use: NM-\d{3}K?
@egonw In your comment there appears to be a quotation mark (") before </regex>. Is that actually there? If so, that would be why it doesn't match ;-)
[-] is a bit unconventional, as - is also used for character ranges, e.g. [1-9], but the special meaning of the hyphen applies only when it sits between two characters, so [-] is fine.
NM-?\d{3}K?(?=(\W|$)) is shorter, though.
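A small sketch of why a stray quotation mark at the end of the pattern would suppress every hit, and of [-]? and -? behaving the same. This again uses plain java.util.regex with an invented class name, helper and sample text. It assumes the quote really ends up in the pattern string that AMI compiles, which this thread does not confirm.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of AMI.
class StrayQuoteDemo {
    // Returns the first match of regex in text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "OECD reference material NM-300K; nano-Ag";
        // [-]? and -? are equivalent: a lone hyphen in a character class is a literal.
        System.out.println(firstMatch("NM[-]?\\d\\d\\d[K]?(?=(\\W|$))", text));  // prints NM-300K
        System.out.println(firstMatch("NM-?\\d{3}K?(?=(\\W|$))", text));         // prints NM-300K
        // With a trailing quote the pattern also demands a literal " after the name,
        // which ordinary running text does not contain.
        System.out.println(firstMatch("NM-?\\d{3}K?(?=(\\W|$))\"", text));       // prints null
    }
}
```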
Possibly related issues I found while trying things: #73, #74
@dan2097, I tried with the following file (without hits):
<compoundRegex title="jrc">
<regex fields="jrc" weight="2.0">NM-?\d{3}K?(?=(\W|$))</regex>
</compoundRegex>
Meanwhile, I decided to just move ahead and create a dedicated plugin, which gives me more control: https://github.com/egonw/ami/commit/18edaa5c293eddf7aae466d140926ee1a913ed14