biolab / orange3-text

🍊 📄 Text Mining add-on for Orange3

RegExp for matching two exact words in Statistics #1010

Open abocin opened 11 months ago

abocin commented 11 months ago

What's your use case?

In Statistics, when I use the Contains feature to search for a specific word, it returns one or more matches and shows which documents contain the word. However, when I use the RegExp feature in Statistics with a simple search string such as \bword\b, the search returns 0 results.

This is not a major issue, since I can simply use Contains to count how many times the word occurs. However, when I want to find a group of two (key)words, Contains fails to return any results: I enter "word1", a space, and "word2" in the Contains box, and nothing is found even though these two words appear in exactly that order in the text of my document. Since Contains does not seem suited to this kind of search, I tried RegExp, but that does not seem to work either. I tried many expressions, from the simple /^(apple|banana)$/ to (apple|banana), apple|banana, and \b(apple|banana)(?:\W+\w+){1,6}?\W+(apple|banana)\b.

My task is quite simple. I need to find keywords in a document, but sometimes a keyword is actually a group of two (or more) words that together define a concept. For example, I want to find all the sentences in my documents that contain the word group "apple banana", preferably with just a space between the two words, although they could also be up to six words apart (see the last RegExp example above). I don't know exactly what input the RegExp field in Statistics in the Text Mining add-on expects.
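For reference, the search described above can be sketched with Python's standard `re` module applied to raw sentence text. This only illustrates the pattern style; it does not go through the Statistics widget and makes no claim about how the widget applies the expression. The sample sentences and variable names are made up for illustration.

```python
import re

# Hypothetical sample sentences, for illustration only.
sentences = [
    "I ate an apple banana smoothie.",
    "The apple was red, and the banana was ripe.",
    "Nothing relevant here.",
]

# "apple" immediately followed by "banana", separated only by whitespace.
adjacent = re.compile(r"\bapple\s+banana\b", re.IGNORECASE)

# "apple" followed by "banana" with at most six other words in between.
nearby = re.compile(r"\bapple\b(?:\W+\w+){0,6}?\W+banana\b", re.IGNORECASE)

# Keep the sentences in which each pattern matches somewhere.
adjacent_hits = [s for s in sentences if adjacent.search(s)]
nearby_hits = [s for s in sentences if nearby.search(s)]

print(adjacent_hits)
print(nearby_hits)
```

Both patterns rely on the whole sentence being available as one string, because they span the whitespace between the two words.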

Some examples would be useful. The documentation mentions RegExp for other widgets, such as Corpus Viewer or Preprocess Text, but I could not work out from it how to write a RegExp for two or more words either.

What's your proposed solution?

Can you please provide the exact format of the RegExp input, and say which RegExp format or style should be used to return valid search results for a group of two or more words, i.e. "word1" space "word2" space "word3"?

Are there any alternative solutions?

ajdapretnar commented 11 months ago

The issue is that Regex searches in tokens, which by default are constructed as 1-grams, so a pattern that spans two words never sees both words in the same token. Ideally, regex would search the text, not the tokens. We will think about a better solution for this.
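The effect can be reproduced outside Orange with a minimal sketch using only Python's `re` module. The tokenizer here (`\w+` on lowercased text) and the way 2-grams are joined with a space are assumptions for illustration; the add-on's actual tokenization and n-gram handling may differ.

```python
import re

# Hypothetical document text, for illustration only.
text = "She bought an apple banana smoothie."

# Assumption: a simple word tokenizer producing 1-gram tokens.
tokens = re.findall(r"\w+", text.lower())

# A pattern that spans two words.
pattern = re.compile(r"\bapple\s+banana\b")

# Matching token by token: no single 1-gram contains both words.
token_matches = sum(1 for tok in tokens if pattern.search(tok))

# Assumption: 2-grams built by joining neighbouring tokens with a space.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
bigram_matches = sum(1 for bg in bigrams if pattern.search(bg))

# Matching the raw text instead of the tokens.
text_matches = len(pattern.findall(text.lower()))

print(token_matches, bigram_matches, text_matches)
```

In this sketch the two-word pattern finds nothing among the 1-gram tokens, but it does match once the unit being searched contains both words, whether that unit is a 2-gram or the full text.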