biolab / orange3-text

🍊 📄 Text Mining add-on for Orange3

Concordance: Use re module to look for words #726

Open otichy opened 2 years ago

otichy commented 2 years ago

It would be really useful for linguistic research to be able to query using regular expressions and possibly by annotations such as POS or normalized text. This would make Concordance more similar to the usual corpus manager keyword-in-context tools like SketchEngine, Manatee, KonText or CQP.

ajdapretnar commented 2 years ago

It's a nice idea, but Orange depends on vastly different data structures than SketchEngine. Orange is not, in essence, intended for querying corpora, but for visualization and machine learning. The services you've mentioned have indexed corpora in the background. Orange doesn't. So this is mostly a question of what each tool is intended for.

Querying by regular expressions is already enabled in Corpus Viewer (the view is not a concordance, though, just running text with highlighted words).

otichy commented 2 years ago

Sure, I did not mean to suggest that Orange should become a corpus query manager. However, the Text plugin has great appeal for textual analysis, and KWIC or Concordance is, in my opinion, the basic tool that should come in handy for almost any text exploration and analysis. Without regexp, querying synthetic languages (unlike English) is really problematic. As you point out, Corpus Viewer already has this feature, which made me think that adding it to Concordance might not be that difficult. But of course, I understand that this might not be in your plans.

ajdapretnar commented 2 years ago

The thing is, Orange currently uses the NLTK structure, which builds concordances from tokens. This leads to all sorts of problems, such as #320. Tokens, as you can imagine, cannot work with regular expressions, because the search looks at single words, not at the whole text.
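To illustrate the token limitation described above, here is a minimal sketch. It assumes plain whitespace splitting as a stand-in for a real tokenizer (this is not Orange's or NLTK's actual code), and the sample text and pattern are made up for the example. A pattern that spans two words finds nothing when applied token by token, but matches when run over the whole text.

```python
import re

text = "Orange supports text mining and data mining workflows."

# Assumption: whitespace splitting stands in for a real tokenizer.
tokens = text.split()

# A pattern that spans more than one word.
pattern = re.compile(r"text\s+min\w+")

# Searching token by token: no single token contains the whole phrase.
token_hits = [t for t in tokens if pattern.search(t)]

# Searching the raw text: the phrase is found.
text_hits = [m.group(0) for m in pattern.finditer(text)]

print(token_hits)
print(text_hits)
```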

I agree that this would be a great added value, but it really comes down to who can implement this. Our lab is too small and project-dependent to be able to tackle larger side-tasks. 😞 I promise to think about it and see what can be done.

otichy commented 2 years ago

OK, I understand. I have now noticed you can actually achieve this with the Textable plugin, so it's not that urgent :)

Thanks!

ajdapretnar commented 2 years ago

Note for developers: try using the re library for search (https://docs.python.org/3/library/re.html): find the index of each match and show the text from index ± a specified range. This might work and also solve #320.
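A rough sketch of the idea in the note above, not the widget's actual implementation: `regex_concordance`, its `width` parameter, and the sample text are hypothetical names chosen for illustration. It runs `re.finditer` over the raw text, takes each match's start and end index, and slices a fixed number of characters on either side.

```python
import re


def regex_concordance(text, pattern, width=30, flags=re.IGNORECASE):
    """Return (left, match, right) tuples for every regex match in text.

    Hypothetical helper: `width` is the number of context characters
    kept on each side of a match.
    """
    rows = []
    for m in re.finditer(pattern, text, flags):
        start, end = m.span()
        if start == end:
            continue  # skip empty matches, e.g. from a pattern like r"x*"
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        rows.append((left, m.group(0), right))
    return rows


sample = "The miners mined ore. Text mining is different from data mining."
rows = regex_concordance(sample, r"\bmin\w*", width=10)

# Print a simple keyword-in-context view.
for left, keyword, right in rows:
    print(f"{left:>10} | {keyword} | {right}")
```

Because the context is cut by character offsets, it can start or end in the middle of a word; trimming to word boundaries would be a possible refinement.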