DISSINET / InkVisitor

An open-source, browser-based front-end application for the collection of complex structured data from textual resources in history and the social sciences into a RethinkDB database for further analysis.
BSD 3-Clause "New" or "Revised" License

Develop three modes of search (entities, anchors, and strings in full-text); search anchors of an entity in full-text by entity (suggester) #2159

Open davidzbiral opened 3 months ago

davidzbiral commented 3 months ago

Develop a function that lets the user input an entity through the suggester and searches for the anchors of this entity in one or more full-texts (selected by Rs) - e.g. for the Concept of "sect", find all places where it is anchored. Display all the snippets of text where this entity was anchored, with the possibility of clicking on a snippet to get to that place in Annotator from the search results (while still keeping the search results on the screen).
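To make the anchor lookup above more concrete, here is a minimal sketch of what such a function could look like. Everything in it is an assumption for illustration: the `Anchor` and `Snippet` shapes, the character-offset representation of anchors, and the name `findAnchorSnippets` are hypothetical and do not reflect how InkVisitor actually stores anchors.

```typescript
// Hypothetical data model, for illustration only (not InkVisitor's schema).
interface Anchor {
  entityId: string;   // the anchored entity, e.g. the Concept "sect"
  resourceId: string; // the R whose full-text contains the anchor
  start: number;      // inclusive character offset of the anchored span
  end: number;        // exclusive character offset of the anchored span
}

interface Snippet {
  resourceId: string;
  start: number;    // offset the Annotator could jump to on click
  anchored: string; // the anchored span itself
  text: string;     // the anchored span plus surrounding context
}

// Collect one snippet per anchor of `entityId`, optionally limited to some Rs.
function findAnchorSnippets(
  entityId: string,
  anchors: Anchor[],
  fullTexts: { [resourceId: string]: string },
  resourceIds?: string[],
  contextChars = 40
): Snippet[] {
  const snippets: Snippet[] = [];
  for (const anchor of anchors) {
    if (anchor.entityId !== entityId) continue;
    // An empty or missing R selection means "search all full-texts".
    if (resourceIds && resourceIds.length > 0 &&
        resourceIds.indexOf(anchor.resourceId) === -1) continue;
    const fullText = fullTexts[anchor.resourceId];
    if (fullText === undefined) continue;
    const from = Math.max(0, anchor.start - contextChars);
    const to = Math.min(fullText.length, anchor.end + contextChars);
    snippets.push({
      resourceId: anchor.resourceId,
      start: anchor.start,
      anchored: fullText.slice(anchor.start, anchor.end),
      text: fullText.slice(from, to),
    });
  }
  return snippets;
}
```

Each snippet keeps `resourceId` and `start`, so a click handler could open the Annotator at that position without discarding the result list.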

How should this be associated with search? Probably it should be one mode of search: in search, the user first selects, e.g. by a tab, whether to search:

  1. entities (default), or
  2. anchors (of specific entities - even more of them), or
  3. strings (in full texts).

Any of those three search modes should have its own specific ways of filtering down the results. Entities: as now. Anchors: filter down by Rs (i.e. limit to only some full-texts). Strings: filter down by R again.
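One way to picture the three modes and their mode-specific filters is a discriminated union, sketched below. The type `SearchQuery`, its field names, and the helper `resourcesInScope` are hypothetical; in particular, the entity-mode fields are only a placeholder for whatever filters entity search has now.

```typescript
// Hypothetical query shapes, one per search mode (illustration only).
type SearchQuery =
  | { mode: "entities"; label: string; entityClass?: string } // filters "as now"
  | { mode: "anchors"; entityIds: string[]; resourceIds?: string[] }
  | { mode: "strings"; text: string; resourceIds?: string[] };

// Entities is the default mode.
function defaultQuery(): SearchQuery {
  return { mode: "entities", label: "" };
}

// Which full-texts (Rs) a query has to scan.
function resourcesInScope(query: SearchQuery, allResourceIds: string[]): string[] {
  // In this sketch, entity search does not look into full-texts at all.
  if (query.mode === "entities") return [];
  const selected = query.resourceIds;
  // No R filter chosen: anchors and strings search every full-text.
  if (!selected || selected.length === 0) return allResourceIds.slice();
  return allResourceIds.filter((id) => selected.indexOf(id) !== -1);
}
```

Because each mode carries only its own fields, the compiler rejects, for example, an R filter on an entity query, which mirrors the idea that every mode has its own way of filtering down the results.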

In 1 and 2, optional checkboxes (can be hidden under advanced):

Also allow advanced search by SCL, SCL-CLA, CLA (entities classified as this). This will have to be some kind of query builder. But I think we should develop it for DDB2 rather than DDB1, so that we do not work on queries in a different language than the one we want to develop queries in - the future development of DB2 generation and operation from InkV.

No. 3 has the most development potential for the future, i.e. when implementing, expect many more corpus query functions to come here, esp. collocations (study search in a corpus manager, such as AntConc, to see what functions we will need under 3).

EDIT: I realize that users will also need to combine entity, anchor, and string search, so 1, 2 and 3 cannot be done as three completely separate tabs. Typically, they will want to find:

  1. Entities (optionally: of a specified entity type) whose anchored span contains the search string. Output: list of entity tags.
  2. Strings covered by an anchor of an entity/ies. (optional checkboxes: include identificates, include subclasses).
  3. Anchor-string collocations, i.e. for instance, with what words do the start anchor and end anchor of "L Czech Republic" co-occur: the user inputs entities (accepts multiple) and finds the strings which occur before and after these entities' anchors in all texts, inversely sorted by frequency (this is related to #1892). Choosing the window size is mandatory, i.e. whether only 1, or up to 20. Output: list of strings with context before and context after this entity's anchor.
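
The collocation search in item 3 could work roughly as in the sketch below. It assumes the text is already split into tokens and that anchors are known as token ranges; `TokenAnchor`, `Collocate`, and `anchorCollocations` are hypothetical names, and "inversely sorted by frequency" is read here as most frequent first.

```typescript
// Hypothetical shapes, for illustration only.
interface TokenAnchor {
  entityId: string;
  startToken: number; // index of the first anchored token
  endToken: number;   // index just past the last anchored token
}

interface Collocate {
  word: string;
  count: number;
}

// Turn a word -> count table into a list, most frequent first
// (ties broken alphabetically so the order is stable).
function rankByFrequency(counts: { [word: string]: number }): Collocate[] {
  return Object.keys(counts)
    .map((word) => ({ word, count: counts[word] || 0 }))
    .sort((a, b) =>
      b.count - a.count || (a.word < b.word ? -1 : a.word > b.word ? 1 : 0));
}

// Count the words within `windowSize` tokens before and after each anchor
// of the given entities. The window size is mandatory: 1 up to 20.
function anchorCollocations(
  tokens: string[],
  anchors: TokenAnchor[],
  entityIds: string[],
  windowSize: number
): { before: Collocate[]; after: Collocate[] } {
  if (Math.floor(windowSize) !== windowSize || windowSize < 1 || windowSize > 20) {
    throw new Error("windowSize must be an integer between 1 and 20");
  }
  const before: { [word: string]: number } = Object.create(null);
  const after: { [word: string]: number } = Object.create(null);
  for (const anchor of anchors) {
    if (entityIds.indexOf(anchor.entityId) === -1) continue;
    const from = Math.max(0, anchor.startToken - windowSize);
    for (const word of tokens.slice(from, anchor.startToken)) {
      before[word] = (before[word] || 0) + 1;
    }
    for (const word of tokens.slice(anchor.endToken, anchor.endToken + windowSize)) {
      after[word] = (after[word] || 0) + 1;
    }
  }
  return { before: rankByFrequency(before), after: rankByFrequency(after) };
}
```

Keeping the `before` and `after` counts separate matches the requested output of context before and context after the anchor. A real implementation would also have to decide on tokenization and case folding, which this sketch leaves open.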