Search interface - Githubissues

jmccrae commented 7 months ago

Allow search for documents and contexts by value/text

corpus.search(lemma="cat")
corpus.text_search("cat")

Should return a generator of tuples containing the document ID and annotation index.
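
A minimal sketch of what such a generator could look like, assuming a toy in-memory corpus (a plain dict mapping a document ID to a list of per-token annotation dicts). The `search` function and the `toy_corpus` layout here are hypothetical illustrations, not the Teanga API:

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy corpus: document ID -> list of per-token annotations.
ToyCorpus = Dict[str, List[Dict[str, str]]]


def search(corpus: ToyCorpus, **criteria: str) -> Iterator[Tuple[str, int]]:
    """Lazily yield (document ID, annotation index) for every match."""
    for doc_id, annotations in corpus.items():
        for index, annotation in enumerate(annotations):
            # Every keyword argument must equal the value in that layer.
            if all(annotation.get(layer) == value
                   for layer, value in criteria.items()):
                yield doc_id, index


toy_corpus = {
    "doc1": [{"lemma": "the"}, {"lemma": "cat"}, {"lemma": "sit"}],
    "doc2": [{"lemma": "cat"}, {"lemma": "and"}, {"lemma": "dog"}],
}

hits = list(search(toy_corpus, lemma="cat"))
print(hits)
```

Because the function is a generator, a caller can stop after the first few hits without scanning the rest of the corpus.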

AdeDoyle commented 7 months ago

It would be useful to allow search for multiple values at once:

corpus.search(lemma="cat", pos!="noun")

From a GUI perspective, perhaps a left-side tab of check-boxes containing possible keys (lemma, pos, token), like you get on online stores to narrow down searches?

jmccrae commented 7 months ago

I think there is no plan to create a Teanga GUI as of yet, so we are just talking about implementing this in Python

Your proposal looks great, but is not valid Python... it would have to be something like

corpus.search(lemma="cat", pos=not("noun"))

We would also have to implement an 'or' search e.g.,

corpus.search(lemma="bank", pos=["verb", "noun"])
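
One caveat with the negated form: `not` is a Python keyword, so `not("noun")` is evaluated before the call and simply passes `False`. A wrapper object avoids that. The sketch below assumes a hypothetical `Not` class and `matches` helper (neither is part of Teanga) and treats a list as a set of alternatives:

```python
class Not:
    """Hypothetical marker: the wrapped condition must NOT match."""

    def __init__(self, condition):
        self.condition = condition


def matches(actual, condition):
    """Check one annotation value against one search condition."""
    if isinstance(condition, Not):
        return not matches(actual, condition.condition)
    if isinstance(condition, (list, tuple, set)):
        # A collection means "any of these values" (an 'or' search).
        return actual in condition
    return actual == condition


print(matches("verb", Not("noun")))            # negation
print(matches("verb", ["verb", "noun"]))       # 'or' over a list
print(matches("adverb", Not(["verb", "noun"])))  # negated 'or'
```

With this in place, a call such as `corpus.search(lemma="cat", pos=Not("noun"))` would carry the negation through to the matcher.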

AdeDoyle commented 7 months ago

I got the impression there would be a GUI from the name of this issue. If not, that makes it easier.

One follow-up question. You're using "lemma" in all your examples. Will it be possible to search for individual token forms as well?

corpus.search(token="boot", pos="noun")

Also, I suspect we'll need to implement an "and/or" search rather than just an "or" search. For example, some UD languages allow multiple nominal genders (e.g., "masculine", "neuter", and "masculine/neuter"), and you may want to find words which are either gender, or both. A simple "or" search would only return one or the other.

corpus.search(pos=["noun", "adjective"], gender=["masculine", "neuter"])

jmccrae commented 7 months ago

The keyword args (**kwargs) would be interpreted as corresponding to layers in the corpus, so your example would probably work and query the value of the annotation.

My other hack is that if the layer has no data, then the query refers to the string value, e.g., the first query would work on the corpus:

_meta:
  text:
    type: characters
  token:
    type: span
    on: text
  pos:
    type: seq
    on: token
    data: ["noun", "verb", "adjective"]
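
The fallback to the string value can be sketched as follows, assuming a toy representation in which the `token` layer stores only character spans and the `pos` layer stores one value per token. The variable names and the `layer_value` helper are hypothetical, chosen only to illustrate the idea:

```python
# Toy stand-in for a document with a data-less "token" layer and a "pos" layer.
text = "red boot fits"
token_spans = [(0, 3), (4, 8), (9, 13)]      # "token": spans over text, no data
pos_values = ["adjective", "noun", "verb"]   # "pos": one value per token


def layer_value(layer: str, index: int) -> str:
    """Return the value a query on `layer` is compared against."""
    if layer == "token":
        # No data on this layer, so fall back to the covered string.
        start, end = token_spans[index]
        return text[start:end]
    if layer == "pos":
        return pos_values[index]
    raise KeyError(layer)


def search(**criteria):
    """Return the indices of tokens matching every criterion."""
    return [
        i for i in range(len(token_spans))
        if all(layer_value(layer, i) == value
               for layer, value in criteria.items())
    ]


hits = search(token="boot", pos="noun")
print(hits)
```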

For the second query, I would interpret that as returning all words that have (pos=noun OR pos=adjective) AND (gender=masculine OR gender=neuter).
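
That interpretation amounts to an `all` over the layers wrapped around an `in` test over each layer's alternatives. A small sketch, with a hypothetical `matches_all` helper and made-up annotations:

```python
def alternatives(condition):
    """Treat a bare value as a one-element list of alternatives."""
    if isinstance(condition, (list, tuple, set)):
        return condition
    return [condition]


def matches_all(annotation, **criteria):
    # AND across layers (all), OR within one layer's alternatives (in).
    return all(
        annotation.get(layer) in alternatives(condition)
        for layer, condition in criteria.items()
    )


# Made-up annotations, purely for illustration.
words = [
    {"pos": "noun", "gender": "masculine"},
    {"pos": "adjective", "gender": "neuter"},
    {"pos": "noun", "gender": "feminine"},
    {"pos": "verb", "gender": "masculine"},
    {"pos": "noun", "gender": "masculine/neuter"},
]

hits = [
    i for i, word in enumerate(words)
    if matches_all(word, pos=["noun", "adjective"],
                   gender=["masculine", "neuter"])
]
print(hits)
```

Under exact matching, the combined value "masculine/neuter" is not found unless it is listed as a third alternative, which is the case the "and/or" concern above is about.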

One open question is how we deal with search parameters like offset and limit. If these were keyword arguments (e.g., corpus.search(offset=3)), they could be confused with a layer of the same name, so I would suggest instead that we return an object that can be further specified, e.g.,

corpus.search(pos="noun").offset(10).top(10)
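
A rough sketch of such a chainable result object, assuming a hypothetical `SearchResult` class that wraps any iterable of hits. The method names `offset` and `top` follow the example above; everything else is an assumption:

```python
from itertools import islice


class SearchResult:
    """Hypothetical lazy result set that can be narrowed by chaining."""

    def __init__(self, hits):
        self._hits = hits      # any iterable of hits
        self._offset = 0
        self._limit = None     # None means "no limit"

    def offset(self, n):
        """Skip the first n hits."""
        self._offset = n
        return self

    def top(self, n):
        """Return at most n hits."""
        self._limit = n
        return self

    def __iter__(self):
        stop = None if self._limit is None else self._offset + self._limit
        return islice(iter(self._hits), self._offset, stop)


# Stand-in for real hits: the numbers 0..99.
page = list(SearchResult(range(100)).offset(10).top(10))
print(page)
```

Keeping paging out of the keyword arguments leaves every keyword free to name a layer, including a layer that happens to be called `offset` or `limit`.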

jmccrae commented 6 months ago

I have been thinking about this and I am leaning towards a MongoDB-style query interface based on the kwargs

For example

corpus.search(lemma="cat", pos={"$not": "noun"})
corpus.search(lemma="cat", pos={"$in": ["noun", "verb"]})
corpus.search(lemma="cat", pos={"$nin": ["noun", "verb"]}) # nin = not in
corpus.search(lemma={"$regex": "cats?"})

The whole query can even be combined into a single dictionary

corpus.search({"$or": [{ "lemma": "cat" }, {"lemma": "dog"}]})
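
To make the proposal concrete, here is a minimal evaluator for the operators used in the examples above (`$not`, `$in`, `$nin`, `$regex`, `$or`). It is only a sketch: `match_value` and `match_query` are hypothetical names, annotations are plain dicts, and the semantics follow the examples in this thread rather than MongoDB's exact behaviour:

```python
import re


def match_value(actual, condition):
    """Check one layer value against a plain value or an operator dict."""
    if not isinstance(condition, dict):
        return actual == condition
    for op, operand in condition.items():
        if op == "$not":
            ok = not match_value(actual, operand)
        elif op == "$in":
            ok = actual in operand
        elif op == "$nin":
            ok = actual not in operand
        elif op == "$regex":
            # Unanchored search; use re.fullmatch for whole-value matches.
            ok = actual is not None and re.search(operand, actual) is not None
        else:
            raise ValueError(f"unsupported operator: {op}")
        if not ok:
            return False
    return True


def match_query(annotation, query):
    """Check an annotation dict against a whole query dict."""
    for key, condition in query.items():
        if key == "$or":
            if not any(match_query(annotation, sub) for sub in condition):
                return False
        elif not match_value(annotation.get(key), condition):
            return False
    return True


cat_noun = {"lemma": "cat", "pos": "noun"}
print(match_query(cat_noun, {"lemma": "cat", "pos": {"$in": ["noun", "verb"]}}))
print(match_query(cat_noun, {"lemma": "cat", "pos": {"$not": "noun"}}))
print(match_query({"lemma": "dog"}, {"$or": [{"lemma": "cat"}, {"lemma": "dog"}]}))
```

A `search(query=None, **kwargs)` front end could merge the keyword arguments into the dictionary before calling the matcher, so both call styles shown above would share one code path.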

Does this seem

AdeDoyle commented 6 months ago

I like this a lot better. It seems more intuitive, and easier to write more powerful searches.

TeangaNLP / teanga2

Search interface #15