Open jmccrae opened 7 months ago
It would be useful to allow search for multiple values at once:
corpus.search(lemma="cat", pos!="noun")
From a GUI perspective, perhaps a left-side tab of check-boxes containing possible keys (lemma, pos, token), like you get on online stores to narrow down searches?
I think there is no plan to create a Teanga GUI as of yet, so we are just talking about implementing this in Python
Your proposal looks great, but is not valid Python... it would have to be something like
corpus.search(lemma="cat", pos=not("noun"))
We would also have to implement an 'or' search e.g.,
corpus.search(lemma="bank", pos=["verb", "noun"])
I got the impression there would be a GUI from the name of this issue. If not, that makes it easier.
One follow-up question. You're using "lemma" in all your examples. Will it be possible to search for individual token forms as well?
corpus.search(token="boot", pos="noun")
Also, I suspect we'll need to implement an "and/or" search rather than just an "or" search. For example, some UD languages allow multiple nominal genders (e.g., "masculine", "neuter", and "masculine/neuter"), and you may want to find words which are either gender, or both. A simple "or" search would only return one or the other.
corpus.search(pos=["noun", "adjective"], gender=["masculine", "neuter"])
The keyword args (**kwargs) would be interpreted as corresponding to layers in the corpus, so your example would probably work and query the value of the annotation.
My other hack is that if the layer has no data, then the query refers to the string value, e.g., the first query would work on the corpus
_meta:
  text:
    type: characters
  token:
    type: span
    on: text
  pos:
    type: seq
    on: token
    data: ["noun", "verb", "adjective"]
For the second query, I would interpret that as return all words that have (pos=noun OR pos=adjective) AND (gender=masculine OR gender=neuter)
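A minimal sketch of that interpretation, assuming each word is available as a plain dict of layer values. The `matches` helper and the dict representation are hypothetical and only illustrate the AND-of-ORs logic; they are not part of Teanga.

```python
def matches(annotation, **kwargs):
    """Return True if the annotation satisfies every keyword constraint.

    A list value means "any of these" (OR); separate keywords are
    combined with AND. `annotation` is a hypothetical dict mapping
    layer name -> value.
    """
    for layer, wanted in kwargs.items():
        # Treat a single value as a one-element list of alternatives.
        allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
        if annotation.get(layer) not in allowed:
            return False
    return True


# Made-up example words, not taken from any real corpus.
words = [
    {"lemma": "Haus", "pos": "noun", "gender": "neuter"},
    {"lemma": "Frau", "pos": "noun", "gender": "feminine"},
    {"lemma": "gut", "pos": "adjective", "gender": "masculine"},
]
hits = [w["lemma"] for w in words
        if matches(w, pos=["noun", "adjective"], gender=["masculine", "neuter"])]
```

Here `hits` keeps only the words that pass both the pos test and the gender test.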
One open question is how we deal with search parameters like offset and limit. If these were keyword arguments (e.g., corpus.search(offset=3)), it may be ambiguous whether they refer to a layer or to a search parameter, so I would suggest instead we return an object that can be further specified, e.g.,
corpus.search(pos="noun").offset(10).top(10)
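One way such a chainable object could look, as a rough sketch. The `SearchResult` class is hypothetical, and it assumes the matches are available as any re-iterable sequence; only the method names `offset` and `top` come from the example above.

```python
from itertools import islice


class SearchResult:
    """Hypothetical lazy result object; not the actual Teanga API."""

    def __init__(self, source, offset=0, limit=None):
        self._source = source    # any re-iterable of matches
        self._offset = offset    # number of matches to skip
        self._limit = limit      # maximum number of matches, or None

    def offset(self, n):
        # Return a new object so the original result stays reusable.
        return SearchResult(self._source, n, self._limit)

    def top(self, n):
        return SearchResult(self._source, self._offset, n)

    def __iter__(self):
        stop = None if self._limit is None else self._offset + self._limit
        return islice(iter(self._source), self._offset, stop)


# Stand-in for real matches: the integers 0..99.
results = SearchResult(range(100)).offset(10).top(3)
first_page = list(results)
```

Because `offset` and `top` are methods on the result rather than keyword arguments, they cannot clash with a layer that happens to be called "offset" or "limit".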
I have been thinking about this and I am leaning towards a MongoDB-style query interface based on the kwargs
For example
corpus.search(lemma="cat", pos={"$not": "noun"})
corpus.search(lemma="cat", pos={"$in": ["noun", "verb"]})
corpus.search(lemma="cat", pos={"$nin": ["noun", "verb"]}) # nin = not in
corpus.search(lemma={"$regex": "cats?"})
The whole query can even be combined into a single dictionary
corpus.search({"$or": [{ "lemma": "cat" }, {"lemma": "dog"}]})
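To make the proposal concrete, here is a small sketch of how such queries might be evaluated against a single annotation, represented as a dict of layer values. `match_value` and `match_query` are hypothetical names, only the operators shown above are handled, and anchoring `$regex` to the whole value is an assumption of this sketch (MongoDB itself matches anywhere in the string).

```python
import re


def match_value(value, cond):
    """Test one layer value against a literal or an operator dict."""
    if not isinstance(cond, dict):
        return value == cond
    for op, arg in cond.items():
        if op == "$not":
            ok = not match_value(value, arg)
        elif op == "$in":
            ok = value in arg
        elif op == "$nin":
            ok = value not in arg
        elif op == "$regex":
            # Assumption: the pattern must match the whole value.
            ok = isinstance(value, str) and re.fullmatch(arg, value) is not None
        else:
            raise ValueError(f"unsupported operator: {op}")
        if not ok:
            return False
    return True


def match_query(annotation, query):
    """Test a dict of layer values against a whole query dictionary."""
    for key, cond in query.items():
        if key == "$or":
            # At least one sub-query must match.
            if not any(match_query(annotation, sub) for sub in cond):
                return False
        elif not match_value(annotation.get(key), cond):
            return False
    return True


word = {"lemma": "cat", "pos": "verb"}
is_hit = match_query(word, {"lemma": "cat", "pos": {"$not": "noun"}})
```

With this shape, the keyword form and the single-dictionary form can share one code path, since the kwargs are themselves just a dict.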
Does this seem
I like this a lot better. It seems more intuitive, and easier to write more powerful searches.
Allow search for documents and contexts by value/text
Should return a generator of tuples containing the document ID and annotation index
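As an illustration of that return shape, a minimal sketch over a toy in-memory structure. The `search` function and the `docs` mapping of document ID to layer values are hypothetical stand-ins, not the Teanga corpus API.

```python
def search(docs, layer, value):
    """Yield (document ID, annotation index) for each matching annotation.

    `docs` is a hypothetical mapping of document ID -> {layer: [values]}.
    """
    for doc_id, layers in docs.items():
        for index, annotation in enumerate(layers.get(layer, [])):
            if annotation == value:
                yield doc_id, index


# Made-up documents for illustration only.
docs = {
    "doc1": {"pos": ["det", "noun", "verb"]},
    "doc2": {"pos": ["noun", "noun"]},
}
hits = list(search(docs, "pos", "noun"))
```

Returning a generator means a caller can stop after the first few hits without scanning the whole corpus.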