Alina-enni / lingdiggers

Project for the Building NLP Applications course

Refined search options #16

Closed miglamigla closed 2 years ago

miglamigla commented 2 years ago

Implement some of the ideas below to improve your search application. You can build on your Boolean search engine or your Tf-idf based search engine or a hybrid of both. You can also save some of the ideas for later, for the final application that you deliver at the end of the course.

a. Stemming: Your search engine should find not only documents in which your search terms are exact matches, but also documents that contain other inflected forms. For instance, if you work on English data, a search for house will find documents containing house, houses, housing etc. Similarly, a search for houses will return the same documents.

b. Stemming and exact match: A more challenging task would be to support both stemming and exact matches. For instance, you could run exact matching on terms enclosed in double quotes, such as "house", and stemming on terms without quotes, such as house. You could either require the whole query to be of one type, or you could allow a mix of both types in a single query, such as clean "house".

c. Multi-word phrases: Often it is useful to find contiguous phrases of words rather than just words anywhere in a document. A user searching for "New York" may not expect to find documents where new and York both occur, but separately. You could support phrases either as exact matches or with stemming. There are multiple ways of solving this. You can look at the documentation of the CountVectorizer and TfidfVectorizer to get some ideas. You can also come up with your own approach.

d. Wildcard searches: Let the users search on incomplete terms, such as hous* (easiest) or *ing (similar to previous case) or h*ing (hardest). Read Chapter 3 of the book to learn more about this topic.

e. Suggest your own idea: Discuss more ideas with the teachers.
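
As a rough illustration of ideas a and b, the sketch below keeps two inverted indexes, one over exact tokens and one over stems, and sends quoted query terms to the first and unquoted terms to the second. Everything in it is an assumption made for illustration: `toy_stem` is a crude stand-in for a real stemmer (for example NLTK's PorterStemmer), the function names are hypothetical, terms are combined with AND, and quoted terms are limited to single words.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")
# A query term is either "quoted" (exact match) or bare (stemmed match).
QUERY_RE = re.compile(r'"([^"]+)"|(\S+)')

def toy_stem(word):
    """Crude suffix stripper; a placeholder for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_indexes(docs):
    """Return (exact, stemmed) inverted indexes: term -> set of doc ids."""
    exact, stemmed = defaultdict(set), defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in TOKEN_RE.findall(text.lower()):
            exact[token].add(doc_id)
            stemmed[toy_stem(token)].add(doc_id)
    return exact, stemmed

def search(query, exact, stemmed, n_docs):
    """AND together all terms; quoted terms must match exactly."""
    hits = set(range(n_docs))
    for quoted, plain in QUERY_RE.findall(query):
        if quoted:
            hits &= exact.get(quoted.lower(), set())
        else:
            hits &= stemmed.get(toy_stem(plain), set())
    return sorted(hits)

docs = [
    "We clean the house every week.",
    "Housing prices rose again.",
    "Two houses were cleaned yesterday.",
    "The cleaner left early.",
]
exact, stemmed = build_indexes(docs)
print(search("house", exact, stemmed, len(docs)))          # [0, 1, 2]
print(search('"house"', exact, stemmed, len(docs)))        # [0]
print(search('clean "house"', exact, stemmed, len(docs)))  # [0]
```

The same split could be applied to a tf-idf engine by fitting one vectorizer on the raw tokens and another on the stemmed tokens, then choosing the matrix per query term.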
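
For idea c, one "own approach" is a positional index, which records where each token occurs so that a phrase matches only when its words appear at consecutive positions. The sketch below is a minimal version under stated assumptions: the function names are hypothetical, matching is exact (no stemming), and tokenization is a simple lowercase split on letters and digits. Another route is the `ngram_range` parameter of CountVectorizer and TfidfVectorizer, which adds word n-grams to the vocabulary.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")

def build_positional_index(docs):
    """Map token -> {doc id -> list of positions of that token}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, token in enumerate(TOKEN_RE.findall(text.lower())):
            index[token][doc_id].append(pos)
    return index

def phrase_search(phrase, index):
    """Return ids of documents containing the words of `phrase` contiguously."""
    words = TOKEN_RE.findall(phrase.lower())
    if not words:
        return []
    hits = []
    # Start from every occurrence of the first word and check that each
    # following word sits exactly one position further along.
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(
                start + offset in index.get(word, {}).get(doc_id, ())
                for offset, word in enumerate(words[1:], start=1)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)

docs = [
    "I moved to New York last year.",
    "York is an old city; new buildings are rare.",
    "New York, New York!",
]
index = build_positional_index(docs)
print(phrase_search("New York", index))  # [0, 2]
print(phrase_search("york new", index))  # [2]
```

Document 1 contains both words but not next to each other, so it is not returned for "New York".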

miglamigla commented 2 years ago

I tried to implement a wildcard search, but only for the tf-idf search option. At the moment it matches only the beginning of a word. However, if the query is itself a complete word, that word is not found. E.g. if I search for "asp", it returns various words that start with it (aspects, aspirations, aspen, asperger, etc.), but searching for "aspen" returns nothing. I also left the comments I made for myself in the code; if they get in the way, feel free to delete them. It also looks possible to match more than just the beginning of a word: if we take the caret out of the search function in line 117, it would look for the input anywhere in the word.

I'm still not sure if it's the right way to go about wildcard searches. Is it something you would like to pursue or should we try some other options?
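
One possible direction, sketched here under assumptions rather than taken from the project code: translate the user's pattern into a regular expression in which `*` means "zero or more characters", and match it against the whole vocabulary term. Because `*` may match nothing, a query like `aspen*` still finds `aspen` itself. The function names and the example vocabulary are hypothetical; in the tf-idf engine the vocabulary could plausibly come from the fitted vectorizer's `get_feature_names_out()`.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a pattern such as 'h*ing' into a compiled regular expression.

    Literal pieces are escaped so that characters like '.' in the query
    are not interpreted as regex syntax; each '*' becomes '.*'.
    """
    parts = (re.escape(piece) for piece in pattern.lower().split("*"))
    return re.compile(".*".join(parts))

def expand_wildcard(pattern, vocabulary):
    """Return all vocabulary terms that match the wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    # fullmatch anchors the pattern at both ends of the term, so 'asp*'
    # means "starts with asp" and '*asp*' means "contains asp".
    return sorted(term for term in vocabulary if regex.fullmatch(term))

vocabulary = [
    "aspects", "aspirations", "aspen", "asperger",
    "house", "housing", "hearing", "grasp",
]
print(expand_wildcard("asp*", vocabulary))
# ['aspects', 'aspen', 'asperger', 'aspirations']
print(expand_wildcard("aspen*", vocabulary))  # ['aspen']
print(expand_wildcard("*ing", vocabulary))    # ['hearing', 'housing']
print(expand_wildcard("h*ing", vocabulary))   # ['hearing', 'housing']
```

The matching terms can then be looked up in the tf-idf matrix as if the user had typed each of them. A leading `*` plays the same role as removing the caret: the match is no longer tied to the start of the word.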

miglamigla commented 2 years ago

Some version of a wildcard search is working, so closing the issue.