ArchitecturalKnowledgeAnalysis / EmailDatasetBrowser

Application for interacting with datasets produced by the EmailIndexer.
MIT License

Question about Lucene queries #1

Closed by wmeijer221 2 years ago

wmeijer221 commented 2 years ago

Hey @andrewlalis,

I started exploring Lucene queries, but I'm struggling with the exclusion operator. In the image you can see that I'm trying to exclude mails with "VOTE" in their subject, yet they're not excluded at all. Swapping it out for the "NOT" operator doesn't work either.

Is there something I'm missing, or something implementation-specific that I'm not taking into account?

Thanks in advance!

image

wmeijer221 commented 2 years ago

I'm running version 1.4.4 on Ubuntu 22.04 LTS btw, if that matters.

andrewlalis commented 2 years ago

I'm at work at the moment, so I can't fully debug the situation, but I can give some insight:

The exact place where your query is utilized for searching can be found here.

The query parser (which may be the pivotal component here) is using the parse(String query, ...) method as defined here.

Now, it may be the way in which the query parser is constructed, but I am not sure, since for the most part I simply copied the indexing and search strategy from a previous bachelor project that this one builds on. If you can send me the .zip of your dataset, some example queries, and a description of the expected search results, then I can investigate further and try to fix the issue this evening.

wmeijer221 commented 2 years ago

I uploaded the dataset and queries here: link. You'll need your RuG account to access it. Mohamed shared this dataset with me; it's the same one you shared with him (iteration 3, I believe, though I'm not sure since I renamed it). None of the queries I added should return any mails with "VOTE" in them, yet they do.

andrewlalis commented 2 years ago

@wmeijer221 I think I have found and fixed the issue. You can try it out with version 1.4.5 of the browser app.

Just a side note: I wasn't able to access the Drive link you sent, even from my a.lalis@student.rug.nl account. Anyway, when I used my iteration-3 dataset and the query issue -subject:"VOTE", I didn't see any results whose subject contains the VOTE string, and using +subject:"VOTE" gives the inverse results, as expected.
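
For anyone reading along, here is a rough sketch of what the - and + prefixes are expected to do in the queries above. It is a toy model in Python, not how Lucene or the browser app actually works: the sample mails, the `search` helper, and the substring matching are all made up for illustration (real Lucene matches analyzed index terms, not substrings).

```python
# Toy model only: hypothetical data and helper, not the app's real search code.
emails = [
    {"subject": "[VOTE] Release 1.2", "body": "please vote on this issue"},
    {"subject": "Build failure", "body": "tracking issue for the build"},
    {"subject": "Re: [VOTE] Release 1.2", "body": "+1, no issue from me"},
    {"subject": "Weekly notes", "body": "nothing to report"},
]


def search(emails, body_term, subject_term, mode):
    """Return subjects of mails whose body contains body_term.

    mode "-" prohibits subject_term in the subject, mode "+" requires it.
    """
    hits = []
    for mail in emails:
        if body_term.lower() not in mail["body"].lower():
            continue  # the plain term ("issue") has to match first
        has_term = subject_term.lower() in mail["subject"].lower()
        if (mode == "+") == has_term:
            hits.append(mail["subject"])
    return hits


# Roughly: issue -subject:"VOTE"
excluded = search(emails, "issue", "VOTE", "-")
# Roughly: issue +subject:"VOTE"
required = search(emails, "issue", "VOTE", "+")
```

The two result lists never overlap, and together they cover every mail that matches the plain term, which is the "inverse results" behaviour described above.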

For documentation's sake, this change seems to have fixed it; apparently there's some nuance in the different Field classes.

Please let me know if there are still issues, and if not, you can go ahead and close this issue.

andrewlalis commented 2 years ago

Actually I just noticed and merged your PR for improved HTML detection, so make that version 1.4.6.

wmeijer221 commented 2 years ago

Oeh, my bad for Drive. This one should work: link.

I don't think it's completely resolved yet. In the image you can see I've used a query that should exclude VOTE; however, the mail I selected still has VOTE in it. As with the queries I added to the Drive example, I tried wildcards etc., but that doesn't seem to change anything.

image

andrewlalis commented 2 years ago

Ah, I forgot to mention, you need to rebuild your dataset indexes using the new browser version. Open your dataset in the browser app, then go to File > Regenerate Indexes. This is necessary because the underlying issue was caused by an error I made when indexing the subject field (I think that in old versions I did not include this in the index at all).
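
To illustrate why the rebuild matters, here is a small sketch under the assumption stated above, namely that old indexes did not contain the subject field at all. It uses a toy inverted index in Python, not Lucene; `build_index`, `search`, and the sample mails are hypothetical. An exclusion clause can only remove documents that the index knows about for that field, so against an old index it silently removes nothing.

```python
from collections import defaultdict


def build_index(emails, indexed_fields):
    """Map (field, term) -> set of document ids, for the indexed fields only."""
    index = defaultdict(set)
    for doc_id, mail in enumerate(emails):
        for field in indexed_fields:
            # Crude tokenizer: lowercase, treat brackets as spaces, split on whitespace.
            text = mail[field].lower().replace("[", " ").replace("]", " ")
            for term in text.split():
                index[(field, term)].add(doc_id)
    return index


def search(index, body_term, excluded_subject_term):
    """Ids of docs whose body has body_term, minus docs whose subject has the excluded term."""
    matches = index.get(("body", body_term.lower()), set())
    prohibited = index.get(("subject", excluded_subject_term.lower()), set())
    return sorted(matches - prohibited)


emails = [
    {"subject": "[VOTE] Release 1.2", "body": "please vote on this issue"},
    {"subject": "Build failure", "body": "tracking issue for the build"},
]

# Old-style index (assumed): the subject field was never indexed.
old_index = build_index(emails, indexed_fields=["body"])
# Regenerated index: subject terms are now searchable.
new_index = build_index(emails, indexed_fields=["body", "subject"])

stale = search(old_index, "issue", "VOTE")  # exclusion has nothing to match
fresh = search(new_index, "issue", "VOTE")  # the VOTE mail is excluded
```

With the old index the VOTE mail stays in the results even though the query is fine, which matches what the screenshot showed before regenerating.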

wmeijer221 commented 2 years ago

Solved! Thanks Andrew!