Incomplete fulltext search results

protyposis commented 2 years ago

JabRef version

5.5 (latest release)

Operating system

Windows

Details on version and operating system

Windows 10 21H2

Checked with the latest development build

[X] I made a backup of my libraries before testing the latest development version.
[X] I have tested the latest development version and the problem persists

Steps to reproduce the behaviour

When using the fulltext search with a simple single-keyword query, e.g. test, I only get partial results: a subset of the expected entries containing the text test is missing from the search results. When I open JabRef's Lucene index in Luke and execute the same query (content:test), it returns all related entries, including those that are missing from JabRef's search results.

The library in which I experience this has 400 entries. When I create a new library and add only one of the missing entries, the fulltext search returns it as expected. When I delete large portions (e.g. 350 entries) from my 400-entry library, that missing search result also starts to appear - this does not seem related to deleting a specific (potentially problematic) entry, as it starts to appear after different random selections of entries are removed. There's also no specific threshold library size that triggers this behavior - I was able to make the result appear after cutting the library randomly down to ~40 - 70 entries.

Appendix

No response

protyposis commented 2 years ago

Whether the fulltext search returns an entry depends on the query. For example, for an entry with a PDF file containing the words "test1" and "test2", it can happen that the result is displayed for query "test1" but not for "test2". A direct search within the Lucene index works in both cases.

ThiloteE commented 2 years ago

Related or maybe duplicate of: https://github.com/JabRef/jabref/issues/8428

JabRef is currently in the process of implementing a better search (see https://github.com/JabRef/jabref/pull/8356) and migrating the search syntax to Lucene (see https://github.com/JabRef/jabref/pull/8206).

ThiloteE commented 2 years ago

It will take a while, as this seems to be one of the bigger projects around.

protyposis commented 2 years ago

Yes, it seems related. I have seen that issue earlier but was under the impression that it is specifically about stopword elimination and/or stemming. In my case it happens with terms (e.g. cluster) that don't seem to be affected by the query analyzer. I validated this by configuring the analyzer defined at https://github.com/JabRef/jabref/blob/6dddf93af24328f67b3c548a2b7c43b3da327dab/src/main/java/org/jabref/model/pdf/search/EnglishStemAnalyzer.java#L17-L22 in Luke.

Is there a way to see the actual query issued to Lucene, e.g. is it logged somewhere?

Siedlerchr commented 2 years ago

@protyposis Hi, the search query and results would be visible here https://github.com/JabRef/jabref/blob/7d4916ead08e340c65dd956286ae22c44ea8cc48/src/main/java/org/jabref/logic/pdf/search/retrieval/PdfSearcher.java#L69

We are still using Lucene 8 or so, I think. You can debug it or add a logger statement there (e.g. Logger.warn, to make sure that the log is printed).

protyposis commented 2 years ago

Thank you, that helped me figure out the problem. A search string of e.g. test results in the parsed Lucene query path:test content:test pageNumber:test modified:test annotations:test at https://github.com/JabRef/jabref/blob/7d4916ead08e340c65dd956286ae22c44ea8cc48/src/main/java/org/jabref/logic/pdf/search/retrieval/PdfSearcher.java#L69-L70

The problem here is maxHits, which is hardcoded to 5 in the search rules, e.g. at https://github.com/JabRef/jabref/blob/7d4916ead08e340c65dd956286ae22c44ea8cc48/src/main/java/org/jabref/model/search/rules/ContainBasedSearchRule.java#L97

I haven't worked with Lucene in a long while, but it seems to me that the limit applies to each field separately, so the parsed query from above can yield 25 entries at most. Usual text queries don't match the pageNumber or modified fields, yielding 15 results max, which I can also confirm from my testing.
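The arithmetic above can be checked with a small model. This is only a sketch of the assumed behaviour (a cap of 5 hits applied to each field separately, with the per-field results merged afterwards); it does not call Lucene or JabRef, and the function name, entry names, and match counts are made up for illustration.

```python
from typing import Dict, List, Set

MAX_HITS = 5  # the hardcoded cap reported in this thread
FIELDS = ["path", "content", "pageNumber", "modified", "annotations"]


def capped_search(matches_per_field: Dict[str, List[str]],
                  max_hits: int = MAX_HITS) -> Set[str]:
    """Merge the first `max_hits` matches of every field (assumed behaviour)."""
    results: Set[str] = set()
    for field in FIELDS:
        results.update(matches_per_field.get(field, [])[:max_hits])
    return results


# Hypothetical library: the term matches many entries, but only in three fields.
matches = {
    "path": [f"p{i}" for i in range(8)],
    "content": [f"c{i}" for i in range(40)],
    "annotations": [f"a{i}" for i in range(6)],
}

visible = capped_search(matches)
total = sum(len(m) for m in matches.values())
upper_bound = MAX_HITS * len(FIELDS)

print(f"{len(visible)} of {total} matching entries shown (upper bound {upper_bound})")
```

Under these assumptions only 15 of the 54 matching entries become visible, which matches the ceiling observed in testing.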

Now the question is, is this limitation on purpose?

This certainly prevents me from using JabRef for my use case: finding all relevant entries, out of all entries or a subgroup of them, that contain e.g. a specific keyword. Or, more generically: doing fulltext-based literature research within a library. Currently the search can only answer whether there is any relevant entry at all.

Siedlerchr commented 2 years ago

Thanks for the investigation! @btut Can you clarify here? Was there a reason for the 5?

btut commented 2 years ago

> Now the question is, is this limitation on purpose?

Yes it is and we had quite some discussion when implementing it. The problem here is twofold:

1. We do not sort search-results by the Lucene score, because for the metadata-search there is no Lucene score. This means that for short queries that match a lot of entries, the fulltext-search would be good for nothing because one could not tell where the best hit is.
2. It is also difficult to weight the importance of the metadata-fields and the fulltext results. In my opinion the metadata-results are more important. When allowing all fulltext-search results, the metadata-search results would be flooded by not-very-good fulltext-results.

I think both these issues can be solved by switching to Lucene for all searches. Metadata-results can be weighted using Lucene as they would be using the same queries, and we can use the overall Lucene score to sort the entry table. (My wish would then be to also change the display of the fulltext-search results and show them directly in the table instead of the tab in the entry editor.)
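To make the weighting idea concrete, here is a minimal sketch of sorting entries by one combined score. It is not JabRef code and does not use Lucene; the Hit class, the scores, and the weights are hypothetical. In Lucene itself, a comparable effect could presumably be reached with per-field boosts in the query (the classic query syntax supports the form field:term^2).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    entry: str
    metadata_score: float  # hypothetical relevance from metadata fields
    fulltext_score: float  # hypothetical relevance from the linked PDF text


def rank(hits: List[Hit],
         metadata_weight: float = 2.0,
         fulltext_weight: float = 1.0) -> List[Hit]:
    """Sort hits by a weighted sum so metadata matches count more."""
    def combined(hit: Hit) -> float:
        return (metadata_weight * hit.metadata_score
                + fulltext_weight * hit.fulltext_score)
    return sorted(hits, key=combined, reverse=True)


hits = [
    Hit("A", metadata_score=0.0, fulltext_score=0.9),  # fulltext only
    Hit("B", metadata_score=0.6, fulltext_score=0.0),  # metadata only
    Hit("C", metadata_score=0.3, fulltext_score=0.5),  # both
]

ranked = [hit.entry for hit in rank(hits)]
print(ranked)
```

With a single score per entry, no hit has to be cut off: weak fulltext matches simply sink to the bottom of the table instead of flooding the metadata matches.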

ThiloteE commented 2 years ago

We have now seen two quite different use cases for the full-text search. The middle way that serves both only works with Lucene, and current JabRef caters to just one of them, so how about the following proposal:

Short term workaround for the people that don't care about weighting:

1. Open a pull request that shows all fulltext-search results and removes max hits.
2. Do NOT merge this pull request.
3. People can download it from https://builds.jabref.org/pull/ and use this version of JabRef until Lucene is implemented.
4. Close and remove this pull request when Lucene is implemented.

Would this work or is my idea too simplistic?

btut commented 2 years ago

Would be quite an easy solution, question is if people desiring the 'no max hits' functionality would find the branch.

protyposis commented 2 years ago

The proposal by @btut looks like a proper solution, and @ThiloteE's short-term workaround would also be very useful. Maybe the workaround could be integrated into the UI through a ternary fulltext search toggle button (off/fulltext weighted/fulltext exhaustive) or an "unweighted exhaustive" checkbox in an "experimental" preferences section.

Btw., I didn't even know there are weighted results. The library view that I have been using sorts the search results by the configured sorting of that view (e.g. by year).

btut commented 2 years ago

Hm, a setting would be the cleanest solution, but the preferences panes are kind of overloaded already.

> I didn't even know there are weighted results.

It's not visible to the user. Lucene provides a score for each result and once all JabRef search is switched to lucene it would make sense to sort entries accordingly when a search is active to have the best hit at the top.

protyposis commented 2 years ago

Oh, my bad. I mistakenly understood that there already is a view where the search results are sorted by relevance when the fulltext search is used.

Anyway, as long as the search results are displayed in the default library view, I guess it makes sense from a usability perspective to keep the user-defined sort order and let the search function only act as a "drill down" into the library (e.g. when I click the header of the year column to sort by year, I assume that the subset of entries from the search results is still sorted by year - if the UI stays the same and the sorting marker is displayed on the year column).
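The "drill down" behaviour described above can be sketched as a filter applied on top of the user's sort order. The entries, field names, and function are hypothetical and only illustrate the idea; this is not how JabRef's table is implemented.

```python
from typing import Dict, List

# Hypothetical library entries with a year column and some indexed text.
entries: List[Dict] = [
    {"key": "smith2019", "year": 2019, "text": "cluster analysis of logs"},
    {"key": "lee2015", "year": 2015, "text": "a survey of indexing"},
    {"key": "kim2021", "year": 2021, "text": "cluster scheduling"},
    {"key": "ono2012", "year": 2012, "text": "cluster computing basics"},
]


def drill_down(items: List[Dict], term: str, sort_key: str) -> List[Dict]:
    """Keep the user-defined sort order and only hide non-matching entries."""
    ordered = sorted(items, key=lambda entry: entry[sort_key])
    return [entry for entry in ordered if term in entry["text"]]


result = [entry["key"] for entry in drill_down(entries, "cluster", "year")]
print(result)
```

The search only decides which rows are visible; the order of the remaining rows still follows the column the user sorted by.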

bilderbuchi commented 1 year ago

I also have problems with the fulltext search not finding search terms in PDFs in my 780-entry library. I'm not sure I'm hitting this 5-item limit, because I get <5 overall hits.

How can I determine the state of the index (status, how many items indexed, etc)? I'm suspicious because:

- While my library directory is nearly 2 GB (>600 files), the lucene94 index folder is only around 500 kB.
- On reindexing, the log just says "Rebuilding fulltext search index...", reports the index location twice, but does not report being finished or success status.
- Also, rebuilding the index triggers CPU load for only about 3 seconds.

Combined with the fact that selectable text in a PDF (a name, not a stop word) generates no hits, this makes me suspect that the index is somehow incomplete or broken. I possibly have some files with overly long paths (~250 chars) on Windows, so those cannot be opened. Could that trip up the indexer silently? Is there debug logging or some such that can be enabled?
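One way to check the long-path suspicion independently of JabRef is to list the files whose absolute path reaches the length in question. This is a rough sketch: the helper name is made up, the default of 250 characters is taken from the estimate above (the classic Windows limit is 260), and it says nothing about whether the indexer actually skips such files.

```python
import os
from typing import List


def find_long_paths(root: str, limit: int = 250) -> List[str]:
    """Return absolute paths under `root` that are at least `limit` characters long."""
    long_paths: List[str] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.abspath(os.path.join(dirpath, name))
            if len(full) >= limit:
                long_paths.append(full)
    return sorted(long_paths)


if __name__ == "__main__":
    # Replace "." with the directory that holds the linked PDF files.
    for path in find_long_paths("."):
        print(len(path), path)
```

If the listed files are exactly the ones whose text cannot be found, that would point to the path length rather than to the search itself.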

narasimhareddyputta94 commented 3 months ago

@protyposis @ThiloteE @Siedlerchr @btut

Hi everyone,

I am interested in working on this issue to improve the full-text search results in JabRef. Based on the discussion, it seems the hardcoded maxHits parameter significantly limits search results. My plan is to increase the maxHits value and make it configurable through the preferences.

Before I start, I would appreciate any additional guidance or considerations to keep in mind. Specifically:

Are there any existing design decisions or constraints I should be aware of when modifying the maxHits parameter?
Is there a preferred way to integrate this configuration into the preferences?

Thank you, and I look forward to contributing to this improvement.

Best regards, [narasimhareddyputta94]

Siedlerchr commented 3 months ago

@narasimhareddyputta94 Thanks for your interest. However, @LoayGhreeb is currently working on improving the Lucene search, so it would make more sense for you to pick another issue.
