eXist-db / exist

eXist Native XML Database and Application Platform
https://exist-db.org
GNU Lesser General Public License v2.1

ft:search() returns all index fields (even those that don't match the search terms) #991

Open rvdb opened 8 years ago

rvdb commented 8 years ago

When querying a custom full-text Lucene index with the ft:search() function, all fields of that index are returned, regardless of whether they match the search term. This was tested with eXist-develop, revision a8f2b0a, on 64-bit Windows 7 Pro with Oracle JDK 1.8.0_92.

Take, for example, an index created as follows:

ft:index('/db/apps/test.txt',
  <doc>
    <field name="title" store="yes">Indexing</field>
    <field name="para" store="yes">This is the first paragraph.</field>
    <field name="para" store="yes">And a second paragraph.</field>
  </doc>) 

When the "para" field of this index is queried for the search term "second" with the following query:

ft:search('/db/apps/test.txt', 'para:second')

...the following result is returned:

<results>
  <search uri="/db/apps/test.txt" score="4.8365855">
    <field name="para">This is the first paragraph.</field>
    <field name="para">And a <exist:match>second</exist:match> paragraph.</field>
  </search>
</results>

This demonstrates that:

  1. Only results from the correct index field ("para") are returned. (correct)
  2. Yet, all fields of the "para" type are returned, instead of only those matching the search term "second". (incorrect)

More formally, I would expect this search result:

<results>
  <search uri="/db/apps/test.txt" score="4.8365855">
    <field name="para">And a <exist:match>second</exist:match> paragraph.</field>
  </search>
</results>

This looks buggy to me. Attached is a self-contained XQuery file that creates an index, queries it, and destroys the index again. ft-search-test.txt

jensopetersen commented 8 years ago

If I am not mistaken, this is the behaviour that we should expect; see Content Extraction and Binary Resource Indexing. The function is supposed to retrieve a document and supply it with matches, and one is asked to post-process the result to get the feature you desire. Of course, one could wish for a different behaviour, but I would not reckon this a bug.

rvdb commented 8 years ago

Well, perhaps the problem is that it's somewhat unclear what to expect. The example query in the documentation (ft:search("/db/apps/demo/test.txt", "para:paragraph and title:indexing")) doesn't help much, since it generates matches in all index fields and hence returns the complete document. If, as you say, ft:search() returns the complete document (as the function documentation could also suggest: "All documents that are match by the query"), I'd expect this result for my query ft:search("/db/apps/demo/test.txt", "para:second"):

<results>
  <search uri="/db/apps/test.txt" score="4.8365855">
    <field name="title" store="yes">Indexing</field>
    <field name="para" store="yes">This is the first paragraph.</field>
    <field name="para">And a <exist:match>second</exist:match> paragraph.</field>
  </search>
</results>

i.e. the entire document (including the "title" field), with only exist:match elements around matching search terms. Instead, the "title" field is omitted since it wasn't queried, but both "para" fields are returned, even if they don't contain text matches.

Yet, the prose documentation states that "Within the search element, every field which contributed to the query result is returned".

Hence, unless I misunderstand what is meant with 'every field which contributed to the query result', the behaviour of ft:search() doesn't seem consistent with either:

Note: I noticed this inconsistency because the content extraction demo seems to expect that only fields with text matches are returned. Instead, the entire indexed documents are returned, producing many false hits. The post-processing you referred to would require keeping only those fields with embedded <exist:match> elements: $fields := $result/field[.//exist:match].
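A minimal sketch of that post-processing step, for illustration only. It runs against the literal result quoted earlier in this thread rather than a live ft:search() call, so it needs no index. The variable name $result is chosen here for the example, and the exist prefix is bound to the namespace eXist-db uses for match markup.

```xquery
xquery version "3.1";

declare namespace exist = "http://exist.sourceforge.net/NS/exist";

(: Stand-in for the value returned by ft:search('/db/apps/test.txt', 'para:second') :)
let $result :=
  <results>
    <search uri="/db/apps/test.txt" score="4.8365855">
      <field name="para">This is the first paragraph.</field>
      <field name="para">And a <exist:match>second</exist:match> paragraph.</field>
    </search>
  </results>
return
  <results>{
    for $search in $result/search
    return
      <search>{
        (: keep the uri and score attributes :)
        $search/@*,
        (: keep only fields that contain at least one match :)
        $search/field[.//exist:match]
      }</search>
  }</results>
```

With the sample input above, only the second "para" field survives the filter. Note that the fields sit under the search element, so the path has to step through it when $result is bound to the results element.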

jensopetersen commented 8 years ago

Thank you for this clear analysis. The use case targeted by content extraction was the indexing of PDFs and EXIF data in images. We wanted to be able to display the hit context and retrieve the document (in the case of PDFs), displaying the page with the hit. With post-processing of the matches, this was possible. That the result contained too many fields (fields of a type that had hits, but which themselves had no hits) or too few fields (fields which had no hits) did not influence this (except that the first could be considered wasteful of resources): we could display the hit context in the hit list and retrieve the document (I don't think we actually managed to scroll down to the page with the first hit when calling up the PDF). What would you prefer to be the case? To return the complete document (I agree any existing title is crucial) or to return just matching fields?

rvdb commented 8 years ago

Ok, thanks for pointing out the wider scope. Personally, I would find it least confusing if ft:search() just returned the matching fields, since that is what you'd expect from a search. (IMO the current behaviour is as confusing as if ft:query() returned entire documents, just with <exist:match> elements around matching text.) Apologies if I still don't get the full implications of your design concerns; I'm approaching this issue from the "content extraction / binary resource indexing" demo in the demo app. There, during the indexing step (cex-trigger.xql), the page numbers are derived from the extracted HTML content and included in the indexed fields. Later, the search script (cex.xql) retrieves these page numbers from the fields returned by ft:search(), which IMO makes it possible to link to specific pages in the PDF document with which this full-text index was associated. Since "title" fields aren't returned for searches on the "page" field, cex.xql looks them up for each search result with the ft:get-field() function.

In other words, in the context of the "content extraction / binary resource indexing" demo, I don't see the need for returning more than the matching fields: all other information for locating the match in the associated PDF resource is present. Of course, I realize this context might be too limited; if it's indeed desirable to include more than the matching fields in ft:search() search results, I'd think that returning the entire index document, with <exist:match> flags around text matches would be less confusing than returning over-generalized (all matching and non-matching fields of a certain type), yet partial (only fields of a certain type, and no others) results. If that is documented properly, users will at least know that the search results should be post-processed to filter out unwanted index fields.

Concerning your remark:

(I don't think we actually managed to scroll down to the page with the first hit when calling up the PDF).

If you're looking for a way to link to a precise page in a PDF, that seems possible: https://helpx.adobe.com/acrobat/kb/link-html-pdf-page-acrobat.html.
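As a rough sketch of how such a link could be built from a page number taken from the index: the function name local:pdf-page-link and the sample path are hypothetical, and whether the "#page=" fragment is honoured depends on the browser or PDF viewer that opens the link.

```xquery
xquery version "3.1";

(: Hypothetical helper: append a page-number open parameter to a PDF URI :)
declare function local:pdf-page-link(
  $pdf-uri as xs:string,
  $page as xs:integer
) as xs:string {
  $pdf-uri || "#page=" || $page
};

(: Example call with an assumed document path and page number :)
local:pdf-page-link("/db/apps/demo/sample.pdf", 12)
```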

I'll attach a modified version of the cex.xql script (with .txt extension to keep Github happy), which:

duncdrum commented 5 years ago

There seem to be three distinct questions here:

  1. the question of improving the documentation examples for ft:search()
  2. the question of the demo app
  3. the question of the potentially problematic implementation of ft:search

@eXist-db/core can we get a quick take on 3 so we can create corresponding tickets for demo and docs, thx.

adamretter commented 5 years ago

I think only @wolfgangmm can comment on (3)