aryn-ai / remote-processor-service

Service for hosting remote processors
Apache License 2.0

Document properties aren't present unless query requests them. #4

Open alexaryn opened 8 months ago

alexaryn commented 8 months ago

For near-duplicate detection (NDD), the dedup processor needs the array of shingles attached to each document to do its work. In reality, the shingles are an implementation detail, and users of NDD should not need to know that they exist. Nevertheless, if the _source part of the query doesn't list shingles, the whole feature breaks down.

I'm not sure this is exactly a bug, but it would be great if search processors could see "everything" about each document in order to do their work. For instance, it might be useful to have modification date, too, as one possible way of deciding which near-duplicates remain in the results.

HenryL27 commented 8 months ago

I don't think OpenSearch records are timestamped by default... but, weird. Do you have an example of how you loaded the index so I can repro?

HenryL27 commented 8 months ago

This is an issue with all search response processors: "_source" is computed in the fetch phase, which occurs strictly before the response pipeline executes. There's not really a good way to 'request a document field' short of listing it in _source (or omitting _source entirely, which effectively does a select *).
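
One possible client-side workaround, as a rough sketch: patch the query body before sending it so that the _source filter always admits the fields the processor needs. The helper name `ensure_source_fields` is hypothetical and not part of this repo or of OpenSearch; the field name `shingles` is taken from the description above. It only compares exact field names, so wildcard patterns in includes/excludes are not understood, and a query that sets `_source` to false will get the required fields back anyway.

```python
from copy import deepcopy


def _as_list(value):
    """Normalize a _source value (None, a string, or a sequence) to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def ensure_source_fields(query_body, required):
    """Hypothetical helper: return a copy of query_body whose _source
    filter also lets the `required` fields through.

    Handles the usual _source shapes: absent/True, False, a single
    string, a list of strings, or an {"includes", "excludes"} object.
    Exact-name matching only; wildcard patterns are left untouched.
    """
    body = deepcopy(query_body)
    src = body.get("_source", True)
    required = list(required)

    if src is True:
        # No filter at all: every field is already returned.
        return body
    if src is False:
        # Caller asked for no source; return only what the processor needs.
        body["_source"] = required
        return body
    if isinstance(src, dict):
        includes = _as_list(src.get("includes"))
        excludes = _as_list(src.get("excludes"))
        new_src = {}
        if includes:
            new_src["includes"] = includes + [f for f in required if f not in includes]
        kept = [f for f in excludes if f not in required]
        if kept:
            new_src["excludes"] = kept
        # An empty filter object is equivalent to returning everything.
        body["_source"] = new_src if new_src else True
        return body

    # A single field name or a list of field names.
    fields = _as_list(src)
    body["_source"] = fields + [f for f in required if f not in fields]
    return body


query = {"query": {"match_all": {}}, "_source": ["title"]}
patched = ensure_source_fields(query, ["shingles"])
print(patched["_source"])
```

This does not solve the underlying problem (the user-facing query still ends up carrying the implementation detail), but it keeps that detail out of the code that builds the query.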