enonic / xp

Enonic XP
https://enonic.com
GNU General Public License v3.0

Stemming only works on allText in content layer #8876

Open ComLock opened 3 years ago

ComLock commented 3 years ago

I have tried https://github.com/ComLock/app-stemming-example on XP 7.6.1 and stemming doesn't work.

Current docs I can find: https://developer.enonic.com/docs/xp/stable/storage/indexing#stemmed https://developer.enonic.com/docs/xp/stable/storage/noql#stemmed

Future doc supposed to work: https://developer.enonic.com/docs/xp/next/storage/indexing#languages

https://github.com/ComLock/app-stemming-example/blob/master/src/main/resources/main.es#L13-L18

fulltext: true, // Needed for stemming?
includeInAllText: true, // Needed for stemming?
languages: ['no'],
stemmed: true // Not reflected in the node, nor in the documentation, so some core dev must have told me about this?

https://github.com/ComLock/app-stemming-example/blob/master/src/main/resources/main.es#L114

query: stemmed('_allText', '${word}', 'OR', 'no')

When using Data Toolbox, I can see under Display Search Index Document that there is no _alltext._stemmed_no, nor any other fieldName._stemmed_no.

Also, when exporting and inspecting node.xml, the XML doesn't look anything like the _indexConfig. So the export format is vastly different from the node JSON. Why???

When making a normal content site and setting the language to "no", there is an _alltext._stemmed_no. But when looking at the node JSON, there are no languages set under _indexConfig!

This is what I think makes stemming work for content in the exported node.xml:

<allTextIndexConfig>
    <languages>
        <language>no</language>
    </languages>
</allTextIndexConfig>

vbradnitski commented 3 years ago
rymsha commented 3 years ago

Documentation issue created https://github.com/enonic/doc-xp/issues/307

ComLock commented 3 years ago

@sigdestad Can you have a look at this? If I understand correctly, nothing has changed and stemming is still impossible on the node layer?

sigdestad commented 3 years ago

@rymsha something is obviously not working as expected. Could we arrange a meeting to clarify things related to stemming for med and CWE?

alansemenov commented 3 years ago

@ComLock from what I see here, you are still trying to stem the _name field. This will not work, as Slava described in the comment above (and this is what he fixed in the docs too).

If you have an example where stemming doesn't work for a field that is supposed to work, please commit this to your app's repo and we will look at the code.

ComLock commented 3 years ago

I have now updated the example, and I can see there is a stemmed index, but still no hits: https://github.com/ComLock/app-stemming-example/blob/master/src/main/resources/main.es#L47

"property._stemmed_no": [
  "havnedistriktene"
],

ComLock commented 3 years ago

Are there automatic regression tests for stemming on the node layer somewhere? It would be nice to look at some working example code (since my example code has flaws in it).

ComLock commented 3 years ago

"custom index cannot be created for the _name field, it's always fulltext"

What does that mean? Can't a field have both a fulltext and a stemmed index? So do I have to set fulltext to false in indexConfig in order for stemming to work? Nah, it works with fulltext: true.

ComLock commented 3 years ago

Got it working, looking into why. Maybe connection.refresh();
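
If the missing hits were caused by querying before the search index had been refreshed, the order of operations would matter. A minimal sketch of that order, assuming a connection object that behaves like a lib-node repo connection with create, refresh and query methods; the helper name createRefreshQuery is hypothetical and the connection is passed in, so nothing here depends on XP itself:

```javascript
// Sketch only: create a node, refresh the index, then query.
// `connection` is assumed to expose create(params), refresh() and
// query(params), as a lib-node repo connection does.
function createRefreshQuery(connection, nodeParams, queryString) {
  // Store the node, including whatever _indexConfig the caller supplies.
  const node = connection.create(nodeParams);

  // A freshly created node may not be visible to queries until the
  // search index has been refreshed.
  connection.refresh();

  // Run the NoQL query, e.g. stemmed('property', 'word', 'OR', 'no').
  const result = connection.query({ query: queryString });

  return { node, result };
}
```

Whether the refresh is what actually fixed the example is not confirmed in this thread; the sketch only shows where such a call would go.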

ComLock commented 3 years ago

So the stemmed('_allText', ...) function will return nothing for nodes.

@sigdestad If I understand correctly even though I say includeInAllText: true on some field, there will be no _alltext._stemmed_no. I can live with that in explorer. But might be something we want in the future?

sigdestad commented 3 years ago

So, includeInAllText just indicates whether a property should be included when creating the _allText "virtual field". What we need is for _allText to get stemmed.

ComLock commented 3 years ago

Something like: if any field's indexConfig (including the default) has both includeInAllText: true and at least one language in languages: [], then an _alltext._stemmed_LANG needs to be created for each LANG.
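
The proposed rule can be sketched as a small function. This only illustrates the suggestion and does not describe how XP behaves today; the function name and the assumed shape of the index config (a default object plus a configs array of { path, config } entries) are assumptions made for the example:

```javascript
// Sketch of the proposed rule, not existing XP behaviour.
// Returns the _alltext._stemmed_LANG fields that would be created
// for a given index config under the suggestion above.
function proposedAllTextStemmedFields(indexConfig) {
  const candidates = [];

  // The default config counts too, if it is given as an object.
  if (indexConfig.default && typeof indexConfig.default === 'object') {
    candidates.push(indexConfig.default);
  }
  for (const entry of indexConfig.configs || []) {
    if (entry.config && typeof entry.config === 'object') {
      candidates.push(entry.config);
    }
  }

  // Collect languages only from configs that feed into _allText.
  const languages = new Set();
  for (const config of candidates) {
    if (config.includeInAllText === true) {
      for (const lang of config.languages || []) {
        languages.add(lang);
      }
    }
  }

  return Array.from(languages).sort().map((lang) => '_alltext._stemmed_' + lang);
}
```

A config that sets languages but has includeInAllText: false contributes nothing, since its value never reaches _allText in the first place.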

vbradnitski commented 3 years ago

includeInAllText and languages are separate configs: includeInAllText indicates whether the value of the mapped field(s) will be added to the _allText index, while languages sets the array of stemmed indices for the mapped field(s) only. Adding a stemmed index for the special _allText field was designed as a content-level thing only, based on the content's single language field value.

So if you create a node with node-lib, the only way to add stemmed indices for now is to set the languages property on particular fields.
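
A rough sketch of what that means in practice: put languages on the specific paths and query those paths rather than _allText. The helper names are hypothetical, and the config shape (configs entries with path and config) mirrors the snippets earlier in this thread rather than a verified API reference:

```javascript
// Sketch only: build an _indexConfig fragment that requests stemmed
// indices for specific paths. Helper names are hypothetical.
function stemmedIndexConfig(paths, languages) {
  return {
    configs: paths.map((path) => ({
      path: path,
      config: {
        fulltext: true,
        includeInAllText: true,
        languages: languages.slice() // copy so callers can reuse their array
      }
    }))
  };
}

// Build a NoQL stemmed() expression against one specific field.
// No escaping is done here, so `word` must not contain single quotes.
function stemmedQuery(field, word, language) {
  return "stemmed('" + field + "', '" + word + "', 'OR', '" + language + "')";
}
```

For example, stemmedQuery('property', 'havnedistriktene', 'no') produces the same form of query as the one in the example app, but against the property field instead of _allText.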

sigdestad commented 3 years ago

But doesn't content API use node API directly? How can it do something else than what node api supports?

sigdestad commented 3 years ago

Basically, what I would like to know is how to specify that _allText should be stemmed?

vbradnitski commented 3 years ago

Yes, the content API uses the node API, but the _allText stemming feature is implemented in the inner core-content module, and this functionality hasn't been exposed in either the content or the node JS lib. It was decided to make it a content-specific feature. So there is no way to influence _allText indexing directly for now.

sigdestad commented 3 years ago

Ok, so we basically need to define a proper solution for this in the node API, implement and document it.

There is at least one known "problem" with this in the export/import XML that we also need to look into.

rymsha commented 3 years ago

It is still not exactly clear what to do...

sigdestad commented 3 years ago

this -> There is no way of specifying stemming for _allText via the node API
import/export -> Currently uses an undocumented, obscure format for transferring stemming info (at least for _allText). Need to consider if it is breaking or not...

ComLock commented 9 months ago

I don't care that much about the import/export syntax.

But since Explorer 4 uses _alltext as the default search, it would be nice if stemming of _alltext on the node layer worked from the get-go, without people having to boost or query specific stemmed fields in order for stemming to work...

So I think the label of Documentation is misleading. For me this is rather a Feature Request or similar.