Search: Facets as Words instead of Strings [JIRA CVRE-198]

CENDARI / editorsnotes

Note-taking environment for the CE

GNU Affero General Public License v3.0


Search: Facets as Words instead of Strings [JIRA CVRE-198] #199

Closed: ghost closed this issue 8 years ago

ghost commented 8 years ago

The following Project facets are offered in search when I filter by the "Repository" facet:

europeana (4306), european (4054), library (4054), newspapers (4054), project (4054), the (4054), The European Library & Europeana - Newspapers Project (494), Europeana (221), archives (157), hub (157)

The only valid facets, however, are:

The European Library & Europeana - Newspapers Project (494), Europeana (221)

The other facets should be removed. Facets should not be word-based, because each value is provided as a single multi-word string in the "project" field of the JSON document.

reported by NatasaBulatovic on 2015-12-15T00:28:27.611+0100

ghost commented 8 years ago

The same problem occurs with other facets as well. For example, for "Publisher" the JSON document contains a single value (with Unicode escapes):

"publisher": ["Hlutaf\u00e9lag \u00e1 Siglufir\u00f0i, 1916-1919"],

All of its words are created as separate facets under Publisher.

NatasaBulatovic on 2015-12-15T00:32:14.609+0100
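The difference between word-based and string-based facets reported above can be illustrated with a small sketch. It is not the CENDARI or Elasticsearch code: `simple_analyze` is a hypothetical, simplified stand-in for a word-splitting analyzer, and `facet_counts` is a hypothetical helper that counts facet values per document. Only the two field values come from this thread.

```python
import re
from collections import Counter

def simple_analyze(text):
    """Rough stand-in for a word-splitting analyzer (an assumption, not
    Elasticsearch itself): lowercase, then split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]

def facet_counts(docs, field, analyzed):
    """Count in how many documents each facet value occurs.

    analyzed=True  -> every word of the field becomes its own facet.
    analyzed=False -> the whole string is kept as a single facet.
    """
    counts = Counter()
    for doc in docs:
        terms = set()
        for value in doc.get(field, []):
            terms.update(simple_analyze(value) if analyzed else [value])
        counts.update(terms)  # each distinct term counts once per document
    return counts

# Field values quoted in this issue.
docs = [
    {"publisher": ["Hlutaf\u00e9lag \u00e1 Siglufir\u00f0i, 1916-1919"]},
    {"project": ["The European Library & Europeana - Newspapers Project"]},
]

word_facets = facet_counts(docs, "publisher", analyzed=True)
string_facets = facet_counts(docs, "publisher", analyzed=False)
print(sorted(word_facets))
print(sorted(string_facets))
```

With `analyzed=True` the publisher value breaks into several unrelated facets; with `analyzed=False` it stays one facet, which is the behaviour requested here.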

ghost commented 8 years ago

This comes from a bug in Litef, which writes to Elasticsearch under the wrong table/type. If it wrote to the type called 'document', the entries would be indexed correctly; when it writes to a table/type with no mapping/schema, the entries are analyzed and split by word. Only two types/tables should be used in the cendari index of Elasticsearch: 'document' and 'entity'. Deliverable 9.1 describes only the 'document' type, which is the one Litef should use.

Jean-DanielFekete on 2015-12-22T11:15:44.858+0100
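As a rough illustration of why the type matters, the sketch below builds a mapping in the style used by Elasticsearch versions of that period, where a string field marked `"index": "not_analyzed"` is stored as one term. The real CENDARI mapping is the one in the deliverable and is not reproduced here; the field list, `build_mapping`, and `is_split_by_word` are hypothetical names used only for this example.

```python
import json

# Facet fields mentioned in this thread; the real mapping has more fields.
FACET_FIELDS = ["project", "publisher"]

def build_mapping(type_name, fields):
    """Build a mapping dict that keeps each listed string field as one term."""
    return {
        type_name: {
            "properties": {
                name: {"type": "string", "index": "not_analyzed"}
                for name in fields
            }
        }
    }

def is_split_by_word(mappings, type_name, field):
    """True when the field would fall back to default analysis, i.e. the
    type has no mapping or the field is not marked not_analyzed."""
    properties = mappings.get(type_name, {}).get("properties", {})
    return properties.get(field, {}).get("index") != "not_analyzed"

mappings = build_mapping("document", FACET_FIELDS)
print(json.dumps(mappings, indent=2))

# Writing under an unmapped type such as 'resource' loses the protection.
print(is_split_by_word(mappings, "resource", "publisher"))
print(is_split_by_word(mappings, "document", "publisher"))
```

Under these assumptions, the same JSON document produces word-based facets when stored under an unmapped type and string-based facets when stored under 'document'.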

ghost commented 8 years ago

Sorry, but there is no "document" type for the index in Deliverable 9.1. It says "our top-level index in Elastic search is called CENDARI and contains all the indexed data." Beyond that, it specifies only the fields.

Litef PUTs to the Elasticsearch service at "/cendari/resource/resource-id", as nothing else was specified. To make sure Litef is changed correctly, please confirm that the path should be

/cendari/document/resource-id

Also, I hope a resource-id from the NTE will never be the same as a resource-id from CKAN.

NatasaBulatovic on 2015-12-22T11:32:34.375+0100

ghost commented 8 years ago

On page 14 of the document, the mapping is described, along with the name of the type: "document"

Appendix A: ElasticSearch Mapping

Yes, the elasticsearch path should be: /cendari/document/resource-id

This resource id is the URI to the resource that will be accessed when the user clicks on the link. It can only be duplicated if both Litef and the NTE index the same URI, which does not happen.

Jean-DanielFekete on 2015-12-22T11:39:04.032+0100
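The path convention confirmed above, and the reason two indexers only collide when they index the same URI, can be sketched as follows. This is an assumption-laden illustration: `document_path` is a hypothetical helper, the example URIs are invented, and percent-encoding the URI is a choice made here to keep it a single path segment, not something the thread specifies.

```python
from urllib.parse import quote

INDEX = "cendari"
DOC_TYPE = "document"

def document_path(resource_uri):
    """Build /cendari/document/<resource-id>, using the resource URI as id.

    The URI is percent-encoded (an assumption of this sketch) so that its
    slashes do not add extra path segments.
    """
    return "/{}/{}/{}".format(INDEX, DOC_TYPE, quote(resource_uri, safe=""))

# Hypothetical URIs standing in for resources indexed by Litef and the NTE.
litef_uri = "http://example.org/resource/1"
nte_uri = "http://example.org/notes/1"

store = {}
store[document_path(litef_uri)] = {"source": "Litef"}
store[document_path(nte_uri)] = {"source": "NTE"}

print(document_path(litef_uri))
print(len(store))
```

Different URIs give different paths, so entries from the two sources overwrite each other only if both index the very same URI.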