dainst / ariadne-portal


Rationalise Filters offered on main results page #189

Closed dstudhope closed 7 years ago

dstudhope commented 8 years ago

We should reconsider the filters on the left of the main results page and omit some of them. Some are counterproductive without data cleansing; others are not useful for search anyway. The underlying metadata can still be shown on the single resource page if considered useful.

The situation is made worse by the case-sensitive search on filter selection (e.g. when a native subject is selected as a step in a faceted search), but that is only part of the problem. As it stands, some of the filter options just serve to highlight the lack of data cleansing (see issue 179).

e.g. Native subject search is case sensitive:
http://portal.ariadne-infrastructure.eu/search?nativeSubject=House [13,600 records]
http://portal.ariadne-infrastructure.eu/search?nativeSubject=HOUSE [72,585 records]

Keyword search is case sensitive:
http://portal.ariadne-infrastructure.eu/search?keyword=archaeology [16 records]
http://portal.ariadne-infrastructure.eu/search?keyword=Archaeology [39,859 records]

Place is case sensitive:
http://portal.ariadne-infrastructure.eu/search?spatial=LONDON [2 records]
http://portal.ariadne-infrastructure.eu/search?spatial=London [48 records]

There are many variations of essentially the same place. Ideally, standardisation via a gazetteer would be applied. Places need significant data cleansing anyway, as do languages:
http://portal.ariadne-infrastructure.eu/search?language=en [1,687,085 records]
http://portal.ariadne-infrastructure.eu/search?language=eng [724 records]
http://portal.ariadne-infrastructure.eu/search?language=english [6 records]
The language filter also shows an inappropriate set of options, including "resource.language.summary".
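As an illustration of the kind of cleansing the language filter would need, here is a minimal Python sketch. The alias table and the function names are hypothetical and not part of the portal or MoRe; only the three record counts are taken from the comment above.

```python
# Hypothetical lookup table: variants seen in the filter -> one canonical code.
LANGUAGE_ALIASES = {
    "en": "en",
    "eng": "en",
    "english": "en",
}


def normalize_language(raw):
    """Return a canonical language code, or None if the value is not a known language."""
    return LANGUAGE_ALIASES.get(raw.strip().lower())


def merge_language_counts(counts):
    """Sum record counts per canonical code and collect unrecognised labels."""
    merged, rejected = {}, []
    for raw, n in counts.items():
        code = normalize_language(raw)
        if code is None:
            rejected.append(raw)
        else:
            merged[code] = merged.get(code, 0) + n
    return merged, rejected


# Counts quoted above for the three spellings of English.
merged, rejected = merge_language_counts({"en": 1687085, "eng": 724, "english": 6})
print(merged)  # {'en': 1687815}
```

A value such as "resource.language.summary" is not in the table, so it would end up in the rejected list rather than being offered as a filter option.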

Dating includes raw terms such as: secc. V1 a.C Middle Ages, MEDIEVAL RECENT sec. VI a.C.;DTSI=599 a.C.;DTSF=550 a.C. Ideally some standardisation could be applied.

Rights just seems to be an index of word frequencies from the copyright wording? Not useful anyway! e.g. it offers filter options such as "on", "and", "for", "has" and "is".

Doug (and Ceri)

jfihn commented 7 years ago

@eafiontzi Do we have normalization of Keyword and Native Subject in MORe, i.e. case normalization?

eafiontzi commented 7 years ago

No, there has been no need for this kind of filter inside MoRe so far. We are now working on the Native Subject with the terms of exclusion; I hope the results will be satisfactory afterwards.

eafiontzi commented 7 years ago

Regarding the case sensitivity, it is a bit more complex, because some words may need an initial capital letter while others may not. For example, in the native subject we have "Field Visit"; would that become "Field visit"? And if we have a subject that contains a town name or a country name, would it be lowercase?

What about places? Should "San Severino Marche" be "San severino marche"? And if we choose that every word has the first capital letter should "San Benedetto del Tronto" be "San Benedetto Del Tronto"? There are many subjects with the word "and" as well.
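To make the casing question concrete, here is a minimal Python sketch of one possible display rule: capitalise every word except a short list of particles. The particle list and the function name are hypothetical, and a real list would need review for each language in the data.

```python
# Hypothetical particles that stay lowercase unless they start the label.
LOWERCASE_PARTICLES = {"del", "di", "of", "and"}


def display_label(raw):
    """Capitalise each word, keeping listed particles in lowercase."""
    out = []
    for i, word in enumerate(raw.split()):
        lower = word.lower()
        if i > 0 and lower in LOWERCASE_PARTICLES:
            out.append(lower)
        else:
            out.append(lower.capitalize())
    return " ".join(out)


print(display_label("SAN BENEDETTO DEL TRONTO"))  # San Benedetto del Tronto
print(display_label("san severino marche"))       # San Severino Marche
```

This sketch does not handle apostrophes or hyphens inside a word, which is part of why a single casing rule is hard to agree on.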

I believe we need to make a general decision before starting to republish the content.

jfihn commented 7 years ago

@dstudhope Any insights on Eleni's comment?

borsna commented 7 years ago

I agree with @eafiontzi. It will look a lot nicer if we use the same method for doing lowercase (perhaps with uppercase on the first character).

This will mess up things like "The University of Gothenburg", but I think it will be a lot nicer from a search/filter point of view to get identical subjects harmonized.

cbinding commented 7 years ago

You don’t have to republish content or change any of the actual data items to achieve case-insensitive search; the resource data labels can be different from the data actually used within the index. ElasticSearch already has the facilities to do all of this: the ‘search analyzer’ needs to be configured to use lowercase terms, to normalize whitespace and punctuation, and to address character encoding issues and anything else that might affect the search results. Recreate the inverted index using the custom analyzer, and use the same custom analyzer during search to ensure correct results. Also, if you're using 'term' queries rather than 'match' queries then it's doing an exact match, ignoring any analyzer. See: https://medium.com/@lefloh/elasticsearch-and-case-sensitive-term-queries-6f6c516aebed#.nt7vk2w7r and https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
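The configuration described above could look roughly like the following Python sketch, which only builds the request bodies as plain dictionaries and does not talk to a cluster. The analyzer name `folded_keyword` is hypothetical, using `nativeSubject` as the index field name is an assumption based on the URL parameter, and the exact mapping syntax depends on the Elasticsearch version in use.

```python
def build_index_settings(analyzer_name="folded_keyword"):
    """Index settings declaring a custom analyzer (the name is hypothetical)."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "keyword",  # keep the whole value as one token
                        "filter": ["lowercase", "asciifolding", "trim"],
                    }
                }
            }
        }
    }


def match_query(field, text):
    """A match query: the query text is analyzed before it is compared."""
    return {"query": {"match": {field: text}}}


def term_query(field, text):
    """A term query: the query text is compared as-is, without analysis."""
    return {"query": {"term": {field: text}}}


def fold(text):
    """Local stand-in for the trim and lowercase steps only (no ASCII folding)."""
    return text.strip().lower()


print(fold("HOUSE") == fold(" House "))  # True
print(match_query("nativeSubject", "House"))
```

With an analyzer of this kind applied at both index and search time, "House" and "HOUSE" would resolve to the same indexed token when a match query is used.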

eafiontzi commented 7 years ago

That would solve the problem for searching terms, but the issue appears mostly in aggregation buckets. The only way I have found to go round it is to add an extra analyzer in the ES mapping (you can see here https://qbox.io/blog/elasticsearch-aggregation-custom-analyzer) with lowercase filter. This would cause two issues:

cbinding commented 7 years ago

OK, when you mentioned republishing the content you meant recreating the ElasticSearch index, yes I agree this definitely needs to be done, using a custom analyzer to fix the case sensitivity. I would suggest also that uppercase filters should resolve the concerns about place names in the earlier comments; there is a lot of uppercase data present already.

eafiontzi commented 7 years ago

So you suggest using a lowercase filter for some filters and an uppercase filter for others, in conjunction with the not_analyzed index? I am not sure how good it will look to have some filters in lowercase and others in uppercase. In any case, republishing all the data (~1,842,329 resources) in a new index will surely take up a lot of time.

cbinding commented 7 years ago

No, of course just standardize one way or the other, so uppercase for everything to resolve the place name issues you highlighted. I don't know exactly how long regenerating the index would take, as it depends on many factors but in my own work on a desktop PC it is typically only a matter of minutes. 2 million records does not sound a lot at all, and it is absolutely required.
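The effect of standardising on uppercase can be simulated locally. The following Python sketch merges filter buckets whose labels differ only by case, using the record counts quoted at the top of this issue; the function name is hypothetical. Summing the counts assumes that no record carries both spellings, so the totals are upper bounds rather than what the index would necessarily report.

```python
from collections import Counter


def merge_buckets(buckets):
    """Merge bucket counts whose labels are equal once uppercased."""
    merged = Counter()
    for label, count in buckets.items():
        merged[label.upper()] += count
    return dict(merged)


# Counts quoted earlier in this issue for nativeSubject and spatial.
buckets = {"House": 13600, "HOUSE": 72585, "London": 48, "LONDON": 2}
print(merge_buckets(buckets))  # {'HOUSE': 86185, 'LONDON': 50}
```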

eafiontzi commented 7 years ago

OK, so if we all agree on having only uppercase for the filters "Subject" (Derived Subject), "Original Subject" (Native Subject), "Keyword" and "Dating", we can move forward with this. "Place" is removed from the filters in the latest version of the portal. "Publisher" and "Contributor" should be OK the way they are now?

Creating the new index with the proper mapping should take just a few seconds, though republishing all the content will need a few days, because it must be done through MoRe. But if this is a necessity, it's a cost we can accept. ;) Like before, we will create a new index, do the appropriate checks and then change the alias of the current index when everything is in place.

Are we in agreement with this plan?
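The final step of the plan, switching the alias once the new index has been checked, could be sketched as below. The function builds the body for the Elasticsearch `_aliases` endpoint as a plain Python dictionary; the alias and index names are hypothetical placeholders, not the portal's real names.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases that moves an alias from one index to another."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names, for illustration only.
body = alias_swap_actions("resources", "resources_v1", "resources_v2")
print(json.dumps(body, indent=2))
```

Both actions are sent in one request, so searches through the alias move from the old index to the new one in a single step.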