cessda / cessda.cdc.versions

Issue tracker and wiki for the CESSDA Data Catalogue
https://datacatalogue.cessda.eu/
Apache License 2.0

Alphabetical order issues in sorting #171

Closed cessda-bitbucket-importer closed 4 years ago

cessda-bitbucket-importer commented 4 years ago

Original report on BitBucket by Taina Jääskeläinen.


Alphabetical ordering by titles: some issues.

  1. The labels are the wrong way round, I think (A-Z is actually Z-A).
  2. A-Z sorting goes haywire if the title starts with ', “, ( or *, or with a 4-digit number (a year). Can the system easily be taught to ignore these? In any case I will at some point raise issues about these with the SPs, except for years, which are allowed.
  3. What about ‘A’ and ‘The’: does English sort by them or by the first ‘real’ word in the title? If the latter, the system should ignore the A and The.
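
To make points 2 and 3 concrete, here is a minimal Python sketch of a sort key that skips leading punctuation and leading English articles. It is only an illustration of the requested behaviour, not the catalogue's implementation: the names `LEADING_ARTICLES` and `title_sort_key` and the sample titles are made up for this example.

```python
import re

# Hypothetical, English-only list of leading articles to skip when sorting.
LEADING_ARTICLES = ("a", "an", "the")


def title_sort_key(title: str) -> str:
    """Return a sort key that ignores leading punctuation and English articles."""
    # Drop leading characters that are neither letters nor digits
    # (quotation marks, brackets, asterisks, underscores, whitespace).
    stripped = re.sub(r"^[\W_]+", "", title)
    # Split off the first word and drop it if it is an article
    # and something follows it.
    first, _, rest = stripped.partition(" ")
    if first.casefold() in LEADING_ARTICLES and rest:
        stripped = rest.lstrip()
    # Case-fold so that capital and lowercase letters sort together.
    return stripped.casefold()


# Made-up sample titles; a leading year is kept, since years are allowed.
titles = [
    '"Quoted" survey',
    "The Zebra study",
    "(Bracketed) panel",
    "A health survey",
    "1999 election study",
]
sorted_titles = sorted(titles, key=title_sort_key)
```

With this key, the year-prefixed title sorts first (digits come before letters), and the remaining titles sort by their first 'real' word.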

cessda-bitbucket-importer commented 4 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


See also #154

cessda-bitbucket-importer commented 4 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


I’ll fix the label issues (via the linked issue), but cannot fix the behavioural ones, which will have to wait for the next maintenance phase.

cessda-bitbucket-importer commented 4 years ago

Original comment by Taina Jääskeläinen.


I have raised an issue with the service providers about titles that begin with brackets or with single or double quotation marks.

cessda-bitbucket-importer commented 4 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


The Elasticsearch config file for each language points to the default stopword list for that language (where available):

czech, danish, german, greek, english, finnish, french, hungarian, italian, dutch, norwegian, portuguese, swedish.

Elasticsearch provides the following predefined list of stopword languages:

_arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_, _catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_, _french_, _galician_, _german_, _greek_, _hindi_, _hungarian_, _indonesian_, _irish_, _italian_, _latvian_, _norwegian_, _persian_, _portuguese_, _romanian_, _russian_, _sorani_, _spanish_, _swedish_, _thai_, _turkish_.

So, no stopword lists are available for Estonian, Slovak and Slovenian.
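
As a rough illustration of the per-language choice described above, the Python sketch below builds an Elasticsearch `stop` token filter definition that uses the predefined list where one exists and falls back to `_none_` (no stopwords) otherwise. The helper name `stop_filter` and the idea of generating the definition from Python are assumptions for this example; the actual CDC config files are not shown in this thread.

```python
# Predefined stopword languages, as listed above.
PREDEFINED_STOPWORD_LANGUAGES = {
    "arabic", "armenian", "basque", "brazilian", "bulgarian", "catalan",
    "czech", "danish", "dutch", "english", "finnish", "french", "galician",
    "german", "greek", "hindi", "hungarian", "indonesian", "irish",
    "italian", "latvian", "norwegian", "persian", "portuguese", "romanian",
    "russian", "sorani", "spanish", "swedish", "thai", "turkish",
}


def stop_filter(language: str) -> dict:
    """Build a 'stop' token filter definition for the given language.

    Hypothetical helper: uses the predefined list when one exists,
    otherwise '_none_', which disables stopword removal.
    """
    name = language.lower()
    stopwords = f"_{name}_" if name in PREDEFINED_STOPWORD_LANGUAGES else "_none_"
    return {"type": "stop", "stopwords": stopwords}


english_filter = stop_filter("english")    # uses the _english_ list
estonian_filter = stop_filter("estonian")  # no predefined list in the list above
```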

cessda-bitbucket-importer commented 4 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


1 - fixed via #154

2 - TODO (see also https://github.com/cessda/cessda.metadata.office/issues/55 and https://github.com/cessda/cessda.metadata.office/issues/56)

3 - fixed via #204

cessda-bitbucket-importer commented 4 years ago

Original comment by Taina Jääskeläinen.


Adding a sub-issue number 4:

Looking at Z-A sorting, it seems that the sorting goes haywire if the title starts with a lowercase letter rather than a capital letter. Could the system be taught to treat lowercase and capital letters alike?

Sometimes a title needs to start with a lowercase letter, for instance elderLUCID: London UCL Older adults' clear speech in interaction database. Here elderLUCID is the database name.

cessda-bitbucket-importer commented 4 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


@matthew-morris-cessda Are you able to fix this? If so, please self-assign.

cessda-bitbucket-importer commented 4 years ago

Original comment by Taina Jääskeläinen.


https://github.com/cessda/cessda.metadata.office/issues/56 is fixed and closed.

cessda-bitbucket-importer commented 4 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


I’ve discovered the root cause:

Computers represent letters as numbers; for example, the letter G is represented by the number 71.

The issue arises because lowercase letters are represented by larger numbers than capital letters (e.g. g is represented by the number 103). Elasticsearch sorts by these numbers by default.

This has been fixed as of https://github.com/cessda/cessda.cdc.osmh-indexer.cmm/commit/8940f3543f0a3e668c2ed8ac68dcb00e218ba45f but a reindex is required in order for the fix to take effect.
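
The effect is easy to reproduce outside Elasticsearch. The Python sketch below uses made-up titles (apart from elderLUCID, mentioned earlier in this thread) to contrast sorting by raw character numbers with sorting on a case-folded key. It illustrates the root cause only and is not the code from the linked commit.

```python
# Character numbers quoted above: capital letters come before lowercase ones.
code_g_upper = ord("G")  # 71
code_g_lower = ord("g")  # 103

# Made-up sample titles, plus the elderLUCID example from this thread.
titles = [
    "Zoo study",
    "elderLUCID: London UCL Older adults' clear speech in interaction database",
    "Gender survey",
    "apple panel",
]

# Default sort compares character numbers, so every title starting with a
# capital letter sorts before every title starting with a lowercase letter.
by_code_point = sorted(titles)

# Sorting on a case-folded key treats lowercase and capital letters alike.
by_casefold = sorted(titles, key=str.casefold)
```

Here `by_code_point` puts "Gender survey" and "Zoo study" ahead of both lowercase titles, while `by_casefold` gives the expected alphabetical order starting with "apple panel".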

cessda-bitbucket-importer commented 4 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


(link to pull request removed)

cessda-bitbucket-importer commented 4 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


171 - Resolved.PNG

cessda-bitbucket-importer commented 4 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Checked using Swedish alphabet