Closed cessda-bitbucket-importer closed 4 years ago
Original comment by John Shepherdson (GitHub: john-shepherdson).
I’ll fix the label issues (via the linked issue), but cannot fix the behavioural ones, which will have to wait for the next maintenance phase.
Original comment by Taina Jääskeläinen.
I have raised an issue with the service providers, asking them to check whether they have titles beginning with brackets or with single or double quotation marks.
Original comment by John Shepherdson (GitHub: john-shepherdson).
The Elasticsearch config file for each language points to the default stopword list for that language (where available):
czech, danish, german, greek, english, finnish, french, hungarian, italian, dutch, norwegian, portuguese, swedish.
Elasticsearch provides the following predefined list of stopword languages:
_arabic_, _armenian_, _basque_, _brazilian_, _bulgarian_, _catalan_, _czech_, _danish_, _dutch_, _english_, _finnish_, _french_, _galician_, _german_, _greek_, _hindi_, _hungarian_, _indonesian_, _irish_, _italian_, _latvian_, _norwegian_, _persian_, _portuguese_, _romanian_, _russian_, _sorani_, _spanish_, _swedish_, _thai_, _turkish_.
So no predefined stopword lists are available for estonian, slovakian and slovenian.
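The check above can be reproduced with a small set lookup. This is only a sketch: the predefined list is copied from this comment rather than queried from Elasticsearch, and the names `PREDEFINED`, `CONFIGURED` and `stopwords_setting` are hypothetical, not taken from the actual config files.

```python
# Predefined stopword languages, as quoted in this thread
# (not fetched from a running Elasticsearch instance).
PREDEFINED = {
    "arabic", "armenian", "basque", "brazilian", "bulgarian", "catalan",
    "czech", "danish", "dutch", "english", "finnish", "french", "galician",
    "german", "greek", "hindi", "hungarian", "indonesian", "irish",
    "italian", "latvian", "norwegian", "persian", "portuguese", "romanian",
    "russian", "sorani", "spanish", "swedish", "thai", "turkish",
}

# Languages mentioned in this comment: the thirteen with a default list,
# plus the three reported as having none.
CONFIGURED = [
    "czech", "danish", "german", "greek", "english", "finnish", "french",
    "hungarian", "italian", "dutch", "norwegian", "portuguese", "swedish",
    "estonian", "slovakian", "slovenian",
]


def stopwords_setting(language):
    """Return the predefined list reference (e.g. "_english_"), or None if there is none."""
    return f"_{language}_" if language in PREDEFINED else None


# Languages for which no predefined stopword list can be referenced.
missing = sorted(lang for lang in CONFIGURED if stopwords_setting(lang) is None)
print(missing)
```

For the three languages without a predefined list, a custom stopword list would have to be supplied if stopword filtering is wanted.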
Original comment by John Shepherdson (GitHub: john-shepherdson).
1 - fixed via #154
2 - TODO (see also https://github.com/cessda/cessda.metadata.office/issues/55 and https://github.com/cessda/cessda.metadata.office/issues/56)
3 - fixed via #204
Original comment by Taina Jääskeläinen.
Adding a sub-issue number 4:
Looking at Z-A sorting, it seems that if a title starts with a lowercase letter rather than a capital letter, the sorting goes haywire. Could the system be taught to treat lowercase and capital letters alike?
Sometimes a title needs to start with a lowercase letter, for instance elderLUCID: London UCL Older adults' clear speech in interaction database. Here elderLUCID is the database name.
Original comment by John Shepherdson (GitHub: john-shepherdson).
@matthew-morris-cessda Are you able to fix this? If so, please self-assign.
Original comment by Taina Jääskeläinen.
https://github.com/cessda/cessda.metadata.office/issues/56 is fixed and closed.
Original comment by Matthew Morris (GitHub: matthew-morris-cessda).
I’ve discovered the root cause:
Computers represent letters as numbers; for example, the letter G is represented by the number 71.
This issue arises because lowercase letters are represented by larger numbers than capital letters (e.g. g is represented by the number 103). Elasticsearch sorts by these numbers by default, so every title starting with a lowercase letter sorts after all titles starting with a capital letter.
This has been fixed as of https://github.com/cessda/cessda.cdc.osmh-indexer.cmm/commit/8940f3543f0a3e668c2ed8ac68dcb00e218ba45f but a reindex is required in order for the fix to take effect.
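The effect described above can be reproduced outside Elasticsearch. The sketch below uses plain Python sorting, which also compares character codes by default. The sample titles other than elderLUCID are made up, and the `index_settings` dictionary shows one common way to get case-insensitive sorting in Elasticsearch (a keyword field with a lowercase normalizer). It is an assumption for illustration, with hypothetical names `lowercase_sort` and `title_sort`, and is not necessarily what the linked commit does.

```python
# Character codes mentioned in the comment above.
print(ord("G"), ord("g"))  # 71 103

# Made-up sample titles, plus the elderLUCID example from this issue.
titles = [
    "Zebra survey",
    "elderLUCID: London UCL Older adults' clear speech in interaction database",
    "Ageing study",
    "Gender panel",
]

# Default sort compares character codes, so the lowercase title lands last.
by_code = sorted(titles)

# Folding case before comparing treats lowercase and capital letters alike.
case_insensitive = sorted(titles, key=str.casefold)

for title in case_insensitive:
    print(title)

# Hypothetical index settings (an assumption, not the actual fix):
# a keyword sub-field normalised to lowercase, intended for sorting.
index_settings = {
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercase_sort": {"type": "custom", "filter": ["lowercase"]}
            }
        }
    },
    "mappings": {
        "properties": {
            "title_sort": {"type": "keyword", "normalizer": "lowercase_sort"}
        }
    },
}
```

A normalizer is applied when documents are indexed, so a change to it only affects sorting after the data has been reindexed.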
Original comment by Matthew Morris (GitHub: matthew-morris-cessda).
[link to pull request removed](link to pull request removed)
Original comment by Matthew Morris (GitHub: matthew-morris-cessda).
Original report on BitBucket by Taina Jääskeläinen.
Alphabetical ordering by titles: some issues.