cessda / cessda.cdc.versions

Issue track and wiki for the CESSDA Data Catalogue
https://datacatalogue.cessda.eu/
Apache License 2.0

Cleaning the country filter with ISO code #214

Closed cessda-bitbucket-importer closed 3 years ago

cessda-bitbucket-importer commented 4 years ago

Original report on BitBucket by Taina Jääskeläinen.


@john-shepherdson Test in the dev version how many records would be left in the country filter if only those records that have an ISO two-letter code (included in the CV) in the nation element were included.

Use country name and ISO code vocabulary for this testing. It also gives the country name in English for the filter.

cessda-bitbucket-importer commented 4 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


see also #30

cessda-bitbucket-importer commented 4 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


See Countries tab of https://docs.google.com/spreadsheets/d/1Nv5BWA-Cy8ToO7qxd5HqrerDhFL0AGIox5vzV7rABjg/edit#gid=697678413

cessda-bitbucket-importer commented 4 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


The Java standard library contains the Locale.getISOCountries(Locale.IsoCountryCode.PART1_ALPHA2) method which lists all two letter country codes. This may be suitable versus defining our own ISO code list.

We can extend the extraction of the two letter code with filter(code -> Locale.getISOCountries(Locale.IsoCountryCode.PART1_ALPHA2).contains(code)), which will filter out invalid ISO labels.
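A minimal sketch of that filter step, assuming Java 9 or later (where `Locale.IsoCountryCode` was introduced). The class name `IsoCountryFilter` and the method `keepValid` are hypothetical and are not taken from the CDC code base; the input list stands in for whatever the extraction step produces.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not the CDC implementation.
final class IsoCountryFilter {
    // Fetch the set of ISO 3166-1 alpha-2 codes once instead of on every filter call.
    private static final Set<String> ISO_CODES =
            Locale.getISOCountries(Locale.IsoCountryCode.PART1_ALPHA2);

    // Keeps only values that exactly match an ISO two-letter code (case-sensitive).
    static List<String> keepValid(List<String> extractedCodes) {
        return extractedCodes.stream()
                .filter(ISO_CODES::contains)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Country names and lower-case codes are dropped; prints [FI, SE]
        System.out.println(keepValid(List.of("FI", "Finland", "de", "SE")));
    }
}
```

Holding the set in a constant avoids rebuilding it for every element, which the inline lambda form would do.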

cessda-bitbucket-importer commented 4 years ago

Original comment by Taina Jääskeläinen.


Easier to use the locale than to produce and maintain our own list.

cessda-bitbucket-importer commented 4 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


At some point, don’t we want the SPs to use a CV of Countries when creating their metadata records so we get consistent code values entered?

cessda-bitbucket-importer commented 3 years ago

Original comment by Taina Jääskeläinen.


Matthew, at this moment we do not know how many records have an ISO code for country. Therefore, before deciding whether to clean the filter using the ISO code and corresponding country names, we first need to find out how many records would be left in the filter, that is, whether this is a feasible solution at all. So this issue is about testing.

Just for this testing, could you use the CV? But if it is quicker to do it with the Java library locale, please use that.

We’ll think of next steps when we see how many records are left.

cessda-bitbucket-importer commented 3 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


I’ve introduced some logging to see how many records have valid ISO country codes. This currently only considers 2 character country codes and uses the Java standard library to perform the comparison.

Not having a country element will also fail this check.

These checks are done at a language level, i.e. if a record has 3 languages of which 2 have country elements that have valid ISO codes, this check will return 2.

Repositories that failed (i.e. returned no records) have been omitted.

[SASD] 6 studies out of 8 have valid ISO country codes

[APIS] 15 studies out of 15 have valid ISO country codes

[SoDaNet] 37 studies out of 40 have valid ISO country codes

[ProgedoSciencesPo] 0 studies out of 330 have valid ISO country codes

[UniData] 0 studies out of 76 have valid ISO country codes

[SND] 1130 studies out of 1374 have valid ISO country codes

[AUSSDA] 0 studies out of 729 have valid ISO country codes

[ADP] 660 studies out of 758 have valid ISO country codes

[FSD] 3094 studies out of 3102 have valid ISO country codes

[DANS] 0 studies out of 4188 have valid ISO country codes

[DNA] 0 studies out of 3412 have valid ISO country codes

[UKDS] 0 studies out of 8746 have valid ISO country codes

[GESIS] 7404 studies out of 12342 have valid ISO country codes
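For illustration, a tally in the format above could be produced along these lines. This is only a sketch: the class `IsoCodeStats`, its methods, and the idea of passing one country value per language version of a record are assumptions, not the logging that was actually added. A `null` entry stands for a language version with no country element, which fails the check.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Objects;
import java.util.Set;

// Hypothetical helper, not the CDC implementation.
final class IsoCodeStats {
    private static final Set<String> ISO_CODES =
            Locale.getISOCountries(Locale.IsoCountryCode.PART1_ALPHA2);

    // Counts language versions whose country value is a valid two-letter ISO code.
    static long countValid(List<String> codePerLanguageVersion) {
        return codePerLanguageVersion.stream()
                .filter(Objects::nonNull)      // a missing country element fails the check
                .filter(ISO_CODES::contains)   // exact, case-sensitive match
                .count();
    }

    // Formats one summary line per repository.
    static String report(String repository, List<String> codePerLanguageVersion) {
        return String.format("[%s] %d studies out of %d have valid ISO country codes",
                repository, countValid(codePerLanguageVersion), codePerLanguageVersion.size());
    }

    public static void main(String[] args) {
        // Arrays.asList is used because List.of does not accept null elements.
        System.out.println(report("Example", Arrays.asList("FI", null, "Suomi")));
    }
}
```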

cessda-bitbucket-importer commented 3 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Please add stats for CSDA, NSD and SODHA

cessda-bitbucket-importer commented 3 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


[SODHA] 0 studies out of 0 have valid ISO country codes

[NSD] 2097 studies out of 2258 have valid ISO country codes

[CSDA] 0 studies out of 0 have valid ISO country codes

cessda-bitbucket-importer commented 3 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Looking at the staging instance, CSDA has 881 studies in English and SODHA has 40 studies in English.

cessda-bitbucket-importer commented 3 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Need to log a warning if a record does not have a valid 2 letter country code.
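One possible shape for such a warning, sketched with `java.util.logging` so that it is self-contained. The class `CountryCodeCheck`, the method `checkAndWarn`, and the `studyId` parameter are hypothetical; the project's actual logging framework and message wording are not stated in this thread.

```java
import java.util.Locale;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical helper, not the CDC implementation.
final class CountryCodeCheck {
    private static final Logger LOG = Logger.getLogger(CountryCodeCheck.class.getName());
    private static final Set<String> ISO_CODES =
            Locale.getISOCountries(Locale.IsoCountryCode.PART1_ALPHA2);

    // Returns true for a valid code; otherwise logs a warning and returns false.
    static boolean checkAndWarn(String studyId, String countryCode) {
        boolean valid = countryCode != null && ISO_CODES.contains(countryCode);
        if (!valid) {
            // The supplier form only builds the message if warnings are enabled.
            LOG.warning(() -> String.format(
                    "Study %s has no valid 2 letter ISO country code (found: %s)",
                    studyId, countryCode));
        }
        return valid;
    }
}
```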

cessda-bitbucket-importer commented 3 years ago

Original comment by Taina Jääskeläinen.


In the User Group meeting, we made the final decision that the country filter must be cleaned using the ISO codes. We will inform service providers.

Matthew to take a look at what kind of country names are in the Java library or elsewhere on the net, for transforming the codes into country names in English for the filter. We want to have the country names in the filter in the short form, for instance, ‘Congo’ instead of ‘The Democratic Republic of Congo’ or at least in the form of ‘Congo, the Democratic Republic of Congo’, so that the alphabetical list makes sense.

So country names more like here: https://en.wikipedia.org/wiki/ISO_3166-1

or like in this CESSDA CV for country codes and names:

https://vocabularies.cessda.eu/editor/vocabulary/CountryNamesAndCodes?lang=en

Of course it is easier if there is a functional list maintained elsewhere that the system can use and we can refer to in metadata instructions.

cessda-bitbucket-importer commented 3 years ago

Original comment by Taina Jääskeläinen.


@matthew-morris-cessda could you please take a look at these country code and name lists? I have a webinar with CESSDA metadata contacts on Thursday and, if possible, would like to say something about this.

cessda-bitbucket-importer commented 3 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


I’ve created a PR with the filter implemented.

I decided not to use the Java Standard Library because of unexpected name choices (e.g. Hong Kong becoming Hong Kong SAR China, which is not an ISO name).

Because of this, I’ve chosen to use https://github.com/TakahikoKawasaki/nv-i18n as the source for the country names.

[link to pull request removed](link to pull request removed)
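To see which English names the Java standard library produces for a given code, a small check like the following can be run. It is a sketch only: the class `JdkCountryNames` is hypothetical, and the exact strings printed depend on the JDK version and its locale data, so no output is shown here.

```java
import java.util.Locale;

// Hypothetical helper for inspecting JDK country display names.
final class JdkCountryNames {
    // Builds a locale with only a region set and asks for its English display name.
    static String englishName(String alpha2) {
        Locale region = new Locale.Builder().setRegion(alpha2).build();
        return region.getDisplayCountry(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        for (String code : new String[] {"HK", "CD", "GB"}) {
            System.out.println(code + " -> " + englishName(code));
        }
    }
}
```

Comparing this output with the ISO 3166-1 short names is one way to judge whether the JDK names suit the filter.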

cessda-bitbucket-importer commented 3 years ago

Original comment by Taina Jääskeläinen.


How does this library handle the GB/UK issue, or the EL/GR issue for Greece? Do both codes (GB and UK) give United Kingdom as the country name, and both EL and GR give Greece, or is only one code of each pair accepted?

Is there somewhere I can see the list of country names?

Have you implemented this in staging yet?

cessda-bitbucket-importer commented 3 years ago

Original comment by Matthew Morris (GitHub: matthew-morris-cessda).


The changes have been merged.

Regarding the GB/UK issue, only the registered ISO codes will be accepted. These codes must have the correct capitalisation (i.e. DE would be accepted, but de would not).

This change has also been implemented in staging.
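The acceptance rule described in this comment (registered codes only, exact capitalisation) can be sketched as follows. The merged change uses nv-i18n; this sketch substitutes the Java standard library's code list as a stand-in so that it runs without extra dependencies, and the class `StrictIsoCheck` is hypothetical.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the merged check, using the JDK code list instead of nv-i18n.
final class StrictIsoCheck {
    private static final Set<String> ISO_CODES =
            Locale.getISOCountries(Locale.IsoCountryCode.PART1_ALPHA2);

    // No trimming or upper-casing: the value must match a registered code exactly.
    static boolean isAccepted(String code) {
        return code != null && ISO_CODES.contains(code);
    }

    public static void main(String[] args) {
        // GB, GR and DE are registered codes; UK, EL and lower-case de are not.
        for (String code : new String[] {"GB", "UK", "GR", "EL", "DE", "de"}) {
            System.out.println(code + " accepted: " + isAccepted(code));
        }
    }
}
```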

cessda-bitbucket-importer commented 3 years ago

Original comment by Taina Jääskeläinen.


So you mean GB for United Kingdom and GR for Greece? Just checking, for the CDC Information webinar tomorrow.

cessda-bitbucket-importer commented 3 years ago

Original comment by Taina Jääskeläinen.


This can be closed; the filter is working as it should. I will open a separate issue about not changing the country metadata in the detailed study view.