USGCRP / gcis

Global Change Information System
https://data.globalchange.gov
Other
20 stars 16 forks source link

Populate 'Organization Aliases' for non-native English speaking countries #462

Closed amruelama closed 7 years ago

amruelama commented 7 years ago

The list of non U.S. organizations is available in Content Tracker.

After the 'Alias' field is populated, we can add this table in our system - refer #328

lomky commented 7 years ago

Just fyi, there is no extra programmatic effort if an organization has multiple aliases. I would just need them separated somehow (i.e. separated by semicolons, one per column)

amruelama commented 7 years ago

@lomky Thanks for the heads-up. I haven't found any multiple aliases organization so far. But I will add them to different column(s) if I find any.

amruelama commented 7 years ago

There are currently 825 organizations in GCIS with countries listed, that do not speak English as native language. These organizations need QA to add aliases @Zullmira

amruelama commented 7 years ago

@lomky Are there any specifications for aliases? For example, can we use special characters or a different language script?

lomky commented 7 years ago

@amruelama We are a UTF-8 application, both at the code level and database level. As best I understand it, we can support every language defined by the Unicode standards. Here is a sampler page with examples of UTF-8 character support. It has a Tamil poem! :smile:

amruelama commented 7 years ago

This is cool! There are many Japanese organizations in our database with aliases in Japanese script. I'm glad that I don't have to transliterate Japanese now.

lomky commented 7 years ago

If we could indicate what language the alias is in, that would be very helpful down the line for semantic needs.

amruelama commented 7 years ago

These are options we have:

  1. ISO 639-1: two-letter codes, one per language for ISO 639 macro language
  2. ISO 639-2/T: three-letter codes, for the same languages as 639-1
  3. ISO 639-2/B: three-letter codes, mostly the same as 639-2/T, but with some codes derived from English names rather than native names of languages (in the following table, these differing codes are highlighted in boldface)
  4. ISO 639-3: three-letter codes, the same as 639-2/T for languages, but with distinct codes for each variety of an ISO 639 macro language

Can we use ISO 639-1 @lomky ?

lomky commented 7 years ago

@amruelama It looks like ISO 639-2 is the one we want. The Library of Congress remarks that it was developed for bibliographic purposes, which we fall under pretty nicely. In the rare case the language isn't in 2, we can use mis as indicated by the standard's special cases.

I favor the T variant, going by the consensus I found on the library of congress write up and the wikipedia article.

amruelama commented 7 years ago

@lomky Makes sense. @Zullmira We can add ISO 639-2 codes for all the aliases we are adding to the organizations - see new column.

lomky commented 7 years ago

@amruelama Sorry, it looks like I was mistaken for the purposes of our semantic usage.

The DublinCore insists on the RFC-3066 standard:

  1. ISO 639-1 2 letter
  2. ISO 639-2T 3-letter if no 639-1 exists
amruelama commented 7 years ago

So we use 2 letter code (639-1); 3 letters (639-2T) if no 2 letter code exits, correct?

Zullmira commented 7 years ago

I just checked out the wiki for the ISO 639-1/2 codes. So I will reference the wiki list for inputting the language codes in their column. Correct? @amruelama

amruelama commented 7 years ago

Apparently 100+ organizations' aliases have already been added in the identifier and name fields. We need to move these names to 'Aliases (alternate field)' and add their English names in identifier and Org_name.

amruelama commented 7 years ago

Populated the aliases and the new additions will go to prod after next code release

lomky commented 7 years ago

Reopening to serve for my actual load into the table. To be completed after next release.

lomky commented 7 years ago

Given 487 alises to populate.

Skipping unknown organization: 'springer-berlin-heidelberg'

gcis=# COPY organization_alternate_name(organization_identifier,alternate_name,language) FROM '/tmp/Organization_aliases_single_column_as_csv.csv' DELIMITER E'\t' CSV HEADER;
COPY 486
gcis=# select * from organization_alternate_name limit 5;
               organization_identifier                |                 alternate_name                  | language | deprecated | identifier
------------------------------------------------------+-------------------------------------------------+----------+------------+------------
 eth-zurich-institute-biochemistry-pollutant-dynamics | Institut für Biogeochemie und Schadstoffdynamik | de       | f          |          1
 geological-survey-japan                              | 産総研地質調査総合センター                      | ja       | f          |          2
 frontier-research-center-global-change               | 地球環境フロンティア研究センター                | ja       | f          |          3
 stockholm-university                                 | Stockholms Universitet                          | sv       | f          |          4
 banaras-hindu-university                             | काशी हिदू विश्वविद्यालय                           | hi       | f          |          5
(5 rows)
gcis=# select count(1) from organization_alternate_name;
 count
-------
   486
(1 row)