Closed amruelama closed 7 years ago
Just fyi, there is no extra programmatic effort if an organization has multiple aliases. I would just need them separated somehow (i.e. separated by semicolons, one per column)
@lomky Thanks for the heads-up. I haven't found any multiple aliases organization so far. But I will add them to different column(s) if I find any.
There are currently 825 organizations in GCIS with countries listed, that do not speak English as native language. These organizations need QA to add aliases @Zullmira
@lomky Are there any specifications for aliases? For example, can we use special characters or a different language script?
@amruelama We are a UTF-8 application, both at the code level and database level. As best I understand it, we can support every language defined by the Unicode standards. Here is a sampler page with examples of UTF-8 character support. It has a Tamil poem! :smile:
This is cool! There are many Japanese organizations in our database with aliases in Japanese script. I'm glad that I don't have to transliterate Japanese now.
If we could indicate what language the alias is in, that would be very helpful down the line for semantic needs.
These are options we have:
Can we use ISO 639-1 @lomky ?
@amruelama It looks like ISO 639-2 is the one we want. The Library of Congress remarks that it was developed for bibliographic purposes, which we fall under pretty nicely. In the rare case the language isn't in 2, we can use mis
as indicated by the standard's special cases.
I favor the T variant, going by the consensus I found on the library of congress write up and the wikipedia article.
@lomky Makes sense. @Zullmira We can add ISO 639-2 codes for all the aliases we are adding to the organizations - see new column.
@amruelama Sorry, it looks like I was mistaken for the purposes of our semantic usage.
The DublinCore insists on the RFC-3066 standard:
- ISO 639-1 2 letter
- ISO 639-2T 3-letter if no 639-1 exists
So we use 2 letter code (639-1); 3 letters (639-2T) if no 2 letter code exits, correct?
I just checked out the wiki for the ISO 639-1/2 codes. So I will reference the wiki list for inputting the language codes in their column. Correct? @amruelama
Apparently 100+ organizations' aliases have already been added in the identifier and name fields. We need to move these names to 'Aliases (alternate field)' and add their English names in identifier and Org_name.
Populated the aliases and the new additions will go to prod after next code release
Reopening to serve for my actual load into the table. To be completed after next release.
Given 487 alises to populate.
Skipping unknown organization: 'springer-berlin-heidelberg'
gcis=# COPY organization_alternate_name(organization_identifier,alternate_name,language) FROM '/tmp/Organization_aliases_single_column_as_csv.csv' DELIMITER E'\t' CSV HEADER;
COPY 486
gcis=# select * from organization_alternate_name limit 5;
organization_identifier | alternate_name | language | deprecated | identifier
------------------------------------------------------+-------------------------------------------------+----------+------------+------------
eth-zurich-institute-biochemistry-pollutant-dynamics | Institut für Biogeochemie und Schadstoffdynamik | de | f | 1
geological-survey-japan | 産総研地質調査総合センター | ja | f | 2
frontier-research-center-global-change | 地球環境フロンティア研究センター | ja | f | 3
stockholm-university | Stockholms Universitet | sv | f | 4
banaras-hindu-university | काशी हिदू विश्वविद्यालय | hi | f | 5
(5 rows)
gcis=# select count(1) from organization_alternate_name;
count
-------
486
(1 row)
The list of non U.S. organizations is available in Content Tracker.
After the 'Alias' field is populated, we can add this table in our system - refer #328