X-lab2017 / open-digger

Open source analysis tools
https://open-digger.cn
Apache License 2.0
291 stars, 86 forks

data:add country tags to the company #1267

Closed: PureNatural closed this 1 year ago

PureNatural commented 1 year ago

There are 137 companies in this dir.

I have added country information to these companies by querying Wikipedia or the company website.

This data may support the big screen. @zhicheng-ning

I have changed "America" to "United_States" because most map APIs, such as Bing Maps, name the country the United States, so this should make map visualization work easier. The first letter of each word in a country name also needs to be capitalized.
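The naming rule described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the PR: the `ALIASES` table and the `normalize_country` helper are hypothetical, and only the America to United_States mapping comes from the comment itself.

```python
# Hypothetical alias table; only the "America" case is taken from the thread.
ALIASES = {"america": "united states"}


def normalize_country(raw: str) -> str:
    """Return a country label with each word capitalized and joined by '_'."""
    # Treat underscores and whitespace the same way when splitting into words.
    words = raw.replace("_", " ").split()
    key = " ".join(word.lower() for word in words)
    # Map known aliases onto the name that map APIs are assumed to expect.
    key = ALIASES.get(key, key)
    return "_".join(word.capitalize() for word in key.split())


print(normalize_country("America"))
print(normalize_country("south korea"))
```

Keeping the aliases in one table means a later rename only touches a single place.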

PureNatural commented 1 year ago

I have modified the countries of two existing companies.

1) jina_ai (China -> Germany): the company website shows that its headquarters is located in Berlin.

2) openresty (China -> United States): Yichun Zhang is the president and CEO of OpenResty Inc., and I found that OpenResty Inc. is located in the United States.

frank-zsy commented 1 year ago

Great work. The founders of JINA AI and OpenResty Inc. are both Chinese, but the companies are indeed registered abroad. That was my mistake.

frank-zsy commented 1 year ago

And since we are adjusting country names in this commit, I think we should decide which standard to use for naming the countries and how to set the file names.

I suggest we follow ISO 3166-1, which is a standard for country codes and names.

In this standard there is no "South Korea"; the country is listed as "Korea, Republic of", so we may use korea_republic_of.yml as the file name. And for all the name fields in the YAML files, we should use exactly the short English name of each country. If so, we can also add this to the documentation so developers can understand how it works.
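One way the file-name convention suggested here could be derived from a country's short English name is sketched below. The `iso_name_to_filename` helper is hypothetical, and the slug rule (lowercase, with runs of non-alphanumeric characters collapsed to underscores) is an assumption chosen to reproduce the korea_republic_of.yml example, not a rule stated in the thread.

```python
import re


def iso_name_to_filename(short_name: str) -> str:
    """Turn a short English country name into a YAML file name."""
    # Lowercase, then collapse every run of characters outside a-z and 0-9
    # into a single underscore. Non-ASCII letters are also replaced, so names
    # with accents would need a separate decision.
    slug = re.sub(r"[^a-z0-9]+", "_", short_name.lower()).strip("_")
    return f"{slug}.yml"


print(iso_name_to_filename("Korea, Republic of"))
print(iso_name_to_filename("Germany"))
```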

WDYT?

birdflyi commented 1 year ago

I agree with @frank-zsy. Using ISO 3166-1 alpha-2 codes as file names seems to be a good choice.

The International Standard for country codes and codes for their subdivisions

The purpose of ISO 3166 is to define internationally recognized codes of letters and/or numbers that we can use when we refer to countries and their subdivisions. However, it does not define the names of countries – this information comes from United Nations sources (Terminology Bulletin Country Names and the Country and Region Codes for Statistical Use maintained by the United Nations Statistics Divisions).

Using codes saves time and avoids errors as instead of using a country’s name (which will change depending on the language being used), we can use a combination of letters and/or numbers that are understood all over the world. ...

The above content is excerpted from https://www.iso.org/iso-3166-country-codes.html. Since the names of countries (or other types of regional division) may cause unknown issues, using letter codes as file names can eliminate many additional responsibilities. Maintaining fields in files (or even a version of an ISO 3166 mapping table) is easier than maintaining file names, and we can focus more on labelling than on naming and disputes over regions. In addition, the Country of Origin fields on dbdb.io (e.g. https://www.dbdb.io/db/mariadb) also use ISO 3166-1 alpha-2. I collected the features here. It is convenient to label and filter multi-labelled records with alpha-2 codes.
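As a rough illustration of labelling and filtering with alpha-2 codes, the sketch below keeps a small code-to-name mapping table and filters records that may carry more than one code. The record layout, the `countries` key, the `example_multi` entry, and the display names are all assumptions for illustration; only the jina_ai and openresty countries come from this thread, and the table is a tiny subset of ISO 3166-1.

```python
# Illustrative subset of ISO 3166-1 alpha-2 codes; display names are simplified.
ALPHA2_TO_NAME = {
    "CN": "China",
    "DE": "Germany",
    "KR": "Korea, Republic of",
    "US": "United States",
}

# Hypothetical records; "example_multi" is made up to show multiple labels.
RECORDS = [
    {"name": "jina_ai", "countries": ["DE"]},
    {"name": "openresty", "countries": ["US"]},
    {"name": "example_multi", "countries": ["CN", "US"]},
]


def filter_by_country(records, code):
    """Return the names of records labelled with the given alpha-2 code."""
    code = code.upper()
    if code not in ALPHA2_TO_NAME:
        raise KeyError(f"unknown alpha-2 code: {code}")
    return [record["name"] for record in records if code in record["countries"]]


print(filter_by_country(RECORDS, "us"))
```

Because the code is the key, a country's display name can change in the mapping table without renaming any file or touching any record.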

PureNatural commented 1 year ago

I agree with @birdflyi. We should focus more on labeling than naming and disputes of regions.

And for all the name fields in the YAML files, we should use exactly the short English name of each country.

If someone wants to know the name of a country, they can get it in the YAML file.

frank-zsy commented 1 year ago

I think the result is quite good now, is this PR ready to merge? @PureNatural

PureNatural commented 1 year ago

Yes, it can be merged.

frank-zsy commented 1 year ago

/approve

However, this PR may still require other changes, for example to the cron task, which depends on the region label data.