ckan / ckanext-xloader

Express Loader - quickly load data into DataStore. A replacement for DataPusher.
GNU Affero General Public License v3.0

Japanese kanji in CSV header automatically replaced with Chinese pinyin #145

Open aruneko opened 2 years ago

aruneko commented 2 years ago

When I uploaded a CSV file to the DataStore via XLoader, I noticed that all Japanese kanji in the CSV header were replaced with their Chinese pinyin readings.

For example, when XLoader loads this CSV file:

id,緯度,経度
1,35.6824572,139.763119
2,35.6733858,139.7506726

we get this table:

id | Wei Du | Jing Du
1 | 35.6824572 | 139.763119
2 | 35.6733858 | 139.7506726

Note: in Japanese, the pronunciation of 緯度 is /ido/ and the pronunciation of 経度 is /ke:do/, so these transliterations are quite far from the Japanese readings.

But I want to get a table like this instead; the headers should not be replaced:

id | 緯度 | 経度
1 | 35.6824572 | 139.763119
2 | 35.6733858 | 139.7506726

I know unidecode does this in this function, but I don't have a good idea for how to avoid transliterating kanji in a CSV header.
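For reference, the transliteration can be reproduced by calling the unidecode library directly (this is just the library, not xloader code):

```python
from unidecode import unidecode

# Each kanji is transliterated via its Mandarin pinyin reading,
# not its Japanese reading.
print(unidecode('緯度'))  # "Wei Du", not the Japanese /ido/
print(unidecode('経度'))  # "Jing Du", not /ke:do/
```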

KatiRG commented 1 year ago

The application of unidecode() to CSV headers here https://github.com/ckan/ckanext-xloader/blob/master/ckanext/xloader/loader.py#L355 will also strip French accents, which is not good for our French users. The PyPI docs recommend:

It’s best to not use Unidecode for strings that are directly visible to users of your application.

Why is unidecode() applied by default in xloader?
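For example, calling unidecode directly on some illustrative French header names (these are not from a real dataset):

```python
from unidecode import unidecode

# Accents are silently dropped from the header names.
print(unidecode('Région'))  # "Region"
print(unidecode('Numéro'))  # "Numero"
```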

wardi commented 1 year ago

It's being used because the headers become postgres column names in the datastore database, and those can be difficult to access via the API for users who aren't used to entering non-ASCII text.

For preview it is possible to add the original names to the data dictionary, but that's an extra step right now.

We should experiment with not converting these column headings, or only filtering specific characters known to cause problems. Have you tried removing this conversion to see if these column names cause any problems?

KatiRG commented 1 year ago

Thanks @wardi, yes I have experimented with removing unidecode() on the header names, and this preserves the original names, which is good for us.

aimalkhan commented 1 year ago

@wardi

It's being used because these are used as postgres column names in the datastore database, which can be difficult to access for users not used to entering non-ascii text via the API.

What are some examples of this difficulty in access?

Have you tried removing this conversion to see if these column names cause any problems?

For the test resources we are using, there are no issues. However, there could be issues with other characters not yet encountered. Is there a test dataset that has been used previously (such as one that would have led to the decision to use unidecode()) that we can test with?

wardi commented 1 year ago

The difficulty I'm thinking of is for people without input methods set up to enter non-ASCII characters reliably, and the ambiguity of some Unicode characters. It's not a blocker, just a possible impediment.

If we update our installation docs to make sure that the databases are created with a UTF-8 encoding, then we should be able to use any language in the column names that postgres supports, and I think removing unidecode (or putting it behind a configuration option) would be a great change for this repo.

For testing names we have to follow postgres restrictions for column naming and be aware of case-folding behavior.
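Concretely, the case-folding and quoting behavior looks like this with psycopg2 (a sketch only; the connection string, table and column names are placeholders, not project code):

```python
import psycopg2
from psycopg2 import sql

# Placeholder connection string; adjust for a real datastore database.
conn = psycopg2.connect('postgresql://ckan_default:pass@localhost/datastore_default')
cur = conn.cursor()

# Unquoted identifiers are case-folded to lower case by Postgres,
# so "SELECT Latitude ..." and "SELECT latitude ..." hit the same column.
cur.execute('SELECT latitude FROM my_table')

# Quoting an identifier (sql.Identifier adds the double quotes)
# preserves case and allows any UTF-8 text, provided the database
# was created with a UTF-8 encoding.
cur.execute(sql.SQL('SELECT {} FROM my_table').format(sql.Identifier('緯度')))
```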

aimalkhan commented 1 year ago

We can make a PR to make the unidecode() use here, headers = [unidecode(header) for header in headers], controllable from the .ini file, with the option turned on by default (see the sketch at the end of this comment). Will that suffice as an acceptable change?

update our installation docs to make sure that the databases are created with a UTF-8 encoding

The documentation already mentions and provides steps for using UTF-8 encoding, so that is taken care of.

column naming and be aware of case-folding

AFAIK the postgres naming restrictions are already being followed, and the case-folding behavior won't change with the proposed PR.
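A minimal sketch of what the proposed toggle could look like (the option name ckanext.xloader.unidecode_headers is hypothetical, not an existing setting):

```python
from ckan.plugins import toolkit
from unidecode import unidecode

def clean_headers(headers):
    # Transliterate header names only when the (hypothetical)
    # option is enabled; defaulting to True here would preserve
    # today's behavior for existing installs.
    if toolkit.asbool(
            toolkit.config.get('ckanext.xloader.unidecode_headers', True)):
        headers = [unidecode(header) for header in headers]
    return headers
```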

wardi commented 1 year ago

@aimalkhan sounds good, but we'll need to check that the column name will be valid as a postgres column name (at a minimum no " characters, but I'm sure the rules are more strict than that).

@jqnatividad you did some work on code that validated column names for datapusher+ recently, right? Maybe we can borrow that.

@aimalkhan let's also disable unidecode by default instead of enabling it, and include a note about this in the changelog or readme.
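Roughly the kind of check I mean (a hypothetical sketch, not existing code in xloader or datapusher+):

```python
def is_valid_datastore_column(name):
    # Postgres identifiers are limited to 63 bytes (NAMEDATALEN - 1).
    if not name or len(name.encode('utf-8')) > 63:
        return False
    # Quoted identifiers may contain anything except NUL; we also
    # reject '"' so names never need escaping.
    if '\x00' in name or '"' in name:
        return False
    # CKAN's DataStore reserves names starting with "_"
    # (e.g. the special "_id" column).
    if name.startswith('_'):
        return False
    return True
```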

jqnatividad commented 1 year ago

Yes @wardi, in DP+, especially since it supports spreadsheets, we found a lot of users were uploading worksheets with column header names that were not valid PostgreSQL column identifiers.

That's why I created qsv's safenames command that Datapusher+ leverages.

It guarantees that only "safe names" are created: it sanitizes column header names so that they are valid Postgres identifiers, takes additional CKAN restrictions into account (columns that start with "_" and the special "_id" column), and ensures that duplicate column names are handled as well.
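In Python terms the sanitization is roughly the following (a sketch based only on the description above; qsv itself is written in Rust and its exact rules differ):

```python
import re

def make_safe_names(headers):
    seen = {}
    safe = []
    for name in headers:
        # Keep Unicode letters, digits and underscores; replace runs
        # of anything else with "_"; drop leading/trailing "_" (CKAN
        # reserves names starting with "_", e.g. "_id").
        candidate = re.sub(r'\W+', '_', name).strip('_') or 'unnamed'
        # Handle duplicate names by appending a counter.
        n = seen.get(candidate, 0)
        seen[candidate] = n + 1
        safe.append(candidate if n == 0 else f'{candidate}_{n}')
    return safe

print(make_safe_names(['id', '緯度', '緯度', 'lat/long']))
# -> ['id', '緯度', '緯度_1', 'lat_long']
```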

Further, DP+ adds the original name of each column to the Data Dictionary as its label. The labels, rather than the sanitized names, are used automatically when viewing the resource with the DataTables view. That way you still see the original header names when viewing the dataset, while the "safe", machine-readable header names are the ones used by the Datastore API and when exporting the data.

While we're on the topic of aliases: DP+ also has an "auto-alias" option that automatically creates a human-readable resource alias (by default org-packagename-resourcename, with optional collision detection) that you can use instead of the resource id hash in CKAN API calls and in SQL queries (a CKAN resource alias is really just a PostgreSQL view).

https://github.com/dathere/datapusher-plus/blob/3a3c69687f16973cbdfa9f52a4d6db11ac48f908/datapusher/dot-env.template#L132-L143

@aimalkhan, if you're so inclined, I encourage you to try out DP+, as I'd love to get it exercised with more languages. I took care to make sure it supports UTF-8 properly, and I have some folks in Europe, China and Japan testing it; I'd love to see more.

FYI, while we were at OpenGov we sponsored xloader, as the original datapusher was the largest source of post-installation issues for CKAN. Since then, DP+ has combined the best of datapusher and xloader, and I'd be interested in your feedback.

wardi commented 1 year ago

@jqnatividad for the alphanumeric test, are you using the Unicode categories L* and N*?
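In other words, something equivalent to this Python check (illustrative only, not code from either project):

```python
import unicodedata

def is_unicode_alphanumeric(char):
    # category() returns two-letter codes such as 'Lo' (Letter,
    # other) for a kanji or 'Nd' (Number, decimal digit) for '5'.
    return unicodedata.category(char)[0] in ('L', 'N')

print(is_unicode_alphanumeric('緯'))  # True  ('Lo')
print(is_unicode_alphanumeric('5'))   # True  ('Nd')
print(is_unicode_alphanumeric('"'))   # False ('Po')
```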

jqnatividad commented 1 year ago

@wardi I have no specific test for those Unicode categories, but since Rust strings are UTF-8, I "think" it's something we get for free.

Do you have specific test cases I can try, to check support for those categories?