TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project

Non-ASCII characters from updates.csv in log files #372

Open jteresco opened 3 years ago

jteresco commented 3 years ago

In #189, the issue came up of non-ASCII characters from updates.csv entries now getting into user .log files and not displaying properly, either when generated by Python or served by Apache on FreeBSD.

It seems the .log files should be generated with a UTF-8 encoding.
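A minimal sketch of what that could look like on the Python side, assuming the logs are written with the built-in `open()`. The helper name `write_user_log`, the temporary file path, and the sample line are hypothetical and not taken from the siteupdate code. Passing `encoding` explicitly avoids depending on the locale's default encoding:

```python
import os
import tempfile

def write_user_log(path, lines):
    """Hypothetical helper: write log lines as UTF-8, whatever the locale default is."""
    with open(path, "w", encoding="utf-8", newline="\n") as log:
        for line in lines:
            log.write(line + "\n")

# Illustrative sample line containing a non-ASCII character
sample = ["Route updated in Québec"]
log_path = os.path.join(tempfile.gettempdir(), "example_user.log")
write_user_log(log_path, sample)

# Read the raw bytes back to confirm what actually landed on disk
with open(log_path, "rb") as f:
    raw = f.read()
```

In UTF-8, "é" is stored as the two bytes `0xC3 0xA9`, so checking `raw` for that sequence is a quick way to confirm the file's encoding independently of how a browser or terminal renders it.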

yakra commented 3 years ago

> It seems the .log files should be generated with a UTF-8 encoding.

~I tried slapping a UTF-8 byte order mark at the beginning of a .log file, and serving it from yakra.teresco.org. It was a no go.~

> when generated by Python

Python & C++ flavored logs are the same.

diff <(tail -n +2 python-teresco/logs/users/$username.log) <(tail -n +2 cplusplus/logs/users/$username.log)

produces no output. Just to be sure, I'll run a Python-flavored site update on lab2 and edit this post. Edit: I see Baden-Württemberg. Python-flavored logs look good on CentOS.

> or served by Apache on FreeBSD

I think this is where the issue lies.

yakra commented 3 years ago

> It seems the .log files should be generated with a UTF-8 encoding.

If I slap a UTF-8 byte order mark at the beginning of the .log file, things look better, but not perfect: http://yakra.teresco.org/logs/UTF-8/duke87.orig.log http://yakra.teresco.org/logs/UTF-8/duke87.utf8.log I'm getting Montréal, Bécancour and Québec on CentOS as well. Note also that we have Montréal on A-20. Hopefully once I have a go at canqca_con.csv ~(also canqca.csv?)~ with a hex editor this will clear up too. Edit: See https://github.com/TravelMapping/HighwayData/pull/4377
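A sketch of how mismatched encodings can produce this kind of garbling, using only Python's built-in codecs. Which mismatch actually occurred in each file is an assumption here, not something confirmed in this thread:

```python
import codecs

name = "Montréal"

# UTF-8 bytes misread as Latin-1: "é" (0xC3 0xA9) turns into two characters
utf8_as_latin1 = name.encode("utf-8").decode("latin-1")
print(utf8_as_latin1)

# Latin-1 bytes misread as UTF-8: the lone 0xE9 byte is invalid UTF-8,
# so the decoder substitutes U+FFFD (the replacement character)
latin1_as_utf8 = name.encode("latin-1").decode("utf-8", errors="replace")
print(latin1_as_utf8)

# A UTF-8 byte order mark is the three bytes EF BB BF in front of the text;
# the "utf-8-sig" codec strips it again when decoding
with_bom = codecs.BOM_UTF8 + name.encode("utf-8")
print(with_bom[:3].hex(), with_bom.decode("utf-8-sig"))
```

A byte order mark only hints at the encoding of bytes that are already valid UTF-8. It cannot repair text that was double-encoded or stored as Latin-1 in the source .csv, which is why a hex editor is the right tool for checking the underlying bytes.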

yakra commented 3 years ago

@jteresco, what do you have for AddDefaultCharset in httpd.conf? It seems that setting can sometimes cause problems?
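One way to see what charset the server is actually declaring is to look at the `Content-Type` response header. The sketch below uses only the Python standard library; the function names `charset_from_content_type` and `served_charset` are hypothetical, and the header strings are made-up examples rather than output captured from any server in this thread:

```python
from email.message import Message
from urllib.request import urlopen

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

def served_charset(url):
    """Fetch a URL and report the charset the server declares for it (needs network)."""
    with urlopen(url) as response:
        return response.headers.get_content_charset()

# Illustrative header values only
declared = charset_from_content_type("text/plain; charset=ISO-8859-1")
undeclared = charset_from_content_type("text/plain")
print(declared, undeclared)
```

If the header declares a non-UTF-8 charset for a .log file whose bytes are UTF-8, the browser will decode it with the wrong codec even though the file itself is fine.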

yakra commented 3 years ago

Non-ASCII characters also appear in siteupdate.log when listing the commented-out systems.csv lines.