Open dancesWithCycles opened 3 months ago
Side note: We've had Unicode-supporting tools for a long time now, and given that GTFS inherently has an international scope, I think the GTFS Schedule spec should define that the charset should be UTF-8.
@derhuerst To clarify, it looks like it's already a "should" within the spec under File Requirements: Files should be encoded in UTF-8 to support all Unicode characters.
@dancesWithCycles The validator currently relies only on UTF-8 encoding. Can you clarify why you're using us-ascii?
Hi folks, thanks for the clarification!
I am using this tool to truncate a multi-agency GTFS feed to a GTFS feed with a single agency:
head -n5 ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg/agency.txt
agency_id,agency_name,agency_url,agency_timezone,agency_lang,agency_phone
231,Braunschweiger Verkehrs-GmbH,http://www.verkehr-bs.de/,Europe/Berlin,de,+49 531 28639555
Unfortunately, the resulting single-agency GTFS feed, produced using this
java -jar -Xms20G -server ./target/onebusaway-gtfs-transformer-cli.jar --transform=./mblthk-cnnct-dhid-gtfs-bsvg.txt ~/Downloads/mblthk-cnnct-dhid-gtfs ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg
uses a character encoding other than UTF-8. I would love to learn about a different truncation tool that reduces a multi-agency GTFS feed to a single-agency GTFS feed while keeping the UTF-8 character encoding intact, so that the result stays compatible with the GTFS validator.
Cheers!
A few of the details here do not seem to add up. The actual error reported by the validator is java.lang.NullPointerException - Cannot invoke "java.io.InputStream.read()" because "this.in" is null, and this seems to be because the validator was asked to validate ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg.zip, whereas the OBA GTFS transformer was asked to write output to ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg (which means it will produce a directory of loose CSV files, not a Zip!).
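One way to rule this kind of mismatch in or out before running the validator is to check what the input path actually points at. The following is only a minimal sketch: the helper name describe_feed_path is hypothetical and is not part of the validator or the OBA transformer.

```python
import zipfile
from pathlib import Path


def describe_feed_path(path):
    """Classify a GTFS input path as 'missing', 'directory', 'zip', or 'other'.

    Hypothetical helper for illustration only.
    """
    p = Path(path).expanduser()
    if not p.exists():
        return "missing"
    if p.is_dir():
        # A directory of loose CSV files, as written by the transformer
        # when the output path has no .zip suffix.
        return "directory"
    if zipfile.is_zipfile(str(p)):
        return "zip"
    return "other"
```

If the suspicion above is right, calling it on ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg would return "directory", and calling it on ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg.zip would return "missing".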
Without a Byte Order Mark present (which would, in fact, be unusual to find in a UTF-8-encoded file), there is no apparent difference between a plain-ASCII file and one encoded in UTF-8, so long as the file contains only ASCII characters (more precisely, characters in the Unicode Basic Latin block). In other words, the only way the file utility knows that a file is UTF-8 is because it sees UTF-8-encoded characters. In this case, the operation to remove three agencies also removes the only character in the agency.txt file outside the Unicode Basic Latin block (an ü), and so file rightly concludes that it contains ASCII text (as in the output which I have quoted below). (Given the nature of UTF-8, such a file is inherently also valid as a UTF-8-encoded file.)
file -i ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg/agency.txt
./agency.txt: text/csv; charset=us-ascii
Original agency.txt (note ü):
"agency_id","agency_name","agency_url","agency_timezone","agency_lang","agency_phone"
66,"Stadtwerke Verkehrsbetriebe Wilhelmshaven GmbH","https://swwv.de/","Europe/Berlin","de","+49 4421 291257"
81,"Stadtwerke Osnabrück AG - Verkehrsbetriebe","http://www.stadtwerke-osnabrueck.de","Europe/Berlin","de","+49 541 20022211"
106,"Verkehr und Wasser GmbH (VWG)","http://www.vwg.de/","Europe/Berlin","de","+49 441 93660"
121,"Delmenhorst-Harpstedter Eisenbahn GmbH","http://www.dhe-reisen.de/","Europe/Berlin","de","+49 4244 93550"
Transformed agency.txt (note absence of any characters outside the Unicode Basic Latin block):
agency_id,agency_name,agency_url,agency_timezone,agency_lang,agency_phone
231,Braunschweiger Verkehrs-GmbH,http://www.verkehr-bs.de/,Europe/Berlin,de,+49 531 28639555
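The relationship between the two excerpts can be checked directly. The sketch below takes one line from each agency.txt quoted above, encodes both as UTF-8, and tests whether the resulting bytes are also valid ASCII. The helper is_pure_ascii is a name made up for this illustration; it is not how file itself detects encodings.

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte in data decodes as ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


# One line from the transformed agency.txt (Basic Latin characters only).
transformed = (
    "231,Braunschweiger Verkehrs-GmbH,http://www.verkehr-bs.de/,"
    "Europe/Berlin,de,+49 531 28639555\n"
).encode("utf-8")

# One line from the original agency.txt (contains an ü).
original = (
    '81,"Stadtwerke Osnabrück AG - Verkehrsbetriebe",'
    '"http://www.stadtwerke-osnabrueck.de","Europe/Berlin","de",'
    '"+49 541 20022211"\n'
).encode("utf-8")

print(is_pure_ascii(transformed))  # True
print(is_pure_ascii(original))     # False
# The ASCII-only bytes decode to the same text under either encoding.
print(transformed.decode("ascii") == transformed.decode("utf-8"))  # True
```

The same bytes are both a valid ASCII file and a valid UTF-8 file, so a us-ascii report for the transformed file does not show that the transformer changed the encoding.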
To make a long story short, I think (without actually having these files at hand to inspect) that the matter of character encoding is a red herring. It is odd that the validator does not provide a more specific error when it cannot open the source file, but given the error message reported and the fact that the paths in the excerpts in the comment above do not match, I strongly suspect that the path mismatch was the underlying problem.
I have just tested round-tripping the STM's GTFS through the OBA GTFS transformer, and indeed it properly preserves characters outside the Unicode Basic Latin block. More to the point, inspection of the code in onebusaway-csv-entities for writing to loose files as well as to a Zip archive shows that in both cases files are written as UTF-8.
@kurtraschke Thank you for digging into this issue on the OBA side! From your investigation, it's clear this isn't an issue with encoding with the GTFS transformer. We do have a specific error message in the case of an invalid ZIP file. It looks like this is a case we don't sufficiently address. @dancesWithCycles, we haven't been able to reproduce your issue when we validate a folder.
For next steps:
Describe the bug
I truncate a GTFS feed with multiple agencies to a GTFS feed with a single agency using this instruction.
At the end I am calling the validator on the truncated GTFS feed like this and get a reply like this.
System errors:
JSON report:
I am wondering, does the validator rely on a specific charset?
The original GTFS feed can be validated with utf-8 as charset.
The truncated GTFS feed cannot be validated with us-ascii charset.
Steps/Code to Reproduce
Expected Results
I expected a validation report for the truncated GTFS feed, the same way I got a validation report for the original GTFS feed.
Actual Results
System errors:
JSON report:
Screenshots
No response
Files used
No response
Validator version
4.2.0
Operating system
Debian 12
Java version
openjdk version "17.0.10" 2024-01-16
Additional notes
No response