Question: Does the validator rely on a specific charset?

dancesWithCycles commented 3 months ago

Describe the bug

I truncate a GTFS feed with multiple agencies

head -n5 ~/Downloads/mblthk-cnnct-dhid-gtfs-dir/agency.txt 
"agency_id","agency_name","agency_url","agency_timezone","agency_lang","agency_phone"
66,"Stadtwerke Verkehrsbetriebe Wilhelmshaven GmbH","https://swwv.de/","Europe/Berlin","de","+49 4421 291257"
81,"Stadtwerke Osnabrück AG - Verkehrsbetriebe","http://www.stadtwerke-osnabrueck.de","Europe/Berlin","de","+49 541 20022211"
106,"Verkehr und Wasser GmbH (VWG)","http://www.vwg.de/","Europe/Berlin","de","+49 441 93660"
121,"Delmenhorst-Harpstedter Eisenbahn GmbH","http://www.dhe-reisen.de/","Europe/Berlin","de","+49 4244 93550"

to a GTFS feed with a single agency

head -n5 ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg/agency.txt 
agency_id,agency_name,agency_url,agency_timezone,agency_lang,agency_phone
231,Braunschweiger Verkehrs-GmbH,http://www.verkehr-bs.de/,Europe/Berlin,de,+49 531 28639555

using this command:

java -jar -Xms20G -server ./target/onebusaway-gtfs-transformer-cli.jar --transform=./mblthk-cnnct-dhid-gtfs-bsvg.txt ~/Downloads/mblthk-cnnct-dhid-gtfs ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg

At the end I am calling the validator on the truncated GTFS feed like this

java -jar ~/Downloads/gtfs-validator-4.2.0-cli.jar -i ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg.zip -o ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg-output

and get a reply like this.

Mar 13, 2024 8:54:10 AM org.mobilitydata.gtfsvalidator.runner.ValidationRunner printSummary
INFO: Validation took 0.557 seconds

Mar 13, 2024 8:54:10 AM org.mobilitydata.gtfsvalidator.runner.ValidationRunner printSummary
INFO: agency.txt    INVALID_HEADERS
calendar.txt    INVALID_HEADERS
calendar_dates.txt  INVALID_HEADERS
routes.txt  INVALID_HEADERS
shapes.txt  INVALID_HEADERS
stop_times.txt  INVALID_HEADERS
stops.txt   INVALID_HEADERS
trips.txt   INVALID_HEADERS

System errors:

cat mblthk-cnnct-dhid-gtfs-bsvg-output/system_errors.json 
{"notices":[]}

JSON report:

[{"code":"csv_parsing_failed","severity":"ERROR","totalNotices":8,"sampleNotices":[{"filename":"calendar.txt","charIndex":0,"columnIndex":0,"lineIndex":0,"message":"java.lang.NullPointerException - Cannot invoke \"java.io.InputStream.read()\" because \"this.in\" is null\nParser Configuration: CsvParserSettings:\n\tAuto configuration enabled\u003dtrue\n\tAuto-closing enabled\u003dtrue\n\tAutodetect column delimiter\u003dfalse\n\tAutodetect quotes\u003dfalse\n\tColumn reordering enabled\u003dtrue\n\tDelimiters for detection\u003dnull\n\tEmpty value\u003dnull\n\tEscape unquoted values\u003dfalse\n\tHeader extraction enabled\u003dtrue\n\tHeaders\u003dnull\n\tIgnore leading whitespaces\u003dtrue\n\tIgnore leading whitespaces in quotes\u003dfalse\n\tIgnore trailing whitespaces\u003dtrue\n\tIgnore trailing whitespaces in quotes\u003dfalse\n\tInput buffer size\u003d1048576\n\tInput reading on separate thread\u003dtrue\n\tKeep escape sequences\u003dfalse\n\tKeep quotes\u003dfalse\n\tLength of content displayed on error\u003d-1\n\tLine separator detection enabled\u003dfalse\n\tMaximum number of characters per column\u003d4096\n\tMaximum number of columns\u003d512\n\tNormalize escaped line separators\u003dtrue\n\tNull value\u003dnull\n\tNumber of records to read\u003dall\n\tProcessor\u003dnone\n\tRestricting data in exceptions\u003dfalse\n\tRowProcessor error handler\u003dnull\n\tSelected fields\u003dnone\n\tSkip bits as whitespace\u003dtrue\n\tSkip empty lines\u003dtrue\n\tUnescaped quote handling\u003dnullFormat configuration:\n\tCsvFormat:\n\t\tComment character\u003d#\n\t\tField delimiter\u003d,\n\t\tLine separator (normalized)\u003d\\n\n\t\tLine separator sequence\u003d\\n\n\t\tQuote character\u003d\"\n\t\tQuote escape character\u003d\"\n\t\tQuote escape escape character\u003dnull\nInternal state when error was thrown: line\u003d0, column\u003d0, record\u003d0","parsedContent":""},
...

I am wondering, does the validator rely on a specific charset?

The original GTFS feed can be validated with utf-8 as charset.

file -i ~/Downloads/mblthk-cnnct-dhid-gtfs-dir/agency.txt 
./agency.txt: text/csv; charset=utf-8

The truncated GTFS feed cannot be validated with us-ascii as charset.

file -i ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg/agency.txt 
./agency.txt: text/csv; charset=us-ascii

Steps/Code to Reproduce

java -jar -Xms20G -server ./target/onebusaway-gtfs-transformer-cli.jar --transform=./mblthk-cnnct-dhid-gtfs-bsvg.txt ~/Downloads/mblthk-cnnct-dhid-gtfs ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg
java -jar ~/Downloads/gtfs-validator-4.2.0-cli.jar -i ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg.zip -o ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg-output

Expected Results

I expected a validation report for the truncated GTFS feed the same way I got a validation report for the original GTFS feed.

Actual Results

The same system_errors.json content and JSON report as shown above under "Describe the bug".

Screenshots

No response

Files used

No response

Validator version

4.2.0

Operating system

Debian 12

Java version

openjdk version "17.0.10" 2024-01-16

Additional notes

No response

derhuerst commented 3 months ago

Side note: We've had Unicode-supporting tools for a long time now, and given that GTFS inherently has an international scope, I think it should be defined in the GTFS Schedule spec that the charset should be UTF-8.

emmambd commented 3 months ago

@derhuerst To clarify, it looks like it's already a "should" within the spec under File Requirements: Files should be encoded in UTF-8 to support all Unicode characters.

@dancesWithCycles The validator currently relies on UTF-8 encoding only. Can you clarify why you're using us-ascii?
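One way to check whether the files of an unzipped feed decode as UTF-8 is sketched below. This is only an illustration and not the validator's own logic; the helper name `find_non_utf8_files` and the assumption that the feed is a directory of loose `.txt` files are mine.

```python
from pathlib import Path


def find_non_utf8_files(feed_dir):
    """Return the names of .txt files in feed_dir that are not valid UTF-8.

    Hypothetical helper for illustration; not part of gtfs-validator.
    """
    bad = []
    for path in sorted(Path(feed_dir).glob("*.txt")):
        try:
            # Strict decoding raises on any invalid byte sequence.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad
```

An empty list means every file decodes as UTF-8, which includes files that contain only ASCII characters.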

dancesWithCycles commented 3 months ago

Hi folks, thanks for the clarification!

I am using this tool to truncate a multi-agency GTFS feed

to a GTFS feed with a single agency

head -n5 ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg/agency.txt 
agency_id,agency_name,agency_url,agency_timezone,agency_lang,agency_phone
231,Braunschweiger Verkehrs-GmbH,http://www.verkehr-bs.de/,Europe/Berlin,de,+49 531 28639555

Unfortunately, the resulting single-agency GTFS feed, produced using this command:

java -jar -Xms20G -server ./target/onebusaway-gtfs-transformer-cli.jar --transform=./mblthk-cnnct-dhid-gtfs-bsvg.txt ~/Downloads/mblthk-cnnct-dhid-gtfs ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg

uses a character encoding other than UTF-8. I would love to hear about a different truncation tool that reduces a multi-agency GTFS feed to a single-agency GTFS feed while keeping the UTF-8 character encoding intact, so that the result stays compatible with the GTFS validator.

Cheers!

kurtraschke commented 2 months ago

A few of the details here do not seem to add up—the actual error reported by the validator is java.lang.NullPointerException - Cannot invoke "java.io.InputStream.read()" because "this.in" is null, and this seems to be because the validator was asked to validate ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg.zip, whereas the OBA GTFS transformer was asked to write output to ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg (which means it will produce a directory of loose CSV files, not a Zip!).
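The path mismatch described above can be checked before running the validator. The following is a minimal sketch; `describe_feed_input` is a hypothetical helper and not part of the validator or the OBA transformer.

```python
import zipfile
from pathlib import Path


def describe_feed_input(path):
    """Classify a feed input path as 'missing', 'directory', 'zip', or 'other'.

    Hypothetical helper for illustration; not part of either tool.
    """
    p = Path(path).expanduser()
    if not p.exists():
        return "missing"
    if p.is_dir():
        return "directory"
    if zipfile.is_zipfile(p):
        return "zip"
    return "other"
```

If the transformer wrote a directory and no archive was created afterwards, calling this on the `.zip` path passed to the validator would report `missing`, while the path without the `.zip` suffix would report `directory`.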

Without a Byte Order Mark present (which would, in fact, be unusual to find in a UTF-8-encoded file), there is no apparent difference between a plain-ASCII file and one encoded in UTF-8, so long as the file contains only ASCII characters (more precisely, characters in the Unicode Basic Latin block). In other words, the only way file knows that a file is UTF-8 is because it sees UTF-8-encoded characters. In this case, the operation to remove three agencies also removes the only character in the agency.txt file outside the Unicode Basic Latin block (an ü), and so file rightly concludes that it contains ASCII text (as in the output which I have quoted below). (Given the nature of UTF-8, such a file is inherently also valid as a UTF-8-encoded file.)

file -i ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg/agency.txt 
./agency.txt: text/csv; charset=us-ascii
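The point about ASCII being a subset of UTF-8 can be illustrated at the byte level. This sketch only demonstrates the encoding property and does not reproduce how `file` detects charsets; the sample strings are shortened from the agency.txt excerpts in this thread.

```python
# Rows shortened from the agency.txt excerpts in this thread.
ascii_row = "231,Braunschweiger Verkehrs-GmbH,http://www.verkehr-bs.de/"
umlaut_row = '81,"Stadtwerke Osnabr\u00fcck AG - Verkehrsbetriebe"'

# An ASCII-only string produces identical bytes under both encodings.
same_bytes = ascii_row.encode("ascii") == ascii_row.encode("utf-8")

# U+00FC (u with diaeresis) lies outside Basic Latin: two bytes in UTF-8.
umlaut_bytes = "\u00fc".encode("utf-8")


def is_ascii_only(data):
    """Return True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


ascii_only_flags = (
    is_ascii_only(ascii_row.encode("utf-8")),
    is_ascii_only(umlaut_row.encode("utf-8")),
)
```

Only the row containing the umlaut has bytes with the high bit set, which is the only kind of evidence a byte-level detector can use to tell UTF-8 from plain ASCII.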

Original agency.txt (note ü):

"agency_id","agency_name","agency_url","agency_timezone","agency_lang","agency_phone"
66,"Stadtwerke Verkehrsbetriebe Wilhelmshaven GmbH","https://swwv.de/","Europe/Berlin","de","+49 4421 291257"
81,"Stadtwerke Osnabrück AG - Verkehrsbetriebe","http://www.stadtwerke-osnabrueck.de","Europe/Berlin","de","+49 541 20022211"
106,"Verkehr und Wasser GmbH (VWG)","http://www.vwg.de/","Europe/Berlin","de","+49 441 93660"
121,"Delmenhorst-Harpstedter Eisenbahn GmbH","http://www.dhe-reisen.de/","Europe/Berlin","de","+49 4244 93550"

Transformed agency.txt (note absence of any characters outside the Unicode Basic Latin block):

agency_id,agency_name,agency_url,agency_timezone,agency_lang,agency_phone
231,Braunschweiger Verkehrs-GmbH,http://www.verkehr-bs.de/,Europe/Berlin,de,+49 531 28639555

To make a long story short, I think (without actually having these files at hand to inspect) that the matter of character encoding is a red herring. It is odd that the validator does not provide a more specific error when it cannot open the source file, but given the error message reported and the fact that the paths in the excerpts in the comment above do not match, I strongly suspect that was the underlying problem.

I have just tested round-tripping the STM's GTFS through the OBA GTFS transformer, and indeed it properly preserves characters outside the Unicode Basic Latin block. More to the point, inspection of the code in onebusaway-csv-entities for writing to loose files as well as to a Zip archive shows that in both cases files are written as UTF-8.
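A round-trip check of this kind can be sketched in a few lines. This is not the OBA transformer or the onebusaway-csv-entities code, only an illustration of the idea using Python's standard `csv` module; the function name `round_trip_utf8` is hypothetical.

```python
import csv
import os
import tempfile


def round_trip_utf8(rows):
    """Write rows to a temporary CSV file as UTF-8, then read them back."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8", newline="") as f:
            csv.writer(f).writerows(rows)
        with open(path, "r", encoding="utf-8", newline="") as f:
            return [row for row in csv.reader(f)]
    finally:
        # Clean up the temporary file whether or not reading succeeded.
        os.remove(path)
```

If the returned rows equal the input rows, characters outside the Basic Latin block survived the write and read.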

emmambd commented 2 months ago

@kurtraschke Thank you for digging into this issue on the OBA side! From your investigation, it's clear this isn't an issue with encoding with the GTFS transformer. We do have a specific error message in the case of an invalid ZIP file. It looks like this is a case we don't sufficiently address. @dancesWithCycles, we haven't been able to reproduce your issue when we validate a folder.

For next steps:

I'll close the issue on the OBA repo (since this problem is specific to the validator)
@dancesWithCycles, it would be great if you could provide a feed example so we can test it on our side.

MobilityData / gtfs-validator