TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project
4 stars 6 forks source link

LABEL_INVALID_CHAR & MALFORMED_URL improvements #641

Closed yakra closed 2 months ago

yakra commented 2 months ago

https://github.com/TravelMapping/RailwayData/pull/187#issuecomment-2323107288 This does run a tiny tiny bit faster on BiggaTomato. Not making any pretty pictures this time though. :) There will be another pull request in the Web repo to accompany this one.


This ended up being a bit of a rabbit hole. In a good way.

Among the invalid characters prohibited from labels are control characters. The chances of these appearing in a .wpt are pretty low, but I decided to go the extra mile to handle them just in case. We can't put these in an ErrorList message because attempting to print them to the terminal could make stuff go boom. Truncating the label or skipping or replacing the offending characters sounds good at first, but that can leave us with no info on what the offending label is or where to find it in the file. For a more robust solution, I opted to show the offset in .wpt file. If the :poop: hits the fan, load the file into your favorite hex editor and have fun. It works, even if it's not convenient, and was pretty simple to code.

Doing this involved moving the LABEL_INVALID_CHAR check into the Waypoint constructor, where we have the info needed to calculate the offset on hand. This is where things got interesting.

Looking at the results of the 1st draft, I saw: diff _37c5/logs/datacheck.log logs/datacheck.log

1c1
< Log file created at: Tue Sep 17 09:19:51 2024
---
> Log file created at: Tue Sep 17 09:20:17 2024
9424a9425,9426
> mexbc.mex003ens;#name;;;LABEL_INVALID_CHAR;
> mexbc.mex003ens;#sign;;;LABEL_INVALID_CHAR;
9427a9430,9431
> mexbc.mex003ens;http://www.openstreetmap.org/?lat=31.902864&lon=-116.541553;;;LABEL_INVALID_CHAR;
> mexbc.mex003ens;http://www.openstreetmap.org/?lat=31.911776&lon=-116.610442;;;LABEL_INVALID_CHAR;

Let's take a look at the file:

MEX3_S http://www.openstreetmap.org/?lat=31.878706&lon=-116.525202
AveCal http://www.openstreetmap.org/?lat=31.902864&lon=-116.541553 #name assumed from Google Maps
+X958691 http://www.openstreetmap.org/?lat=31.914285&lon=-116.551487
+X463217 http://www.openstreetmap.org/?lat=31.921661&lon=-116.581421
Ruiz http://www.openstreetmap.org/?lat=31.911776&lon=-116.610442 #sign says Entronque Ruiz AKA Ruiz Junction
+X892185 http://www.openstreetmap.org/?lat=31.898583&lon=-116.620055
+X413693 http://www.openstreetmap.org/?lat=31.891842&lon=-116.647296
+X278091 http://www.openstreetmap.org/?lat=31.898273&lon=-116.668088
MEX3_N http://www.openstreetmap.org/?lat=31.920022&lon=-116.682873

Of course, comments are not supported in .wpt files. So what's really happening here?

...And sure enough, in datacheck.log: mexbc.mex003ens;AveCal;;;MALFORMED_URL;MISSING_ARG(S) Maybe not the most helpful Info text. I could see someone asking, "What the hell? It's a perfectly fine URL, nothing missing!" So I've changed MALFORMED_URL to read like: mexbc.mex003ens;AveCal;;;MALFORMED_URL;Maps mexbc.mex003ens;Ruiz;;;MALFORMED_URL;Junction ...More helpful to see what it is that siteupdate's trying to parse as the URL. It'll only say "MISSING_ARG(S)" now if the string is too long to fit in the DB, or has characters that shouldn't go in it like backslashes, quotes or control chars.

Why weren't the new errors flagged before? Before, there was a Waypoint::label_invalid_char() function, called on complete Waypoint objects. Problem was, without a valid URL, there's no way to get valid coords, so the line gets thrown out, Waypoint never created. Now? Errors get flagged as label strings get validated right from the original wptdata buffer, before the URL is parsed, before a Waypoint is created/thrown out.

Altogether, the new and changed error reports should give a better indication what's going on with mexbc.mex003ens.

jteresco commented 2 months ago

This all sounds good to me. Should we go live?

yakra commented 2 months ago

Yes! Let's go live before I get any more crazy ideas like making a string out of the invalid chars to go in the Info column. Don't forget https://github.com/TravelMapping/Web/pull/797. While I have your attention, any thoughts on https://github.com/TravelMapping/DataProcessing/pull/640#issuecomment-2359975548?

yakra commented 2 months ago

@michihdeu referenced this pull request in https://github.com/TravelMapping/RailwayData/commit/a31f435952c05057889b57c1d9979e57288eb720

It appears I was tagged there, but for some reason my browser isn't rendering the page. :( What did I miss?