Datachecks for field lengths

jteresco commented 7 years ago

A system should be put in place where all entries from TM data files that end up in DB fields are checked for length to avoid problems like https://github.com/TravelMapping/Web/issues/195 .

A set of constants should be introduced to the site update program to make sure these values are consistent among all checks and when DB tables are created.

jteresco commented 6 years ago

More evidence that this should happen: http://tm.teresco.org/forum/index.php?topic=2358.0

jteresco commented 6 years ago

This is now more important, and should be combined with a truncation of entries that are too long, since MySQL 5.7 will not allow overlength fields to be added to a column. Increasing (possibly temporarily, possibly to be replaced with a TEXT field) the width of the description in the updates table.

yakra commented 5 years ago

This could be done during the Route constructor. el.add_error, and, yes, halt the siteupdate process. :)

yakra commented 5 years ago

Forgot about this one. I implemented at least part of it in C++: https://github.com/TravelMapping/DataProcessing/blob/0817bdd1341dc2fca907e5cd0e0465c9e488b45b/siteupdate/cplusplus/classes/Route/Route.cpp#L56-L59

yakra commented 5 years ago

MySQL 8 won't even insert the overlength values unto the tables, and the site update process halts. ERROR 1406 (22001) at line 1074: Data too long for column 'banner' at row 49532 fraidfd is commented out on lab2 until the HighwayData side is resolved.

yakra commented 5 years ago

The above means these should be handled at the ErrorList level of the siteupdate program, rather than create a DatacheckEntry that would need a successful site update in order to be shown on siteupdate.php.

yakra commented 5 years ago

C++

ABORTING due to 3 errors:
1: Abbrev >3 characters in usafl.csv line: [usafl;FL;FL44;;Leesburg;;fl.fl044;], Leesburg
2: Could not find Route matching root fl.fl044 in system usafl.
3: No roots in usafl_con.csv line [usafl;FL44;;Leesburg;fl.fl044]
Fri Sep  6 11:51:20 EDT 2019

The 2nd and 3rd errors happen because the Route constructor aborts, and then the HighwaySystem ctor does if (!route_list.back().is_valid()) route_list.pop_back();

yakra commented 4 years ago

http://forum.travelmapping.net/index.php?topic=3147.msg18099#msg18099

yakra commented 4 years ago

In the OP, @jteresco wrote:

A set of constants should be introduced to the site update program to make sure these values are consistent among all checks and when DB tables are created.

table	text field	proposed constant	current length
clinched	traveler	traveler	48
clinchedConnectedRoutes	route	root	32
clinchedConnectedRoutes	traveler	traveler	48
clinchedOverallMileageByRegion	region	regionCode	8
clinchedOverallMileageByRegion	traveler	traveler	48
clinchedRoutes	route	root	32
clinchedRoutes	traveler	traveler	48
clinchedSystemMileageByRegion	systemName	systemName	10
clinchedSystemMileageByRegion	region	regionCode	8
clinchedSystemMileageByRegion	traveler	traveler	48
connectedRouteRoots	firstRoot	root	32
connectedRouteRoots	root	root	32
connectedRoutes	systemName	systemName	10
connectedRoutes	route	route	16
connectedRoutes	banner	banner	6
connectedRoutes	groupName	?	100
connectedRoutes	firstRoot	root	32
continents	code	continentCode	3
continents	name	continentName	15
countries	code	countryCode	3
countries	name	countryName	32
datacheckErrors	route	root	32
datacheckErrors	label1	label	50
datacheckErrors	label2	label	20
datacheckErrors	label3	label	20
datacheckErrors	code	dcErrCode	22
datacheckErrors	value	dcErrValue	32
graphTypes	category	graphCategory	12
graphTypes	descr	graphDescr	100
graphTypes	longDescr	N/A	N/A
graphs	filename	graphFilename	32
graphs	descr	graphDescr	100
graphs	format	graphFormat	10
graphs	category	graphCategory	12
overallMileageByRegion	region	regionCode	8
regions	code	regionCode	8
regions	name	regionName	48
regions	country	countryCode	3
regions	continent	continentCode	3
regions	regiontype	regiontype	32
routes	systemName	systemName	10
routes	region	regionCode	8
routes	route	route	16
routes	banner	banner	6
routes	abbrev	abbrev	3
routes	city	?	100
routes	root	root	32
segments	root	root	32
systemMileageByRegion	systemName	systemName	10
systemMileageByRegion	region	regionCode	8
systemUpdates	date	date	10
systemUpdates	region	regionExtDescr regionLongDescr countryRegion or...?	48
systemUpdates	systemName	systemName	10
systemUpdates	description	systemFullName	128
systemUpdates	statusChange	statusChange	16
systems	systemName	systemName	10
systems	countryCode	countryCode	3
systems	fullName	systemFullName	60
systems	color	color	16
systems	level	level	10
updates	date	date	10
updates	region	regionExtDescr regionLongDescr countryRegion or...?	60
updates	route	routeLongName	80
updates	root	root	32
updates	description	updateText	1024
waypoints	pointName	label	20
waypoints	root	root	32

label datacheckErrors.label1 was expanded a bit after the VISIBLE_HIDDEN_COLOC datacheck was first introduced, which at the time used long graph vertex style labels. It's since been changed to list the waypoint label used in the flagged route, so this field's length can be reduced to match the other waypoint label fields. These are currently limited to 20 characters. I've uncovered an 18-character label so far, with a shell script still chugging away. While that one deserves a forum post, it's possible a contributor may use a longer label in a hidden waypoint for note-taking purposes. Which is legitimate, IMO. Reaching for the outer limits of what we may expect for a visible label, I come up with CHIH 1604 W AltBypCla(410B)_N, at 26 characters. (I can only hope that should we ever come across a "South Dear Leader His Majestic Royal Highness The First And Foremost Exalted Grandest Of The Grand Poobah Among Poobahs Sir Comrade Charles Christian 'Otto Yamamoto' Slater Junior Esquire Boulevard West" that needs a waypoint, we'd name it ChaSlaBlvd or OttoYamBlvd and not... uhh... yeah.) To be on the safe side & provide a bit of headroom, shall we expand the waypoint label fields (datacheckErrors.label1, datacheckErrors.label2, datacheckErrors.label3, waypoints.pointName) to 26 characters?
regionExtDescr, regionLongDescr, countryRegion... ? I'm not sure what to call this. Any preferences or suggestions? As seen in the updates and systemUpdates tables, these often take the form of "(countryName) regionName". countryName of 32 + regionName of 48 + 3 more chars for parentheses and a space = 83. Shall we expand the systemUpdates.region and updates.region to 83 characters?
route Not to be confused with where a DB field named route contains a root. (Maybe call it routeName?) 16 characters ought to be fine. The two 12-char examples I've seen (there may be others) appear to defy naming conventions anyway.
systemFullName systemUpdates.description contains full highway system names. Some that are no longer in use, but the same basic stuff. I see no need to use a different constant from systems.fullName. The longest full system name right now is 56 characters: Île-de-France Routes Départementales (Seine-Saint-Denis). Limit both fields to 60 characters? Add a few more chars' headroom to be on the safe side?
connectedRoutes.groupName and routes.city often but not always match. The one is based off the other; they're the same basic idea. I see no reason to use different constants for these two fields. Any good suggestions on what to call this one?
routeLongName The longest current example is 77 chars out of 80 allowed: Skorostnaya Avtomobil'naya Doroga Moskva - Sankt Peterburg (Vyshny Volochyok). Add a few more chars' headroom to be on the safe side?

C++:

Because Globals Are Bad, we could have these be static members of a class to avoid having to pass around another data structure between functions. Just do

sqlfile << "CREATE TABLE continents (code VARCHAR(" << dbfieldlength::continentCode
    << "), name VARCHAR(" << dbfieldlength::continentName
    << "), PRIMARY KEY(code));\n";

and off we go.

Python:

Less familiar with this, but it looks like we can do static class constants too.

jteresco commented 4 years ago

I don't see any great reason to expand the ones that have examples almost at the limit, but I also wouldn't object. These are small items by comparison to many of the others in the DB.

yakra commented 4 years ago

I don't see any great reason to expand the ones that have examples almost at the limit, but I also wouldn't object. These are small items by comparison to many of the others in the DB.

Let's leave these as they are, then. We can expand them if we ever need to in the future. It's one of the things the constants are meant to facilitate.

Expanding label to handle 26 chars for a "worst-case" like CHIH1604WAltBypCla(410B)_N.
Choosing countryRegion as the most descriptive of what its field contains, as "regionExtDescr" or "regionLongDescr" could be confused with name+regionType concatenations (E.G. Saint Barthelemy (Overseas Collectivity)) that don't actually show up in a single DB field.
systemFullName will be set to 60 chars; less space will be allotted in the systemUpdates table, which should not be a problem.
Using city for connectedRoutes.groupName and routes.city.

yakra commented 4 years ago

In some cases we won't need to check for valid length because our input must match data that's already been verified. For example, matching ConnectedRoute roots to an existing Route object's root field.

yakra commented 4 years ago

over-length labels will be implemented as a standard datacheck, LABEL_TOO_LONG. ToDo: prevent datacheck.php from displaying error codes.

yakra commented 4 years ago

ToDo:

[x] Right now, multiple errors on a single CSV line will require multiple passes thru siteupdate to be added to the ErrorList & displayed to the user, due to the way the continue statements are set up. Fix.
[x] C++: { in place of } spotted. Look for more.
[x] C++: tidy up indentation on el.add_error calls.
[x] checks for graphs table values, starting with filenames:
- subtract 14 B for "-collapsed.tmg"
- any values of the following are safe with the present field sizes:
- \<continentCode>+"-continent"
- \<countryCode>+"-country"
- \<systemName>+"-system"
- \<regionCode>+"-region"
- [x] multisystem & multiregion: fields[1] maxes out at DBFieldLength::graphFilename - 14
- [x] area: a.base + to_string(a.r) maxes out at DBFieldLength::graphFilename - 19
[x] sql_file.cpp & corresponding Python
fix missing std::to_string / str calls:
- compiler may not complain, because char pointer arithmetic.
- [x] look @ new code in GitKraken & fix, in both versions
- [x] See what the compiler does complain about. Fix & recheck Python.
- [x] Python will complain. Fix & recheck C++.
[x] extraneous C++ strings & length checks in Region.cpp; move back toward the old char* method
What "pars" items were flagged as "!xp" in C++?, where we don't report the number of fields? Do halves & halves, with el.add_error for unexpected field counts
- [x] HighwaySystem
- one last strtok to check for too many fields
- [x] Region (see above)
- [x] any other similar classes done by strtok? None.
- is anything strtok not using delete/return?
- expected n populated fields
- one last strtok to check for too many fields
[x] new delete/return combos? How is this stuff handled & checked for validity once the ctor returns? (Should all be fine, but best to be safe.)
- HighwaySystem: if level is assigned a value
- Region: if type is assigned a value, and country & continent are found
[x] "characters" -> "bytes"
[x] grep & diff CREATE TABLE lines in TravelMapping.sql: before & after, Python & C++
[x] Consider whether to check for unrecognised colours in HighwaySystem ctor
[x] Mirror exceptions in C++. ~Verbatim?~...
- [x] Will these be reported the same way by Ubuntu/CentOS/BSD? (Yes? Yes.)

siteupdate.py approx. line	try to open
992	wpt file
1171	chopped CSV
1182	_con CSV
2236	continents.csv
2260	countries.csv
2284	regions.csv
2337	systems.csv

yakra commented 4 years ago

"(countryName) regionName". countryName of 32 + regionName of 48 + 3 more chars for parentheses and a space = 83. Shall we expand the systemUpdates.region and updates.region to 83 characters?

Similarly, datacheckErrors.value for VISIBLE_HIDDEN_COLOC cases is root@label. root of 32 + label of 26 + 1 more char for @ = 59.

Instead of hard-coding these constants, we can set them relative to the other constants.

yakra commented 4 years ago

It's possible for HighwayData contributors to add data to areagraphs.csv, multiregion.csv and multisystem.csv that would result in a filename over the limit of what the DB can store. Ideally, we'd add to the ErrorList and abort.

Except that we check ErrorList and abort before Setting up for graphs of highway data. This was intentional -- @jteresco wrote in #162:

Recent changes have meant that when a data error slips in, the "ABORTING due to N errors" doesn't happen until after (the time consuming) generation of all the subgraphs. I had gotten used to being able to ignore the site update process once graph generation started. I'd like to see about getting that ABORTING condition moved back earlier so I have a better chance to notice the problem.

The subgraph speed optimizations started later that month, and graph generation is now much less time consuming -- 2m27s (including traveled graphs!) vice 21m40s.

My hope is that this is a tolerable amount of time for the ABORTING condition to be moved back after graph generation to catch the remaining errors. I'd prefer this over the couple of alternatives I can think of.

jteresco commented 4 years ago

Yes, definitely can move after the graphs now to get the more complete error list before aborting. One possibility that is probably not worth the trouble is once any error has been detected, we skip things like writing of the actual files.

yakra commented 4 years ago

Skipping all area, multiregion & multisystem graphs would save us a total of 13.2s (Python/noreaster). Skipping only whichever new ones cause the error would most likely be an even smaller difference. I can see the utility in generating a desired file, just with a too-long name, and just skipping writing the DB. Call it "not worth the trouble"?

yakra commented 4 years ago

The longest full system name right now is 56 characters: Île-de-France Routes Départementales (Seine-Saint-Denis).

I think when I ran LongFields.py, I had an old head checked out (the one I use for speed tests) that predates frabfcd70;FRA;Bourgogne-Franche-Comté Routes Départementales (Haute-Saône);yellow;5;devel. Interestingly, LongFields.py reports this as 60 characters, seemingly right up against the 60 char DB limit. But it turns out this has 3 multi-byte characters, making it total of 63 bytes. The revised C++ program correctly reports this as oversized. I'm still working out some weird runtime bugs with that version, and have not tested the new siteupdate.py yet. It'll be interesting to see what it has to say about this when I run it.

There are a couple more systems still commented out with longer names:

63  Bourgogne-Franche-Comté Routes Départementales (Côte-d’Or)
63  Bourgogne-Franche-Comté Routes Départementales (Haute-Saône)
66  Bourgogne-Franche-Comté Routes Départementales (Saône-et-Loire)
72  Bourgogne-Franche-Comté Routes Départementales (Territoire de Belfort)

My first thought was to make a stopgap expansion of systems.fullName pending sorting out the rest of this issue, but closer inspection reveals many Routes Départementales systems' names to be questionable and able to be shortened. I'll take this to the forum.

yakra commented 4 years ago

Just queried the DB on lab2; turns out that Bourgogne-Franche-Comté Routes Départementales (Haute-Saône), at 60 characters, can in fact fit in the DB.

Regardless of how the Python version behaves, this complicates the C++ version quite a bit, as the "error" it's reporting is a false positive.

Fun fun. I think I'll take a break & spend my next couple days of quarantine getting reacquainted with my classic video game systems.

yakra commented 4 years ago

Just queried the DB on lab2; turns out that Bourgogne-Franche-Comté Routes Départementales (Haute-Saône), at 60 characters, can in fact fit in the DB.

I'm guessing this changed some time between MySQL 5.7 & 8.0, along with the transition from 5.7 accepting & truncating overlength values, to 8.0 stopping with an error message...

TravelMapping-2020-03-24@21:13:20.sql worked fine on noreaster
lab2 barfed on it: ERROR 1406 (22001) at line 726: Data too long for column 'fullName' at row 362 (the offending datum being Finland Seututiet (outlined - temp preview to fix active routes))

I'm bummed that I didn't take a screenshot of travelmapping.net when things were messed up, but ISTR the ñ being displayed as a �, as in https://github.com/TravelMapping/Web/issues/195. Here's how it appears on lab2: MEX57Cañ

Seems the thing to do is plan for the most restrictive circumstances, that we're limited by byte counts rather than character counts. Pro: We don't have to get the C++ version to do character counts. Con: We do have to get the Python version to do byte counts.

Edit: OK, it's pretty straightforward:

>>> print(len('ñ'.encode('utf-8')))
2

yakra commented 4 years ago

diffs

The old C++ version would stop processing a Route object if it encountered an overlength Banner or Abbrev. Fixed. Once I remove bel.n712pel, there are no diffs between the old and new logs/stats/graphs/nmp_merged.
[x] siteupdate.py is running right now, 2nd pass just started. ToDo next.
[x] And then, diff the new C++ & new Python output.

[x] After that, on to changing all the new len('ñ') to len('ñ'.encode('utf-8')) checks.

yakra commented 4 years ago

OK, finally about ready to go on this. Considering all the small things to watch out for & fix that kept popping up, it'd be wise to break one of everything in HighwayData and make sure everything behaves as expected.

Since the C++ version is all about speed, it writes the .sql file in the background during the graph generation process. Thus we have almost a complete file by the time we reach the ABORTING condition. All that's left are datacheckErrors, graphTypes & graphs. This is sometimes reported as taking 0.1s to write, so I just decided to finish it off before terminating. Not much harm in a junk .sql file hanging around, and it's there completely if you wanna do a post-mortem grep. And who knows, depending on the errors involved, it may even be possible to manually ingest it into a lab2 with no, or even a noreaster with minimal, ill effects. What matters is that the program terminates with an exit code of 1, so localupdate.sh knows to not continue.

A final thought: How about checking for unrecognized colors in systems.csv? We could have a set of the "official TM" colors, and check each systems.csv datum for membership in this set, catching potential misspellings etc. Just that DBFieldLength is no longer the best class name for all our constants. Ideas on a better name? Thoughts?

yakra commented 4 years ago

Break & test the following in C++:
- [x] region code in chopped routes .csv
- country code in:
- [x] regions.csv
- [x] systems.csv
- [x] continent code in regions.csv
[x] Keep dummy entries out of the DB

jteresco commented 4 years ago

A thought on colors, probably overkill. We could go a bit further, having a colors.csv that defines names and RGB values for the "official" TM colors, introducing a new DB table to store them, the systems table would refer to those entries, and the front-end code would load up its set of colors from that new DB table.

yakra commented 4 years ago

I don't see a big need to specify RGB values, as we can already override them, albeit on a per-tier rather than per-color basis. I can happily agree to "probably overkill", and leave this feature out. It'll be good to get this enhancement out the door.

Before the big stress test gets underway, while I'm in the CSV parsing routes I'm giving the C++ ones a much-needed cleanup, and tightening a couple other minor screws. After, there will be at least one merge conflict to resolve on my end.

yakra commented 4 years ago

There's a remote chance that the MALFORMED_LAT & MALFORMED_LON datachecks can produce overlength info values. I included checks for this, but the Python version is not 100% foolproof. Rather than spend time on a fix and then gunk up the commit history with obsolete code that will just be overwritten again in a few days, I think I'll scrap these changes (in both languages) & go no-build pending a cleanup per #299.

yakra commented 4 years ago

cd ~/TravelMapping/yakra/DataProcessing/siteupdate/cplusplus/

g++ siteupdate.cpp -std=c++11 -pthread -o siteupdate_const 2>&1
./siteupdate_const -t 4 -l logs -n nmp_merged -c stats -g graphs -u /home/yakra/TravelMapping/UserData/list_files -w /home/yakra/TravelMapping/HighwayData | tee siteupdate_m1.log
grep -n ABORTING siteupdate_m1.log; AbortLineCPP=$(grep -n ABORTING siteupdate_m1.log | cut -f1 -d:)
tail -n +$AbortLineCPP siteupdate_m1.log | less

cd ../python-teresco

./siteupdate.py -l logs -n nmp_merged -c stats -g graphs -u /home/yakra/TravelMapping/UserData/list_files -w /home/yakra/TravelMapping/HighwayData -t 1 | tee siteupdate_s1.log
grep -n ABORTING siteupdate_s1.log; AbortLinePY=$(grep -n ABORTING siteupdate_s1.log | cut -f1 -d:)
tail -n +$AbortLinePY siteupdate_s1.log | less

cd ..

diff <(tail -n +`expr $AbortLineCPP + 1` cplusplus/siteupdate_m1.log | sed 's~^[0-9]\+: ~~' | head -n 40) <(tail -n +`expr $AbortLinePY + 1` python-teresco/siteupdate_s1.log | sed 's~^[0-9]\+: ~~' | head -n 40)

yakra commented 4 years ago

        # look up country and continent, add index into those arrays
        # in case they're needed for lookups later (not needed for DB)
        for i in range(len(countries)):
            country = countries[i][0]
            if country == fields[2]:
                fields.append(i)
                break
        if len(fields) != 6:
            el.add_error("Could not find country matching regions.csv line: " + line)
            continue
        for i in range(len(continents)):
            continent = continents[i][0]
            if continent == fields[3]:
                fields.append(i)
                break
        if len(fields) != 7:
            el.add_error("Could not find continent matching regions.csv line: " + line)
            continue

I've retained the lookups for matching countries & continents. Any objections if I nix storing the indices as part of the fields array?

yakra commented 4 years ago

[x] vdiffs are slowly becoming sane
Combine where feasible (which is nowhere):
- ~if len(line) == 0 or line.startswith('#'):~
- ~if (line.empty() || line[0] == '#')~
[x] Make sure all HighwayData breakages are as expected. I know I whiffed at least 4 of'em.
[x] C++: Make rg_str part of the Route object; use it to load & process WPTs
- [x] Change validity check in HighwaySystem ctor accordingly
[x] Check DIFFs again
[x] Revert the Waypoint ctor changes
[x] Maybe revert the list delimiter changes for the time being too
[x] Merge & deal with conflicts

jteresco commented 4 years ago

I've retained the lookups for matching countries & continents. Any objections if I nix storing the indices as part of the fields array?

I am not sure why they were there in the first place, so I suppose no harm in getting rid of them.

yakra commented 4 years ago

# in case they're needed for lookups later (not needed for DB)

Sounds like a case of planning for the future, not fully knowing yet what's going to be done with the data.

I did simillar when starting out the C++ version, looking up & storing a pointer the the continent & country std::pairs. Nothing much is done with them, but the only less RAM-intensive option would have been to include a C char array in the Region object. :)

TravelMapping / DataProcessing