TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project
4 stars 6 forks source link

Datachecks for field lengths #56

Closed jteresco closed 4 years ago

jteresco commented 7 years ago

A system should be put in place where all entries from TM data files that end up in DB fields are checked for length to avoid problems like https://github.com/TravelMapping/Web/issues/195 .

A set of constants should be introduced to the site update program to make sure these values are consistent among all checks and when DB tables are created.

jteresco commented 6 years ago

More evidence that this should happen: http://tm.teresco.org/forum/index.php?topic=2358.0

jteresco commented 6 years ago

This is now more important, and should be combined with a truncation of entries that are too long, since MySQL 5.7 will not allow overlength fields to be added to a column. Increasing (possibly temporarily, possibly to be replaced with a TEXT field) the width of the description in the updates table.

yakra commented 5 years ago

This could be done during the Route constructor. el.add_error, and, yes, halt the siteupdate process. :)

yakra commented 5 years ago

Forgot about this one. I implemented at least part of it in C++: https://github.com/TravelMapping/DataProcessing/blob/0817bdd1341dc2fca907e5cd0e0465c9e488b45b/siteupdate/cplusplus/classes/Route/Route.cpp#L56-L59

yakra commented 5 years ago

MySQL 8 won't even insert the overlength values unto the tables, and the site update process halts. ERROR 1406 (22001) at line 1074: Data too long for column 'banner' at row 49532 fraidfd is commented out on lab2 until the HighwayData side is resolved.

yakra commented 5 years ago

The above means these should be handled at the ErrorList level of the siteupdate program, rather than create a DatacheckEntry that would need a successful site update in order to be shown on siteupdate.php.

yakra commented 5 years ago

C++

ABORTING due to 3 errors:
1: Abbrev >3 characters in usafl.csv line: [usafl;FL;FL44;;Leesburg;;fl.fl044;], Leesburg
2: Could not find Route matching root fl.fl044 in system usafl.
3: No roots in usafl_con.csv line [usafl;FL44;;Leesburg;fl.fl044]
Fri Sep  6 11:51:20 EDT 2019

The 2nd and 3rd errors happen because the Route constructor aborts, and then the HighwaySystem ctor does if (!route_list.back().is_valid()) route_list.pop_back();

yakra commented 4 years ago

http://forum.travelmapping.net/index.php?topic=3147.msg18099#msg18099

yakra commented 4 years ago

In the OP, @jteresco wrote:

A set of constants should be introduced to the site update program to make sure these values are consistent among all checks and when DB tables are created.

table text field proposed
constant
current
length
clinched traveler traveler 48
clinchedConnectedRoutes route root 32
clinchedConnectedRoutes traveler traveler 48
clinchedOverallMileageByRegion region regionCode 8
clinchedOverallMileageByRegion traveler traveler 48
clinchedRoutes route root 32
clinchedRoutes traveler traveler 48
clinchedSystemMileageByRegion systemName systemName 10
clinchedSystemMileageByRegion region regionCode 8
clinchedSystemMileageByRegion traveler traveler 48
connectedRouteRoots firstRoot root 32
connectedRouteRoots root root 32
connectedRoutes systemName systemName 10
connectedRoutes route route 16
connectedRoutes banner banner 6
connectedRoutes groupName ? 100
connectedRoutes firstRoot root 32
continents code continentCode 3
continents name continentName 15
countries code countryCode 3
countries name countryName 32
datacheckErrors route root 32
datacheckErrors label1 label 50
datacheckErrors label2 label 20
datacheckErrors label3 label 20
datacheckErrors code dcErrCode 22
datacheckErrors value dcErrValue 32
graphTypes category graphCategory 12
graphTypes descr graphDescr 100
graphTypes longDescr N/A N/A
graphs filename graphFilename 32
graphs descr graphDescr 100
graphs format graphFormat 10
graphs category graphCategory 12
overallMileageByRegion region regionCode 8
regions code regionCode 8
regions name regionName 48
regions country countryCode 3
regions continent continentCode 3
regions regiontype regiontype 32
routes systemName systemName 10
routes region regionCode 8
routes route route 16
routes banner banner 6
routes abbrev abbrev 3
routes city ? 100
routes root root 32
segments root root 32
systemMileageByRegion systemName systemName 10
systemMileageByRegion region regionCode 8
systemUpdates date date 10
systemUpdates region regionExtDescr
regionLongDescr
countryRegion
or...?
48
systemUpdates systemName systemName 10
systemUpdates description systemFullName 128
systemUpdates statusChange statusChange 16
systems systemName systemName 10
systems countryCode countryCode 3
systems fullName systemFullName 60
systems color color 16
systems level level 10
updates date date 10
updates region regionExtDescr
regionLongDescr
countryRegion
or...?
60
updates route routeLongName 80
updates root root 32
updates description updateText 1024
waypoints pointName label 20
waypoints root root 32

C++:

Because Globals Are Bad, we could have these be static members of a class to avoid having to pass around another data structure between functions. Just do

sqlfile << "CREATE TABLE continents (code VARCHAR(" << dbfieldlength::continentCode
    << "), name VARCHAR(" << dbfieldlength::continentName
    << "), PRIMARY KEY(code));\n";

and off we go.

Python:

Less familiar with this, but it looks like we can do static class constants too.

jteresco commented 4 years ago

I don't see any great reason to expand the ones that have examples almost at the limit, but I also wouldn't object. These are small items by comparison to many of the others in the DB.

yakra commented 4 years ago

I don't see any great reason to expand the ones that have examples almost at the limit, but I also wouldn't object. These are small items by comparison to many of the others in the DB.

Let's leave these as they are, then. We can expand them if we ever need to in the future. It's one of the things the constants are meant to facilitate.

yakra commented 4 years ago

In some cases we won't need to check for valid length because our input must match data that's already been verified. For example, matching ConnectedRoute roots to an existing Route object's root field.

yakra commented 4 years ago

over-length labels will be implemented as a standard datacheck, LABEL_TOO_LONG. ToDo: prevent datacheck.php from displaying error codes.

yakra commented 4 years ago

ToDo:

siteupdate.py
approx. line
try to open
992 wpt file
1171 chopped CSV
1182 _con CSV
2236 continents.csv
2260 countries.csv
2284 regions.csv
2337 systems.csv
yakra commented 4 years ago

"(countryName) regionName". countryName of 32 + regionName of 48 + 3 more chars for parentheses and a space = 83. Shall we expand the systemUpdates.region and updates.region to 83 characters?

Similarly, datacheckErrors.value for VISIBLE_HIDDEN_COLOC cases is root@label. root of 32 + label of 26 + 1 more char for @ = 59.

Instead of hard-coding these constants, we can set them relative to the other constants.

yakra commented 4 years ago

It's possible for HighwayData contributors to add data to areagraphs.csv, multiregion.csv and multisystem.csv that would result in a filename over the limit of what the DB can store. Ideally, we'd add to the ErrorList and abort.

Except that we check ErrorList and abort before Setting up for graphs of highway data. This was intentional -- @jteresco wrote in #162:

Recent changes have meant that when a data error slips in, the "ABORTING due to N errors" doesn't happen until after (the time consuming) generation of all the subgraphs. I had gotten used to being able to ignore the site update process once graph generation started. I'd like to see about getting that ABORTING condition moved back earlier so I have a better chance to notice the problem.

The subgraph speed optimizations started later that month, and graph generation is now much less time consuming -- 2m27s (including traveled graphs!) vice 21m40s.

My hope is that this is a tolerable amount of time for the ABORTING condition to be moved back after graph generation to catch the remaining errors. I'd prefer this over the couple of alternatives I can think of.

jteresco commented 4 years ago

Yes, definitely can move after the graphs now to get the more complete error list before aborting. One possibility that is probably not worth the trouble is once any error has been detected, we skip things like writing of the actual files.

yakra commented 4 years ago

Skipping all area, multiregion & multisystem graphs would save us a total of 13.2s (Python/noreaster). Skipping only whichever new ones cause the error would most likely be an even smaller difference. I can see the utility in generating a desired file, just with a too-long name, and just skipping writing the DB. Call it "not worth the trouble"?

yakra commented 4 years ago

The longest full system name right now is 56 characters: Île-de-France Routes Départementales (Seine-Saint-Denis).

I think when I ran LongFields.py, I had an old head checked out (the one I use for speed tests) that predates frabfcd70;FRA;Bourgogne-Franche-Comté Routes Départementales (Haute-Saône);yellow;5;devel. Interestingly, LongFields.py reports this as 60 characters, seemingly right up against the 60 char DB limit. But it turns out this has 3 multi-byte characters, making it total of 63 bytes. The revised C++ program correctly reports this as oversized. I'm still working out some weird runtime bugs with that version, and have not tested the new siteupdate.py yet. It'll be interesting to see what it has to say about this when I run it.

There are a couple more systems still commented out with longer names:

63  Bourgogne-Franche-Comté Routes Départementales (Côte-d’Or)
63  Bourgogne-Franche-Comté Routes Départementales (Haute-Saône)
66  Bourgogne-Franche-Comté Routes Départementales (Saône-et-Loire)
72  Bourgogne-Franche-Comté Routes Départementales (Territoire de Belfort)

My first thought was to make a stopgap expansion of systems.fullName pending sorting out the rest of this issue, but closer inspection reveals many Routes Départementales systems' names to be questionable and able to be shortened. I'll take this to the forum.

yakra commented 4 years ago

Just queried the DB on lab2; turns out that Bourgogne-Franche-Comté Routes Départementales (Haute-Saône), at 60 characters, can in fact fit in the DB.

Regardless of how the Python version behaves, this complicates the C++ version quite a bit, as the "error" it's reporting is a false positive.

Fun fun. I think I'll take a break & spend my next couple days of quarantine getting reacquainted with my classic video game systems.

yakra commented 4 years ago

Just queried the DB on lab2; turns out that Bourgogne-Franche-Comté Routes Départementales (Haute-Saône), at 60 characters, can in fact fit in the DB.

I'm guessing this changed some time between MySQL 5.7 & 8.0, along with the transition from 5.7 accepting & truncating overlength values, to 8.0 stopping with an error message...

I'm bummed that I didn't take a screenshot of travelmapping.net when things were messed up, but ISTR the ñ being displayed as a �, as in https://github.com/TravelMapping/Web/issues/195. Here's how it appears on lab2: MEX57Cañ

Seems the thing to do is plan for the most restrictive circumstances, that we're limited by byte counts rather than character counts. Pro: We don't have to get the C++ version to do character counts. Con: We do have to get the Python version to do byte counts.

Edit: OK, it's pretty straightforward:

>>> print(len('ñ'.encode('utf-8')))
2
yakra commented 4 years ago

diffs


yakra commented 4 years ago

OK, finally about ready to go on this. Considering all the small things to watch out for & fix that kept popping up, it'd be wise to break one of everything in HighwayData and make sure everything behaves as expected.

Since the C++ version is all about speed, it writes the .sql file in the background during the graph generation process. Thus we have almost a complete file by the time we reach the ABORTING condition. All that's left are datacheckErrors, graphTypes & graphs. This is sometimes reported as taking 0.1s to write, so I just decided to finish it off before terminating. Not much harm in a junk .sql file hanging around, and it's there completely if you wanna do a post-mortem grep. And who knows, depending on the errors involved, it may even be possible to manually ingest it into a lab2 with no, or even a noreaster with minimal, ill effects. What matters is that the program terminates with an exit code of 1, so localupdate.sh knows to not continue.

A final thought: How about checking for unrecognized colors in systems.csv? We could have a set of the "official TM" colors, and check each systems.csv datum for membership in this set, catching potential misspellings etc. Just that DBFieldLength is no longer the best class name for all our constants. Ideas on a better name? Thoughts?

yakra commented 4 years ago
jteresco commented 4 years ago

A thought on colors, probably overkill. We could go a bit further, having a colors.csv that defines names and RGB values for the "official" TM colors, introducing a new DB table to store them, the systems table would refer to those entries, and the front-end code would load up its set of colors from that new DB table.

yakra commented 4 years ago

I don't see a big need to specify RGB values, as we can already override them, albeit on a per-tier rather than per-color basis. I can happily agree to "probably overkill", and leave this feature out. It'll be good to get this enhancement out the door.

Before the big stress test gets underway, while I'm in the CSV parsing routes I'm giving the C++ ones a much-needed cleanup, and tightening a couple other minor screws. After, there will be at least one merge conflict to resolve on my end.

yakra commented 4 years ago

There's a remote chance that the MALFORMED_LAT & MALFORMED_LON datachecks can produce overlength info values. I included checks for this, but the Python version is not 100% foolproof. Rather than spend time on a fix and then gunk up the commit history with obsolete code that will just be overwritten again in a few days, I think I'll scrap these changes (in both languages) & go no-build pending a cleanup per #299.

yakra commented 4 years ago
cd ~/TravelMapping/yakra/DataProcessing/siteupdate/cplusplus/

g++ siteupdate.cpp -std=c++11 -pthread -o siteupdate_const 2>&1
./siteupdate_const -t 4 -l logs -n nmp_merged -c stats -g graphs -u /home/yakra/TravelMapping/UserData/list_files -w /home/yakra/TravelMapping/HighwayData | tee siteupdate_m1.log
grep -n ABORTING siteupdate_m1.log; AbortLineCPP=$(grep -n ABORTING siteupdate_m1.log | cut -f1 -d:)
tail -n +$AbortLineCPP siteupdate_m1.log | less

cd ../python-teresco

./siteupdate.py -l logs -n nmp_merged -c stats -g graphs -u /home/yakra/TravelMapping/UserData/list_files -w /home/yakra/TravelMapping/HighwayData -t 1 | tee siteupdate_s1.log
grep -n ABORTING siteupdate_s1.log; AbortLinePY=$(grep -n ABORTING siteupdate_s1.log | cut -f1 -d:)
tail -n +$AbortLinePY siteupdate_s1.log | less

cd ..

diff <(tail -n +`expr $AbortLineCPP + 1` cplusplus/siteupdate_m1.log | sed 's~^[0-9]\+: ~~' | head -n 40) <(tail -n +`expr $AbortLinePY + 1` python-teresco/siteupdate_s1.log | sed 's~^[0-9]\+: ~~' | head -n 40)
yakra commented 4 years ago
        # look up country and continent, add index into those arrays
        # in case they're needed for lookups later (not needed for DB)
        for i in range(len(countries)):
            country = countries[i][0]
            if country == fields[2]:
                fields.append(i)
                break
        if len(fields) != 6:
            el.add_error("Could not find country matching regions.csv line: " + line)
            continue
        for i in range(len(continents)):
            continent = continents[i][0]
            if continent == fields[3]:
                fields.append(i)
                break
        if len(fields) != 7:
            el.add_error("Could not find continent matching regions.csv line: " + line)
            continue

I've retained the lookups for matching countries & continents. Any objections if I nix storing the indices as part of the fields array?

yakra commented 4 years ago
jteresco commented 4 years ago

I've retained the lookups for matching countries & continents. Any objections if I nix storing the indices as part of the fields array?

I am not sure why they were there in the first place, so I suppose no harm in getting rid of them.

yakra commented 4 years ago

# in case they're needed for lookups later (not needed for DB)

Sounds like a case of planning for the future, not fully knowing yet what's going to be done with the data.

I did simillar when starting out the C++ version, looking up & storing a pointer the the continent & country std::pairs. Nothing much is done with them, but the only less RAM-intensive option would have been to include a C char array in the Region object. :)