Closed jteresco closed 4 years ago
More evidence that this should happen: http://tm.teresco.org/forum/index.php?topic=2358.0
This is now more important, and should be combined with a truncation of entries that are too long, since MySQL 5.7 will not allow overlength fields to be added to a column. Increasing (possibly temporarily, possibly to be replaced with a TEXT field) the width of the description in the updates table.
This could be done during the Route constructor. el.add_error
, and, yes, halt the siteupdate process. :)
Forgot about this one. I implemented at least part of it in C++: https://github.com/TravelMapping/DataProcessing/blob/0817bdd1341dc2fca907e5cd0e0465c9e488b45b/siteupdate/cplusplus/classes/Route/Route.cpp#L56-L59
MySQL 8 won't even insert the overlength values unto the tables, and the site update process halts.
ERROR 1406 (22001) at line 1074: Data too long for column 'banner' at row 49532
fraidfd is commented out on lab2 until the HighwayData side is resolved.
The above means these should be handled at the ErrorList
level of the siteupdate program, rather than create a DatacheckEntry
that would need a successful site update in order to be shown on siteupdate.php.
C++
ABORTING due to 3 errors:
1: Abbrev >3 characters in usafl.csv line: [usafl;FL;FL44;;Leesburg;;fl.fl044;], Leesburg
2: Could not find Route matching root fl.fl044 in system usafl.
3: No roots in usafl_con.csv line [usafl;FL44;;Leesburg;fl.fl044]
Fri Sep 6 11:51:20 EDT 2019
The 2nd and 3rd errors happen because the Route constructor aborts, and then the HighwaySystem ctor does if (!route_list.back().is_valid()) route_list.pop_back();
In the OP, @jteresco wrote:
A set of constants should be introduced to the site update program to make sure these values are consistent among all checks and when DB tables are created.
table | text field | proposed constant |
current length |
---|---|---|---|
clinched | traveler | traveler | 48 |
clinchedConnectedRoutes | route | root | 32 |
clinchedConnectedRoutes | traveler | traveler | 48 |
clinchedOverallMileageByRegion | region | regionCode | 8 |
clinchedOverallMileageByRegion | traveler | traveler | 48 |
clinchedRoutes | route | root | 32 |
clinchedRoutes | traveler | traveler | 48 |
clinchedSystemMileageByRegion | systemName | systemName | 10 |
clinchedSystemMileageByRegion | region | regionCode | 8 |
clinchedSystemMileageByRegion | traveler | traveler | 48 |
connectedRouteRoots | firstRoot | root | 32 |
connectedRouteRoots | root | root | 32 |
connectedRoutes | systemName | systemName | 10 |
connectedRoutes | route | route | 16 |
connectedRoutes | banner | banner | 6 |
connectedRoutes | groupName | ? | 100 |
connectedRoutes | firstRoot | root | 32 |
continents | code | continentCode | 3 |
continents | name | continentName | 15 |
countries | code | countryCode | 3 |
countries | name | countryName | 32 |
datacheckErrors | route | root | 32 |
datacheckErrors | label1 | label | 50 |
datacheckErrors | label2 | label | 20 |
datacheckErrors | label3 | label | 20 |
datacheckErrors | code | dcErrCode | 22 |
datacheckErrors | value | dcErrValue | 32 |
graphTypes | category | graphCategory | 12 |
graphTypes | descr | graphDescr | 100 |
graphTypes | longDescr | N/A | N/A |
graphs | filename | graphFilename | 32 |
graphs | descr | graphDescr | 100 |
graphs | format | graphFormat | 10 |
graphs | category | graphCategory | 12 |
overallMileageByRegion | region | regionCode | 8 |
regions | code | regionCode | 8 |
regions | name | regionName | 48 |
regions | country | countryCode | 3 |
regions | continent | continentCode | 3 |
regions | regiontype | regiontype | 32 |
routes | systemName | systemName | 10 |
routes | region | regionCode | 8 |
routes | route | route | 16 |
routes | banner | banner | 6 |
routes | abbrev | abbrev | 3 |
routes | city | ? | 100 |
routes | root | root | 32 |
segments | root | root | 32 |
systemMileageByRegion | systemName | systemName | 10 |
systemMileageByRegion | region | regionCode | 8 |
systemUpdates | date | date | 10 |
systemUpdates | region | regionExtDescr regionLongDescr countryRegion or...? |
48 |
systemUpdates | systemName | systemName | 10 |
systemUpdates | description | systemFullName | 128 |
systemUpdates | statusChange | statusChange | 16 |
systems | systemName | systemName | 10 |
systems | countryCode | countryCode | 3 |
systems | fullName | systemFullName | 60 |
systems | color | color | 16 |
systems | level | level | 10 |
updates | date | date | 10 |
updates | region | regionExtDescr regionLongDescr countryRegion or...? |
60 |
updates | route | routeLongName | 80 |
updates | root | root | 32 |
updates | description | updateText | 1024 |
waypoints | pointName | label | 20 |
waypoints | root | root | 32 |
label
datacheckErrors.label1
was expanded a bit after the VISIBLE_HIDDEN_COLOC datacheck was first introduced, which at the time used long graph vertex style labels. It's since been changed to list the waypoint label used in the flagged route, so this field's length can be reduced to match the other waypoint label fields.
These are currently limited to 20 characters. I've uncovered an 18-character label so far, with a shell script still chugging away. While that one deserves a forum post, it's possible a contributor may use a longer label in a hidden waypoint for note-taking purposes. Which is legitimate, IMO. Reaching for the outer limits of what we may expect for a visible label, I come up with CHIH1604WAltBypCla(410B)_N, at 26 characters.
(I can only hope that should we ever come across a "South Dear Leader His Majestic Royal Highness The First And Foremost Exalted Grandest Of The Grand Poobah Among Poobahs Sir Comrade Charles Christian 'Otto Yamamoto' Slater Junior Esquire Boulevard West" that needs a waypoint, we'd name it ChaSlaBlvd or OttoYamBlvd and not... uhh... yeah.)
To be on the safe side & provide a bit of headroom, shall we expand the waypoint label fields (datacheckErrors.label1
, datacheckErrors.label2
, datacheckErrors.label3
, waypoints.pointName
) to 26 characters?
regionExtDescr, regionLongDescr, countryRegion... ?
I'm not sure what to call this. Any preferences or suggestions? As seen in the updates
and systemUpdates
tables, these often take the form of "(countryName) regionName".
countryName of 32 + regionName of 48 + 3 more chars for parentheses and a space = 83.
Shall we expand the systemUpdates.region
and updates.region
to 83 characters?
route
Not to be confused with where a DB field named route
contains a root. (Maybe call it routeName?)
16 characters ought to be fine. The two 12-char examples I've seen (there may be others) appear to defy naming conventions anyway.
systemFullName
systemUpdates.description
contains full highway system names. Some that are no longer in use, but the same basic stuff. I see no need to use a different constant from systems.fullName
. The longest full system name right now is 56 characters: Île-de-France Routes Départementales (Seine-Saint-Denis)
.
Limit both fields to 60 characters? Add a few more chars' headroom to be on the safe side?
connectedRoutes.groupName
and routes.city
often but not always match. The one is based off the other; they're the same basic idea. I see no reason to use different constants for these two fields. Any good suggestions on what to call this one?
routeLongName
The longest current example is 77 chars out of 80 allowed: Skorostnaya Avtomobil'naya Doroga Moskva - Sankt Peterburg (Vyshny Volochyok)
.
Add a few more chars' headroom to be on the safe side?
Because Globals Are Bad, we could have these be static members of a class to avoid having to pass around another data structure between functions. Just do
sqlfile << "CREATE TABLE continents (code VARCHAR(" << dbfieldlength::continentCode
<< "), name VARCHAR(" << dbfieldlength::continentName
<< "), PRIMARY KEY(code));\n";
and off we go.
Less familiar with this, but it looks like we can do static class constants too.
I don't see any great reason to expand the ones that have examples almost at the limit, but I also wouldn't object. These are small items by comparison to many of the others in the DB.
I don't see any great reason to expand the ones that have examples almost at the limit, but I also wouldn't object. These are small items by comparison to many of the others in the DB.
Let's leave these as they are, then. We can expand them if we ever need to in the future. It's one of the things the constants are meant to facilitate.
CHIH1604WAltBypCla(410B)_N
.Saint Barthelemy (Overseas Collectivity)
) that don't actually show up in a single DB field.systemUpdates
table, which should not be a problem.connectedRoutes.groupName
and routes.city
.In some cases we won't need to check for valid length because our input must match data that's already been verified. For example, matching ConnectedRoute roots to an existing Route object's root field.
over-length labels will be implemented as a standard datacheck, LABEL_TOO_LONG. ToDo: prevent datacheck.php from displaying error codes.
continue
statements are set up. Fix.{
in place of }
spotted. Look for more.fields[1]
maxes out at DBFieldLength::graphFilename - 14a.base + to_string(a.r)
maxes out at DBFieldLength::graphFilename - 19std::to_string
/ str
calls:
char*
methodel.add_error
for unexpected field counts
delete
/return
combos?
How is this stuff handled & checked for validity once the ctor returns?
(Should all be fine, but best to be safe.)
level
is assigned a valuetype
is assigned a value, and country & continent are foundCREATE TABLE
lines in TravelMapping.sql: before & after, Python & C++except
ions in C++. ~Verbatim?~...
siteupdate.py approx. line |
try to open |
---|---|
992 | wpt file |
1171 | chopped CSV |
1182 | _con CSV |
2236 | continents.csv |
2260 | countries.csv |
2284 | regions.csv |
2337 | systems.csv |
"(countryName) regionName". countryName of 32 + regionName of 48 + 3 more chars for parentheses and a space = 83. Shall we expand the
systemUpdates.region
andupdates.region
to 83 characters?
Similarly, datacheckErrors.value
for VISIBLE_HIDDEN_COLOC cases is root@label.
root of 32 + label of 26 + 1 more char for @
= 59.
Instead of hard-coding these constants, we can set them relative to the other constants.
It's possible for HighwayData contributors to add data to areagraphs.csv, multiregion.csv and multisystem.csv that would result in a filename over the limit of what the DB can store.
Ideally, we'd add to the ErrorList
and abort.
Except that we check ErrorList
and abort before Setting up for graphs of highway data.
This was intentional -- @jteresco wrote in #162:
Recent changes have meant that when a data error slips in, the "ABORTING due to N errors" doesn't happen until after (the time consuming) generation of all the subgraphs. I had gotten used to being able to ignore the site update process once graph generation started. I'd like to see about getting that ABORTING condition moved back earlier so I have a better chance to notice the problem.
The subgraph speed optimizations started later that month, and graph generation is now much less time consuming -- 2m27s (including traveled graphs!) vice 21m40s.
My hope is that this is a tolerable amount of time for the ABORTING condition to be moved back after graph generation to catch the remaining errors. I'd prefer this over the couple of alternatives I can think of.
Yes, definitely can move after the graphs now to get the more complete error list before aborting. One possibility that is probably not worth the trouble is once any error has been detected, we skip things like writing of the actual files.
Skipping all area, multiregion & multisystem graphs would save us a total of 13.2s (Python/noreaster). Skipping only whichever new ones cause the error would most likely be an even smaller difference. I can see the utility in generating a desired file, just with a too-long name, and just skipping writing the DB. Call it "not worth the trouble"?
The longest full system name right now is 56 characters:
Île-de-France Routes Départementales (Seine-Saint-Denis)
.
I think when I ran LongFields.py, I had an old head checked out (the one I use for speed tests) that predates frabfcd70;FRA;Bourgogne-Franche-Comté Routes Départementales (Haute-Saône);yellow;5;devel
.
Interestingly, LongFields.py reports this as 60 characters, seemingly right up against the 60 char DB limit. But it turns out this has 3 multi-byte characters, making it total of 63 bytes.
The revised C++ program correctly reports this as oversized. I'm still working out some weird runtime bugs with that version, and have not tested the new siteupdate.py yet. It'll be interesting to see what it has to say about this when I run it.
There are a couple more systems still commented out with longer names:
63 Bourgogne-Franche-Comté Routes Départementales (Côte-d’Or)
63 Bourgogne-Franche-Comté Routes Départementales (Haute-Saône)
66 Bourgogne-Franche-Comté Routes Départementales (Saône-et-Loire)
72 Bourgogne-Franche-Comté Routes Départementales (Territoire de Belfort)
My first thought was to make a stopgap expansion of systems.fullName
pending sorting out the rest of this issue, but closer inspection reveals many Routes Départementales systems' names to be questionable and able to be shortened. I'll take this to the forum.
Just queried the DB on lab2; turns out that Bourgogne-Franche-Comté Routes Départementales (Haute-Saône)
, at 60 characters, can in fact fit in the DB.
Regardless of how the Python version behaves, this complicates the C++ version quite a bit, as the "error" it's reporting is a false positive.
Fun fun. I think I'll take a break & spend my next couple days of quarantine getting reacquainted with my classic video game systems.
Just queried the DB on lab2; turns out that
Bourgogne-Franche-Comté Routes Départementales (Haute-Saône)
, at 60 characters, can in fact fit in the DB.
I'm guessing this changed some time between MySQL 5.7 & 8.0, along with the transition from 5.7 accepting & truncating overlength values, to 8.0 stopping with an error message...
ERROR 1406 (22001) at line 726: Data too long for column 'fullName' at row 362
(the offending datum being Finland Seututiet (outlined - temp preview to fix active routes)
)I'm bummed that I didn't take a screenshot of travelmapping.net when things were messed up, but ISTR the ñ
being displayed as a �, as in https://github.com/TravelMapping/Web/issues/195. Here's how it appears on lab2:
Seems the thing to do is plan for the most restrictive circumstances, that we're limited by byte counts rather than character counts. Pro: We don't have to get the C++ version to do character counts. Con: We do have to get the Python version to do byte counts.
Edit: OK, it's pretty straightforward:
>>> print(len('ñ'.encode('utf-8')))
2
len('ñ')
to len('ñ'.encode('utf-8'))
checks.OK, finally about ready to go on this. Considering all the small things to watch out for & fix that kept popping up, it'd be wise to break one of everything in HighwayData and make sure everything behaves as expected.
Since the C++ version is all about speed, it writes the .sql file in the background during the graph generation process. Thus we have almost a complete file by the time we reach the ABORTING condition. All that's left are datacheckErrors, graphTypes & graphs. This is sometimes reported as taking 0.1s to write, so I just decided to finish it off before terminating. Not much harm in a junk .sql file hanging around, and it's there completely if you wanna do a post-mortem grep. And who knows, depending on the errors involved, it may even be possible to manually ingest it into a lab2 with no, or even a noreaster with minimal, ill effects. What matters is that the program terminates with an exit code of 1, so localupdate.sh knows to not continue.
A final thought:
How about checking for unrecognized colors in systems.csv? We could have a set of the "official TM" colors, and check each systems.csv datum for membership in this set, catching potential misspellings etc. Just that DBFieldLength
is no longer the best class name for all our constants. Ideas on a better name? Thoughts?
A thought on colors, probably overkill. We could go a bit further, having a colors.csv that defines names and RGB values for the "official" TM colors, introducing a new DB table to store them, the systems table would refer to those entries, and the front-end code would load up its set of colors from that new DB table.
I don't see a big need to specify RGB values, as we can already override them, albeit on a per-tier rather than per-color basis. I can happily agree to "probably overkill", and leave this feature out. It'll be good to get this enhancement out the door.
Before the big stress test gets underway, while I'm in the CSV parsing routes I'm giving the C++ ones a much-needed cleanup, and tightening a couple other minor screws. After, there will be at least one merge conflict to resolve on my end.
There's a remote chance that the MALFORMED_LAT & MALFORMED_LON datachecks can produce overlength info values. I included checks for this, but the Python version is not 100% foolproof. Rather than spend time on a fix and then gunk up the commit history with obsolete code that will just be overwritten again in a few days, I think I'll scrap these changes (in both languages) & go no-build pending a cleanup per #299.
cd ~/TravelMapping/yakra/DataProcessing/siteupdate/cplusplus/
g++ siteupdate.cpp -std=c++11 -pthread -o siteupdate_const 2>&1
./siteupdate_const -t 4 -l logs -n nmp_merged -c stats -g graphs -u /home/yakra/TravelMapping/UserData/list_files -w /home/yakra/TravelMapping/HighwayData | tee siteupdate_m1.log
grep -n ABORTING siteupdate_m1.log; AbortLineCPP=$(grep -n ABORTING siteupdate_m1.log | cut -f1 -d:)
tail -n +$AbortLineCPP siteupdate_m1.log | less
cd ../python-teresco
./siteupdate.py -l logs -n nmp_merged -c stats -g graphs -u /home/yakra/TravelMapping/UserData/list_files -w /home/yakra/TravelMapping/HighwayData -t 1 | tee siteupdate_s1.log
grep -n ABORTING siteupdate_s1.log; AbortLinePY=$(grep -n ABORTING siteupdate_s1.log | cut -f1 -d:)
tail -n +$AbortLinePY siteupdate_s1.log | less
cd ..
diff <(tail -n +`expr $AbortLineCPP + 1` cplusplus/siteupdate_m1.log | sed 's~^[0-9]\+: ~~' | head -n 40) <(tail -n +`expr $AbortLinePY + 1` python-teresco/siteupdate_s1.log | sed 's~^[0-9]\+: ~~' | head -n 40)
# look up country and continent, add index into those arrays
# in case they're needed for lookups later (not needed for DB)
for i in range(len(countries)):
country = countries[i][0]
if country == fields[2]:
fields.append(i)
break
if len(fields) != 6:
el.add_error("Could not find country matching regions.csv line: " + line)
continue
for i in range(len(continents)):
continent = continents[i][0]
if continent == fields[3]:
fields.append(i)
break
if len(fields) != 7:
el.add_error("Could not find continent matching regions.csv line: " + line)
continue
I've retained the lookups for matching countries & continents. Any objections if I nix storing the indices as part of the fields
array?
if len(line) == 0 or line.startswith('#'):
~if (line.empty() || line[0] == '#')
~HighwayData
breakages are as expected. I know I whiffed at least 4 of'em.rg_str
part of the Route
object; use it to load & process WPTs
HighwaySystem
ctor accordinglyI've retained the lookups for matching countries & continents. Any objections if I nix storing the indices as part of the
fields
array?
I am not sure why they were there in the first place, so I suppose no harm in getting rid of them.
# in case they're needed for lookups later (not needed for DB)
Sounds like a case of planning for the future, not fully knowing yet what's going to be done with the data.
I did simillar when starting out the C++ version, looking up & storing a pointer the the continent & country std::pair
s. Nothing much is done with them, but the only less RAM-intensive option would have been to include a C char array in the Region object. :)
A system should be put in place where all entries from TM data files that end up in DB fields are checked for length to avoid problems like https://github.com/TravelMapping/Web/issues/195 .
A set of constants should be introduced to the site update program to make sure these values are consistent among all checks and when DB tables are created.