Open michihdeu opened 1 year ago
Ask @yakra 🤣
On Thu, Dec 22, 2022 at 12:08 PM Michael @.***> wrote:
https://forum.travelmapping.net/index.php?topic=5320.msg29991#msg29991
AveBisJorLem label violates the "too many words" rule. https://travelmapping.net/devel/manual/wayptlabels.php#truncate
Thanks to @Markkos1992 https://github.com/Markkos1992 for pointing it out! 😄
What's the algo? 4x capital + lowercase letter combos?
-- Mark Moore
What's the algo? 4x capital + lowercase letter combos?
In its simplest form, yes. This may be the best place to start.
Probably also consider "words" starting with numerals, so that "JohnPaul2ndAve" would get flagged.
Or alternately, just count the number of runs of 1+ lowercase letters?
(Stop searching at an underscore, lest we get FPs such as RobMugRd_Kwe.)
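A minimal sketch of the "count the lowercase runs" idea, assuming a standalone helper rather than anything in the site update code. The name count_lower_runs is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical helper: counts runs of 1+ lowercase letters in a label.
// Scanning stops at the first underscore, so a suffix like "_Kwe" is ignored.
std::size_t count_lower_runs(std::string_view label)
{
    std::size_t runs = 0;
    bool in_run = false;
    for (char ch : label)
    {
        if (ch == '_') break;
        bool lower = std::islower(static_cast<unsigned char>(ch)) != 0;
        if (lower && !in_run) ++runs;   // a new lowercase run starts here
        in_run = lower;
    }
    return runs;
}
```

Under these assumptions, AveBisJorLem and JohnPaul2ndAve both yield 4 runs, while RobMugRd_Kwe yields 3 because the scan stops at the underscore.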
Pro: Simple enough to understand & implement. Should avoid many false positives.
Con: Misses true positives, depending on what our definitions are. If we're just looking for words, defined as "capital + lowercase letter combos", there are other violations of #truncate out there...
Search https://forum.travelmapping.net/index.php?topic=3245 for Too many words
...
An older style for 3-word road names was to have Two truncated words & one initial. E.G., Lisbon Falls Village Rd -> LisFalVRd. This was later deprecated in favor of the rule quoted above. LisFalRd, LFVilRd, etc.
Searching for these cases resulted in a couple pretty epic regexes. I could break down what they're searching for, how & why, but that's probably TLDR. If anyone asks I will. And this is where the FPs really started to fly in. In particular, capital letters in the middle of a bona fide truncation can trip things up. For example, Douglas/LeJeune Connector -> DouLeJCon. My search yielded so many McSomething -> McS cases that I added another regex to the shell command in the forum post to filter them out.
Maybe for these cases -- I'm just brainstorming here -- keep a running count of:
• runs of lowercase letters
• capital letters, runs of numerals, or maybe even non-lowercase characters, whatever we end up doing...
If the lowercase count reaches 4, flag an error. If the loop finishes with a lowercase count of 3, flag if the caps count is >3. Then, looking at just the "3" cases, see what we have for FPs and figure out where to go from there.
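The brainstormed two-counter rule could be sketched roughly as follows. Tally, tally_label, Verdict and judge are hypothetical names, and treating every non-lowercase, non-digit character as one "other" is just one of the options floated above.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical types and names; only meant to mirror the counting rule above.
struct Tally { std::size_t lowruns = 0; std::size_t others = 0; };

Tally tally_label(std::string_view label)
{
    Tally t;
    char prev = '\0';
    for (char ch : label)
    {
        if (ch == '_') break;                               // ignore any suffix
        unsigned char u = ch, p = prev;
        if (std::islower(u))      t.lowruns += !std::islower(p);  // start of a lowercase run
        else if (std::isdigit(u)) t.others  += !std::isdigit(p);  // start of a digit run
        else                      ++t.others;                     // capital or other character
        prev = ch;
    }
    return t;
}

enum class Verdict { Ok, ReviewThree, TooMany };

Verdict judge(const Tally& t)
{
    if (t.lowruns >= 4) return Verdict::TooMany;                   // 4 lowercase runs: error
    if (t.lowruns == 3 && t.others > 3) return Verdict::ReviewThree; // the "3" cases to inspect
    return Verdict::Ok;
}
```

With this sketch LisFalVRd lands in the review bucket, but so does DouLeJCon, which is exactly the kind of FP described above. LisFalRd comes out clean.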
OTOH, maybe the simplest option for preserving our sanity is to look the other way for these cases, as they were once allowed in CHM?
The LakeOntPkwySpr & NiaScePkwySpr labels have always given me a bit of indigestion.
The idea here is that LakeOntPkwy is a legit route name, truncated in the same way labels are truncated. Add a banner to that and we get LakeOntPkwySpr. But that's too many words for a label! So like, do we make exceptions for routes in the HB or something? Or just make sure to name routes in a way that will yield acceptable waypoint labels?
ISTR this issue came up when @Duke87ofST proposed an Oklahoma Turnpikes system, and it might have gotten some discussion. I'm too lazy to look up the forum thread ATM. 🤣 Then, OklaDOT assigned numbered designations to everything that would have been in that system, they were included in usaok, and the issue became moot.
Yep, "To" should not count.... RouABCDEF also complies with our rules, which rules out simple "capital letter counting".
Yes. FooABCDEFSt, with 2 lowercase runs, is also legal per #truncate:
- Pick out one important word besides the road type and use it and the initials of the other words: Martin Luther King Boulevard becomes MLKingBlvd. Two words in total are included in shortened form along with initials of the rest.
Thus we only need to worry about counting caps when there are 3+ lowercase runs.
Part of me would like to see such a datacheck get implemented to fix up as many as we can. But then again, it's not an especially problematic situation to have some non-compliant labels hanging around until someone notices. If there are a lot of special cases that will lead to FPs, maybe it's going to be more trouble than it's worth. Go for it if you'd like, or put it aside if you'd rather not go down this path.
I'd tend to agree that "LakeOntPkwySpr" should be a valid label. To algorithmically exclude this we'd need to treat generic+banner at the end as one word. A list of common generics and common banners wouldn't catch every case of this but would reduce the FP load, so worth a thought.
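As a rough sketch of the "generic+banner counts as one word" idea, assuming two small lookup lists: the entries in kGenerics and kBanners are illustrative guesses, not lists taken from the project, and every function name here is hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Illustrative lists only; a real check would need the project's own abbreviations.
constexpr std::array<std::string_view, 4> kGenerics{"Pkwy", "Hwy", "Tpk", "Fwy"};
constexpr std::array<std::string_view, 4> kBanners{"Spr", "Bus", "Alt", "Byp"};

bool ends_with(std::string_view s, std::string_view suffix)
{
    return s.size() >= suffix.size() &&
           s.substr(s.size() - suffix.size()) == suffix;
}

// True if the label ends in <generic><banner>, e.g. "PkwySpr".
bool ends_with_generic_banner(std::string_view label)
{
    for (std::string_view banner : kBanners)
    {
        if (!ends_with(label, banner)) continue;
        std::string_view head = label.substr(0, label.size() - banner.size());
        for (std::string_view generic : kGenerics)
            if (ends_with(head, generic)) return true;
    }
    return false;
}

// Word count after merging a trailing generic+banner pair into a single word.
std::size_t adjusted_words(std::size_t words, std::string_view label)
{
    return (words > 0 && ends_with_generic_banner(label)) ? words - 1 : words;
}
```

Here LakeOntPkwySpr and NiaScePkwySpr would drop from 4 words to 3, while AveBisJorLem stays at 4. As noted, a list like this cannot catch every case.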
Re @Duke87ofST's comment:
Possibly more trouble than it's worth. Quite a heavy lift for avoiding a (literal; 5 or fewer AFAIK) handful of FPs.
Another option would be to take a page out of LABEL_SELFREF's book and look at .list names and name_no_abbrev()s of intersecting/concurrent routes, but that's still not perfect; double-trumpets etc. could mean the appropriate intersecting route isn't found.
Surely simpler to just mark off the few FPs that do occur.
Re @jteresco's comment:
Here's what I'm working with at the moment:
void Waypoint::too_many_words()
{   // too many words in label
    size_t lowruns = 0;     // runs of lowercase letters
    size_t others;          // capitals, runs of digits, and any other characters
    const char* c = label.data() + (label[0] == '*');           // skip a leading '*'
    if (*c == 'T' && c[1] == 'o' && !islower(c[2])) c += 2;     // skip a leading "To"
    for (others = !islower(*c++); *c && *c != '_' && *c != '(' && *c != '/'; ++c)
    {   if (islower(*c)) lowruns += !islower(c[-1]);
        else if (isdigit(*c)) others += !isdigit(c[-1]);
        else ++others;
        if (lowruns == 4) return Datacheck::add(route, label, "", "", "TMW4", "");
    }
    if (lowruns == 3 && others > 3) Datacheck::add(route, label, "", "", "TMW3", std::to_string(others));
}
The TMW4 and TMW3 error codes are temporary placeholders, useful for filtering on datacheck.php or grepping datacheck.log.
Implementation could in theory be as simple as changing TMW4 to TOO_MANY_WORDS and commenting out that last line of code w/the TMW3. In practice, that makes a lot of the remaining code unnecessary, and it could be commented out too. :)
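For illustration, a stripped-down variant along those lines might look like the sketch below. It is written as a hypothetical free function that returns a bool rather than calling Datacheck::add, and it counts a lowercase run that begins at the first scanned character, which the loop above does not.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical standalone version: true once a 4th lowercase run is seen.
bool too_many_words(const std::string& label)
{
    auto lower = [](char ch) { return std::islower(static_cast<unsigned char>(ch)) != 0; };

    // Skip a leading '*', then a leading "To" that is not the start of a longer word.
    std::size_t i = (!label.empty() && label[0] == '*') ? 1 : 0;
    if (label.compare(i, 2, "To") == 0 && (i + 2 >= label.size() || !lower(label[i + 2])))
        i += 2;

    std::size_t lowruns = 0;
    bool in_run = false;
    for (; i < label.size(); ++i)
    {
        char ch = label[i];
        if (ch == '_' || ch == '(' || ch == '/') break;     // same stop characters as above
        bool is_low = lower(ch);
        if (is_low && !in_run && ++lowruns == 4) return true;
        in_run = is_low;
    }
    return false;
}
```

With this sketch AveBisJorLem is flagged, ToFooBarBaz is not (the leading "To" is skipped), and US1AltTrk/841AltTrk is not, since scanning stops at the slash. LakeOntPkwySpr is still flagged, so the FP question remains.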
I think I'll throw an alpha up on lab2, and post in the forum, including links for each collab based on @michihdeu's post here.
Contributors can check out the errors in their regions and comment on what they see, including any FPs.
More adventurous souls can filter for the TMW3 flavor and do the same.
One final note:
Eagle-eyed C++ hackers will see that, in addition to stopping the search of the label (for (others = ...) at an underscore, we also stop at a slash. This keeps out a few labels like US1AltTrk/841AltTrk, US31AltBus/431Bus, N17BypSom/N17BypAge etc. that while not the prettiest, are still acceptable.
It also keeps out a lot of labels that definitely have too many words, but at that point a separate datacheck targeting https://travelmapping.net/devel/manual/wayptlabels.php#dropnamed would be a better, clearer option.
Even if this doesn't end up as a regular datacheck, the experiment should call some attention to some old labels that should be cleaned up. I think for new systems, the peer review process has probably caught most of these before systems went active, and will continue to do so for preview/devel/future systems.
Even if this doesn't end up as a regular datacheck, the experiment should call some attention to some old labels that should be cleaned up.
To that end, I've thought about temporarily taking out the && *c != '/' loop break, as a quick-n-dirty way to bring some slashed named waypoint labels to our attention while there's still no proper datacheck for that.
Looking at how the McBug handles a couple individual labels before fixing it.
| label | lowruns | words | comments |
|---|---|---|---|
| `L_______` `McMMcMRd` `_^______` | 1 | 1 | for loop begins @ index 1. |
| `L_______` `McMMcMRd` `__^>____` | 1 | 1 | Iteration 2: McM detected; c incremented per conditional. |
| `L_______` `McMMcMRd` `____^___` | 2 | 1 | Iteration 3: c incremented again per for loop. Word count is now off due to skipping preceding M. |
| `<____L__` `McMMcMRd` `_____^__` | 2 | 2 | Iteration 4: McM not detected because last is still @ beginning. Word count catches up to where it should be. |
| `_____<L_` `McMMcMRd` `______^_` | 2 | 3 | Iteration 5: MRd not detected because c[2] is null terminator, not [A-Z0-9]. Something to fix? Not a big deal in English, but what about French? |
| `______L_` `McMMcMRd` `_______^` | 3 | 3 | Final iteration (6); loop ends. |
This is of course a clear FP that should be excluded. 3 words, 3 lowercase runs. We got the right result for the wrong reasons.
| label | lowruns | words | comments |
|---|---|---|---|
| `L__________` `LakeCStRPRd` `_^_________` | 1 | 1 | for loop begins @ index 1. |
| `<___L______` `LakeCStRPRd` `____^______` | 1 | 2 | Iteration 4 |
| `____L______` `LakeCStRPRd` `_____^_>___` | 1 | 2 | Iteration 5: CSt detected; c += 2 per conditional. lowruns not incremented. Oops. |
| `____<___L__` `LakeCStRPRd` `________^__` | 1 | 3 | Iteration 6: c incremented again per for loop. Word count is now off due to skipping preceding R. |
| `________<L_` `LakeCStRPRd` `_________^_` | 1 | 4 | Iteration 7: PRd not detected because c[2] is null terminator, not [A-Z0-9]. Something to fix? Not a big deal in English, but what about French? As a result, word count catches up to where it should be. |
| `_________L_` `LakeCStRPRd` `__________^` | 2 | 4 | Final iteration (8); loop ends. |
A true error. With the DC behaving as intended, 4 "words" will be detected: Lake, CSt, R & PRd. Those 3-letter ones are really 2 words smushed together, D'Escousse -> DEs style. Even with the bug fixed, the DC will still be imperfect.
The word count is, again, right for the wrong reasons.
This error is improperly excluded because lowruns failed to reach 3.
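One way to sidestep the pointer-skipping problem, sketched here purely as an assumption and not taken from the project, is to split the label into words first and count lowercase runs in a separate pass. All names are hypothetical. The sketch treats the end of the label as a word boundary, and it expects a label already trimmed of any prefix or suffix.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helpers; unsigned char casts keep the <cctype> calls well defined.
inline bool is_lo(char c) { return std::islower(static_cast<unsigned char>(c)) != 0; }
inline bool is_up(char c) { return std::isupper(static_cast<unsigned char>(c)) != 0; }
inline bool is_dg(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }

// A word may end at index i if the label ends there or s[i] is not lowercase.
inline bool boundary(std::string_view s, std::size_t i) { return i >= s.size() || !is_lo(s[i]); }

std::vector<std::string> split_words(std::string_view s)
{
    std::vector<std::string> words;
    for (std::size_t i = 0; i < s.size(); )
    {
        std::size_t len = 1;
        bool room = i + 2 < s.size() && is_up(s[i]) && boundary(s, i + 3);
        if (room && is_lo(s[i + 1]) && is_up(s[i + 2])) len = 3;        // "McM" shape
        else if (room && is_up(s[i + 1]) && is_lo(s[i + 2])) len = 3;   // "DEs" shape
        else
        {
            if (is_dg(s[i]))
            {
                while (i + len < s.size() && is_dg(s[i + len])) ++len;  // rest of a digit run
            }
            while (i + len < s.size() && is_lo(s[i + len])) ++len;      // lowercase tail
        }
        words.emplace_back(s.substr(i, len));
        i += len;                       // the index only ever advances by whole words
    }
    return words;
}

std::size_t lower_runs(std::string_view s)
{
    std::size_t runs = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (is_lo(s[i]) && (i == 0 || !is_lo(s[i - 1]))) ++runs;
    return runs;
}

// The "TMW3" condition: exactly 3 lowercase runs and more than 3 words.
bool flag_tmw3(std::string_view s) { return lower_runs(s) == 3 && split_words(s).size() > 3; }
```

Under these rules McMMcMRd splits into McM, McM, Rd (3 words, not flagged) and LakeCStRPRd into Lake, CSt, R, PRd (4 words, 3 lowercase runs, flagged). It remains imperfect: a trailing VRd in a label like LisFalVRd would be read as a single "DEs"-style word.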
Will there be any diffs from labels starting lowercase?
Does it matter? Labels should not start with lowercase: #488
It doesn't matter, really. Not for the datacheck. It's just a question to satisfy my own curiosity during development. Having this on my mind, I did some grepping, and came up with the same stuff as in #488 plus 2 new ones:
FRA-BRE/frabred35/frabre.d006435.wpt:rueEgl http://www.openstreetmap.org/?lat=48.633568&lon=-2.107369
FRA-OCC/fraoccd31/fraocc.d005731lab.wpt:rueJeanIng http://www.openstreetmap.org/?lat=43.517420&lon=1.502059
I'd forgotten that issue existed; thanks for bringing it back to my attention.
Dropping this here for reference purposes, "saving my place" if you will. It's the 1st attempt at "McDonald/D'Escousse detection", with the bug (as explored above) in place, before implementing a fix.
void Waypoint::too_many_words()
{   // too many words in label
    size_t lowruns = 0;
    size_t words;
    const char* c = label.data() + (label[0] == '*');
    if (*c == 'T' && c[1] == 'o' && !islower(c[2])) c += 2;
    const char* last = c;
    for (words = !islower(*c++); *c && *c != '_' && *c != '(' && *c != '/'; ++c)
    {   if (islower(*c)) lowruns += !islower(c[-1]);
        else if (isdigit(*c))
        {   if (!isdigit(c[-1]))
            {   ++words;
                last = c;
            }
        }   // v~~ These next 2 lines look like the bug. Don't increment c, or do set words/last/lowruns.
        else {  if ( c == last+2 && (isupper(c[1]) || isdigit(c[1])) ) ++c;
                else if ( c == last+1 && islower(c[1]) && (isupper(c[2]) || isdigit(c[2])) ) c += 2;
                else {  ++words;
                        last = c;
                     }
             }
        if (lowruns == 4) return Datacheck::add(route, label, "", "", "TMW4", "");
    }
    if (lowruns == 3 && words > 3) Datacheck::add(route, label, "", "", "TMW3", std::to_string(words));
}