Data check for wps with "too many words" (> 3)

michihdeu commented 1 year ago

https://forum.travelmapping.net/index.php?topic=5320.msg29991#msg29991

AveBisJorLem label violates the "too many words" rule. https://travelmapping.net/devel/manual/wayptlabels.php#truncate

Thanks to @Markkos1992 for pointing it out! 😄

What's the algo? 4x capital + lowercase letter combos?

Markkos1992 commented 1 year ago

Ask @yakra 🤣

yakra commented 1 year ago

What's the algo? 4x capital + lowercase letter combos?

In its simplest form, yes. This may be the best place to start. Probably also consider "words" starting with numerals, so that "JohnPaul2ndAve" would get flagged. Or alternately, just count the number of runs of 1+ lowercase letters? (Stop searching at an underscore, lest we get FPs such as RobMugRd_Kwe.)
Pro: Simple enough to understand & implement. Should avoid many false positives.
Con: Misses some true positives, depending on what our definitions are. If we're just looking for words, defined as "capital + lowercase letter combos", there are other violations of #truncate out there...
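A minimal sketch of the "count runs of lowercase letters" idea, for illustration only. The function name `count_lowercase_runs` is hypothetical and not part of the DataProcessing code; the only rule taken from the comment above is that the scan stops at the first underscore.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: counts runs of 1+ lowercase letters in a label,
// stopping at the first underscore so that suffixes like "_Kwe" are ignored.
std::size_t count_lowercase_runs(const std::string& label) {
    std::size_t runs = 0;
    bool in_run = false;  // true while the previous character was lowercase
    for (char ch : label) {
        if (ch == '_') break;
        bool lower = std::islower(static_cast<unsigned char>(ch)) != 0;
        if (lower && !in_run) ++runs;  // a new run starts here
        in_run = lower;
    }
    return runs;
}
```

Under this count, AveBisJorLem and JohnPaul2ndAve both reach 4 runs, while RobMugRd_Kwe stays at 3 because the scan ends at the underscore.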

Search https://forum.travelmapping.net/index.php?topic=3245 for Too many words...

An older style for 3-word road names was to have two truncated words & one initial, e.g., Lisbon Falls Village Rd -> LisFalVRd. This was later deprecated in favor of the rule quoted above, which yields LisFalRd, LFVilRd, etc.

Searching for these cases resulted in a couple pretty epic regexes. I could break down what they're searching for, how & why, but that's probably TLDR. If anyone asks I will. And this is where the FPs really started to fly in. In particular, capital letters in the middle of a bona fide truncation can trip things up. For example, Douglas/LeJeune Connector -> DouLeJCon. My search yielded so many McSomething -> McS cases that I added another regex to the shell command in the forum post to filter them out.

Maybe for these cases -- I'm just brainstorming here -- keep a running count of:
• runs of lowercase letters
• capital letters, runs of numerals, or maybe even non-lowercase characters, whatever we end up doing...

If the lowercase count reaches 4, flag an error. If the loop finishes with a lowercase count of 3, flag if the caps count is >3. Then, looking at just the "3" cases, see what we have for FPs and figure out where to go from there.

OTOH, maybe the simplest option for preserving our sanity is to look the other way for these cases, as they were once allowed in CHM?

Gray areas & potential FPs

The fact that I have LakeOntPkwySpr & NiaScePkwySpr labels has always given me a bit of indigestion. The idea here is that LakeOntPkwy is a legit route name, truncated in the same way labels are truncated. Add a banner to that and we get LakeOntPkwySpr. But that's too many words for a label! So like, do we make exceptions for routes in the HB or something? Or just make sure to name routes in a way that will yield acceptable waypoint labels?
ISTR this issue came up when @Duke87ofST proposed an Oklahoma Turnpikes system, and it might have gotten some discussion. I'm too lazy to look up the forum thread ATM. 🤣 Then, OklaDOT assigned numbered designations to everything that would have been in that system, they were included in usaok and the issue became moot.
Stuff like ToBakWilPkwy is a gray area. If we're using trailblazer labels, we're not doing a truncated, visible cross road name, are we? To what degree do those rules even take effect? Does the word count limit apply before or after truncation?

michihdeu commented 1 year ago

Yep, "To" should not count.... RouABCDEF also conforms to our rules, which rules out simple "capital letter counting".

yakra commented 1 year ago

Yes. FooABCDEFSt, with 2 lowercase runs, is also legal per #truncate:

Pick out one important word besides the road type and use it and the initials of the other words: Martin Luther King Boulevard becomes MLKingBlvd. Two words in total are included in shortened form along with initials of the rest.

Thus we only need to worry about counting caps when there are 3+ lowercase runs.

[x] We could skip over "To" unless the next char is lowercase.
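The "skip over To" idea could look roughly like this. This is a sketch under assumptions: `skip_leading_to` is a hypothetical name, and it only returns the offset at which word counting would start.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns 2 if the label starts with a trailblazer-style
// "To" that should not count as a word, else 0. "To" followed by a lowercase
// letter (e.g. "TownRd") is treated as the start of an ordinary word.
std::size_t skip_leading_to(const std::string& label) {
    bool has_to = label.compare(0, 2, "To") == 0;
    bool next_is_lower = label.size() > 2 &&
                         std::islower(static_cast<unsigned char>(label[2])) != 0;
    return (has_to && !next_is_lower) ? 2 : 0;
}
```

With this rule, ToBakWilPkwy would be scanned from "BakWilPkwy" onward, while TownRd would be scanned from the beginning.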

jteresco commented 1 year ago

Part of me would like to see such a datacheck get implemented to fix up as many as we can. But then again, it's not an especially problematic situation to have some non-compliant labels hanging around until someone notices. If there are a lot of special cases that will lead to FPs, maybe it's going to be more trouble than it's worth. Go for it if you'd like, or put it aside if you'd rather not go down this path.

Duke87ofST commented 1 year ago

I'd tend to agree that "LakeOntPkwySpr" should be a valid label. To algorithmically exclude this we'd need to treat generic+banner at the end as one word. A list of common generics and common banners wouldn't catch every case of this but would reduce the FP load, so worth a thought.
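To make the "treat generic+banner at the end as one word" idea concrete, here is a rough sketch. Everything in it is an assumption for illustration: the function name `fold_trailing_banner` is hypothetical, and the two lists are short made-up samples, not lists taken from the Travel Mapping manual or code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, deliberately incomplete sample lists.
const std::vector<std::string> kGenerics = {"Pkwy", "Tpk", "Hwy", "Fwy", "Expy"};
const std::vector<std::string> kBanners  = {"Spr", "Alt", "Bus", "Byp", "Trk"};

bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// If the label ends in <generic><banner>, drop the banner so the pair is
// counted as a single word by whatever word count runs afterwards.
// Labels that do not match are returned unchanged.
std::string fold_trailing_banner(const std::string& label) {
    for (const std::string& banner : kBanners) {
        if (!ends_with(label, banner)) continue;
        std::string stem = label.substr(0, label.size() - banner.size());
        for (const std::string& generic : kGenerics)
            if (ends_with(stem, generic)) return stem;
    }
    return label;
}
```

With these sample lists, LakeOntPkwySpr folds to LakeOntPkwy before counting. As noted above, no such list would catch every case.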

yakra commented 1 year ago

Re @Duke87ofST's comment: Possibly more trouble than it's worth. Quite a heavy lift for avoiding a (literal; 5 or fewer AFAIK) handful of FPs. Another option would be to take a page out of LABEL_SELFREF's book and look at .list names and name_no_abbrev()s of intersecting/concurrent routes, but that's still not perfect; double-trumpets etc. could mean the appropriate intersecting route isn't found. Surely simpler to just mark off the few FPs that do occur.

Re @jteresco's comment:

Here's what I'm working with at the moment:

void Waypoint::too_many_words()
{   // too many words in label
    size_t lowruns = 0; // runs of 1+ lowercase letters
    size_t others;      // capitals, runs of digits & any other non-lowercase chars
    const char* c = label.data() + (label[0] == '*');       // skip a leading '*'
    if (*c == 'T' && c[1] == 'o' && !islower(c[2])) c += 2; // skip "To" unless the next char is lowercase
    // only the part of the label before the first '_', '(' or '/' is examined
    for (others = !islower(*c++); *c && *c != '_' && *c != '(' && *c != '/'; ++c)
    {   if  (islower(*c))   lowruns += !islower(c[-1]);
        else if (isdigit(*c))   others  += !isdigit(c[-1]);
        else ++others;
        if (lowruns == 4) return Datacheck::add(route, label, "", "", "TMW4", "");
    }
    if (lowruns == 3 && others > 3) Datacheck::add(route, label, "", "", "TMW3", std::to_string(others));
}

The TMW4 and TMW3 error codes are temporary placeholders, useful for filtering on datacheck.php or grepping datacheck.log.

TMW4: The simpler "4+ runs of lowercase letters" method. Very few FPs.
TMW3: This tries to capture AliBobCRd , AliBCarlRd, ABobCarlRd & the like, though there's more potential for FPs here. This one's more experimental. Early development. More work to do.
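For trying the two flavors on individual labels outside of a site update, a standalone sketch along these lines may help. It is not the code in the repo: the free function `too_many_words_code` and its string return value are assumptions made for illustration, and it adds bounds checks for empty or very short labels that the member function above does not have.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical standalone variant: returns "TMW4", "TMW3" or "" (no flag).
std::string too_many_words_code(const std::string& label) {
    auto lower = [](char ch) { return std::islower(static_cast<unsigned char>(ch)) != 0; };
    auto digit = [](char ch) { return std::isdigit(static_cast<unsigned char>(ch)) != 0; };

    std::size_t i = (!label.empty() && label[0] == '*') ? 1 : 0;  // skip a leading '*'
    // skip a leading "To" unless the next character is lowercase
    if (label.compare(i, 2, "To") == 0 &&
        (i + 2 >= label.size() || !lower(label[i + 2]))) i += 2;
    if (i >= label.size()) return "";  // nothing left to examine

    std::size_t lowruns = 0;                 // runs of 1+ lowercase letters
    std::size_t others = !lower(label[i]);   // capitals, digit runs, other chars
    for (++i; i < label.size(); ++i) {
        char ch = label[i];
        if (ch == '_' || ch == '(' || ch == '/') break;  // examine the first part only
        if (lower(ch))      lowruns += !lower(label[i - 1]);
        else if (digit(ch)) others  += !digit(label[i - 1]);
        else                ++others;
        if (lowruns == 4) return "TMW4";
    }
    if (lowruns == 3 && others > 3) return "TMW3";
    return "";
}
```

Traced by hand, AveBisJorLem and LakeOntPkwySpr come back as TMW4; AliBobCRd, AliBCarlRd, ABobCarlRd and the known FP DouLeJCon come back as TMW3; LisFalRd, MLKingBlvd and ToBakWilPkwy are not flagged.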

Implementation could in theory be as simple as changing TMW4 to TOO_MANY_WORDS and commenting out that last line of code w/the TMW3. In practice, that makes a lot of the remaining code unnecessary, and it could be commented out too. :)

I think I'll throw an alpha up on lab2, and post in the forum, including links for each collab based on @michihdeu's post here. Contributors can check out the errors in their regions and comment on what they see, including any FPs. More adventurous souls can filter for the TMW3 flavor and do the same.

One final note: Eagle-eyed C++ hackers will see that, in addition to stopping the search of the label (for (others =...) at an underscore, we also stop at a slash. This keeps out a few labels like US1AltTrk/841AltTrk, US31AltBus/431Bus, N17BypSom/N17BypAge etc. that while not the prettiest, are still acceptable. It also keeps out a lot of labels that definitely have too many words, but at that point a separate datacheck targeting https://travelmapping.net/devel/manual/wayptlabels.php#dropnamed would be a better, clearer option.

jteresco commented 1 year ago

Even if this doesn't end up as a regular datacheck, the experiment should call some attention to some old labels that should be cleaned up. I think for new systems, the peer review process has probably caught most of these before systems went active, and will continue to do so for preview/devel/future systems.

yakra commented 1 year ago

Even if this doesn't end up as a regular datacheck, the experiment should call some attention to some old labels that should be cleaned up.

To that end, I've thought about temporarily taking out the && *c != '/' loop break, as a quick-n-dirty way to bring some slashed named waypoint labels to our attention while there's still no proper datacheck for that.

yakra commented 1 year ago

Looking at how the McBug handles a couple individual labels before fixing it.

| label | lowruns | words | comments |
|---|---|---|---|
| `L_______` `McMMcMRd` `_^______` | 1 | 1 | for loop begins @ index 1. |
| `L_______` `McMMcMRd` `__^>____` | 1 | 1 | Iteration 2: `McM` detected; c incremented per conditional. |
| `L_______` `McMMcMRd` `____^___` | 2 | 1 | Iteration 3: c incremented again per for loop. Word count is now off due to skipping preceding `M`. |
| `<____L__` `McMMcMRd` `_____^__` | 2 | 2 | Iteration 4: `McM` not detected because `last` is still @ beginning. Word count catches up to where it should be. |
| `_____<L_` `McMMcMRd` `______^_` | 2 | 3 | Iteration 5: `MRd` not detected because c[2] is null terminator, not `[A-Z0-9]`. Something to fix? Not a big deal in English, but what about French? |
| `______L_` `McMMcMRd` `_______^` | 3 | 3 | Final iteration (6); loop ends. |

This is of course a clear FP that should be excluded. 3 words, 3 lowercase runs. We got the right result for the wrong reasons.

| label | lowruns | words | comments |
|---|---|---|---|
| `L__________` `LakeCStRPRd` `_^_________` | 1 | 1 | for loop begins @ index 1. |
| `<___L______` `LakeCStRPRd` `____^______` | 1 | 2 | Iteration 4 |
| `____L______` `LakeCStRPRd` `_____^_>___` | 1 | 2 | Iteration 5: `CSt` detected; c += 2 per conditional. lowruns not incremented. Oops. |
| `____<___L__` `LakeCStRPRd` `________^__` | 1 | 3 | Iteration 6: c incremented again per for loop. Word count is now off due to skipping preceding `R`. |
| `________<L_` `LakeCStRPRd` `_________^_` | 1 | 4 | Iteration 7: `PRd` not detected because c[2] is null terminator, not `[A-Z0-9]`. Something to fix? Not a big deal in English, but what about French? As a result, word count catches up to where it should be. |
| `_________L_` `LakeCStRPRd` `__________^` | 2 | 4 | Final iteration (8); loop ends. |

A true error. With the DC behaving as intended, 4 "words" will be detected: Lake, CSt, R & PRd. Those 3-letter ones are really 2 words smushed together, D'Escousse -> DEs style. Even with the bug fixed, the DC will still be imperfect. The word count is, again, right for the wrong reasons. This error is improperly excluded because lowruns failed to reach 3.

[ ] Init word count @ 1. Will there be any diffs from labels starting lowercase?
[ ] Null terminator fix
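For comparison, here is one way the intended word split could be expressed without pointer skipping. It is only a sketch and not the fix used in this issue: `split_label_words` is a hypothetical name, and the two 3-character patterns (McS style and DEs style, each followed by a capital, a digit or the end of the label) are my reading of the conditionals discussed above.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tokenizer: splits a label into "words", folding McS-style and
// DEs-style 3-character prefixes into a single word each.
std::vector<std::string> split_label_words(const std::string& s) {
    auto up = [](char ch) { return std::isupper(static_cast<unsigned char>(ch)) != 0; };
    auto lo = [](char ch) { return std::islower(static_cast<unsigned char>(ch)) != 0; };
    auto di = [](char ch) { return std::isdigit(static_cast<unsigned char>(ch)) != 0; };
    // true at the end of the label or where a new word would start
    auto boundary = [&](std::size_t k) { return k >= s.size() || up(s[k]) || di(s[k]); };

    std::vector<std::string> words;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t j = i + 1;
        // a 3-char candidate needs s[i..i+2] to exist and a boundary right after it
        bool three = i + 2 < s.size() && up(s[i]) && boundary(i + 3);
        if (three && lo(s[i + 1]) && up(s[i + 2])) {
            j = i + 3;                                  // McS style
        } else if (three && up(s[i + 1]) && lo(s[i + 2])) {
            j = i + 3;                                  // DEs style
        } else if (di(s[i])) {
            while (j < s.size() && di(s[j])) ++j;       // run of digits
        } else {
            while (j < s.size() && lo(s[j])) ++j;       // one char + trailing lowercase
        }
        words.push_back(s.substr(i, j - i));
        i = j;
    }
    return words;
}
```

Traced by hand, McMMcMRd splits into McM, McM, Rd (3 words) and LakeCStRPRd into Lake, CSt, R, PRd (4 words), matching the counts described above. Treating the end of the label as a boundary is what lets PRd be matched, which is the point of the "null terminator fix" item.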

michihdeu commented 1 year ago

Will there be any diffs from labels starting lowercase?

Does it matter? Labels should not start with lowercase: #488

yakra commented 1 year ago

It doesn't matter, really. Not for the datacheck. It's just a question to satisfy my own curiosity during development. Having this on my mind, I did some grepping, and came up with the same stuff as in #488 plus 2 new ones:

FRA-BRE/frabred35/frabre.d006435.wpt:rueEgl http://www.openstreetmap.org/?lat=48.633568&lon=-2.107369
FRA-OCC/fraoccd31/fraocc.d005731lab.wpt:rueJeanIng http://www.openstreetmap.org/?lat=43.517420&lon=1.502059

I'd forgotten that issue existed; thanks for bringing it back to my attention.

yakra commented 1 year ago

Dropping this here for reference purposes, "saving my place" if you will. It's the 1st attempt at "McDonald/D'Escousse detection", with the bug (as explored above) in place, before implementing a fix.

void Waypoint::too_many_words()
{   // too many words in label
    size_t lowruns = 0;
    size_t words;
    const char* c = label.data() + (label[0] == '*');
    if (*c == 'T' && c[1] == 'o' && !islower(c[2])) c += 2;
    const char* last = c;
    for (words = !islower(*c++); *c && *c != '_' && *c != '(' && *c != '/'; ++c)
    {   if  (islower(*c))   lowruns += !islower(c[-1]);
        else if (isdigit(*c))
        {   if (!isdigit(c[-1]))
            {   ++words;
                last = c;
            }
        }   //  v~~ These next 2 lines look like the bug. Don't increment c, or do set words/last/lowruns.
        else {  if  ( c == last+2 && (isupper(c[1]) || isdigit(c[1])) ) ++c;
                else if ( c == last+1 && islower(c[1]) && (isupper(c[2]) || isdigit(c[2])) ) c += 2;
                else {  ++words;
                        last = c;
                     }
             }
        if (lowruns == 4) return Datacheck::add(route, label, "", "", "TMW4", "");
    }
    if (lowruns == 3 && words > 3) Datacheck::add(route, label, "", "", "TMW3", std::to_string(words));
}

TravelMapping / DataProcessing

Data check for wps with "too many words" (> 3) #545
