TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project

Data check for wps with "too many words" (> 3) #545

Open michihdeu opened 1 year ago

michihdeu commented 1 year ago

https://forum.travelmapping.net/index.php?topic=5320.msg29991#msg29991

AveBisJorLem label violates the "too many words" rule. https://travelmapping.net/devel/manual/wayptlabels.php#truncate

Thanks to @Markkos1992 for pointing it out! 😄

What's the algo? 4x capital + lowercase letter combos?

Markkos1992 commented 1 year ago

Ask @yakra 🤣


yakra commented 1 year ago

What's the algo? 4x capital + lowercase letter combos?

In its simplest form, yes. This may be the best place to start. Probably also consider "words" starting with numerals, so that "JohnPaul2ndAve" would get flagged. Or alternatively, just count the number of runs of 1+ lowercase letters? (Stop searching at an underscore, lest we get FPs such as RobMugRd_Kwe.)

Pro: Simple enough to understand & implement. Should avoid many false positives.

Con: Misses some true positives, depending on what our definitions are. If we're just looking for words, defined as "capital + lowercase letter combos", there are other violations of #truncate out there...
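As a rough illustration of the "count runs of lowercase letters" idea, here is a minimal C++ sketch. The helper names count_lowercase_runs and has_too_many_words are hypothetical and not part of the DataProcessing code; the threshold of more than 3 runs is taken from the issue title.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the DataProcessing codebase:
// counts runs of 1+ lowercase letters, stopping at the first underscore.
std::size_t count_lowercase_runs(const std::string& label)
{   std::size_t runs = 0;
    bool in_run = false;
    for (unsigned char ch : label)
    {   if (ch == '_') break;               // ignore suffixes such as _Kwe
        bool lower = std::islower(ch) != 0;
        if (lower && !in_run) ++runs;       // first letter of a new run
        in_run = lower;
    }
    return runs;
}

// Assumption: a label is suspect when it has more than 3 lowercase runs.
bool has_too_many_words(const std::string& label)
{   return count_lowercase_runs(label) > 3;
}
```

Under these assumptions, AveBisJorLem and JohnPaul2ndAve each have 4 lowercase runs and would be flagged, while RobMugRd_Kwe stops at the underscore with 3 runs and would not.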

Search https://forum.travelmapping.net/index.php?topic=3245 for Too many words...

An older style for 3-word road names was to have two truncated words & one initial, e.g., Lisbon Falls Village Rd -> LisFalVRd. This was later deprecated in favor of the rule quoted above, which gives LisFalRd, LFVilRd, etc.

Searching for these cases resulted in a couple pretty epic regexes. I could break down what they're searching for, how & why, but that's probably TLDR. If anyone asks I will. And this is where the FPs really started to fly in. In particular, capital letters in the middle of a bona fide truncation can trip things up. For example, Douglas/LeJeune Connector -> DouLeJCon. My search yielded so many McSomething -> McS cases that I added another regex to the shell command in the forum post to filter them out.

Maybe for these cases -- I'm just brainstorming here -- keep a running count of:
• runs of lowercase letters
• capital letters, runs of numerals, or maybe even non-lowercase characters, whatever we end up doing...

If the lowercase count reaches 4, flag an error. If the loop finishes with a lowercase count of 3, flag if the caps count is >3. Then, looking at just the "3" cases, see what we have for FPs and figure out where to go from there.

OTOH, maybe the simplest option for preserving our sanity is to look the other way for these cases, as they were once allowed in CHM?


Gray areas & potential FPs

michihdeu commented 1 year ago

Yep, "To" should not count.... RouABCDEF also complies with our rules, which rules out simple "capital letter counting".

yakra commented 1 year ago

Yes. FooABCDEFSt, with 2 lowercase runs, is also legal per #truncate:

  1. Pick out one important word besides the road type and use it and the initials of the other words: Martin Luther King Boulevard becomes MLKingBlvd. Two words in total are included in shortened form along with initials of the rest.

Thus we only need to worry about counting caps when there are 3+ lowercase runs.


jteresco commented 1 year ago

Part of me would like to see such a datacheck get implemented to fix up as many as we can. But then again, it's not an especially problematic situation to have some non-compliant labels hanging around until someone notices. If there are a lot of special cases that will lead to FPs, maybe it's going to be more trouble than it's worth. Go for it if you'd like, or put it aside if you'd rather not go down this path.

Duke87ofST commented 1 year ago

I'd tend to agree that "LakeOntPkwySpr" should be a valid label. To algorithmically exclude this we'd need to treat generic+banner at the end as one word. A list of common generics and common banners wouldn't catch every case of this but would reduce the FP load, so worth a thought.
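To make the "generic+banner as one word" idea concrete, here is a small C++ sketch. Everything in it is an assumption for illustration: the two lists are short, made-up samples rather than an agreed set, and ends_with_generic_and_banner is a hypothetical helper, not existing DataProcessing code.

```cpp
#include <string>

// Hypothetical, deliberately incomplete sample lists, for illustration only.
const char* const generics[] = {"Pkwy", "Rd", "St", "Ave", "Blvd", "Hwy"};
const char* const banners[]  = {"Spr", "Alt", "Bus", "Byp", "Trk"};

// True if s ends with suffix.
bool ends_with(const std::string& s, const std::string& suffix)
{   return s.size() >= suffix.size()
        && s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// True if the label ends in a generic immediately followed by a banner,
// e.g. ...PkwySpr. A caller could then count that pair as a single word.
bool ends_with_generic_and_banner(const std::string& label)
{   for (const char* b : banners)
    {   std::string banner(b);
        if (!ends_with(label, banner)) continue;
        std::string rest = label.substr(0, label.size() - banner.size());
        for (const char* g : generics)
            if (ends_with(rest, g)) return true;
    }
    return false;
}
```

With these sample lists, LakeOntPkwySpr matches (Pkwy + Spr) and AveBisJorLem does not. As noted in the comment, lists like these would not catch every case.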

yakra commented 1 year ago

Re @Duke87ofST's comment: Possibly more trouble than it's worth. Quite a heavy lift for avoiding a (literal; 5 or fewer AFAIK) handful of FPs. Another option would be to take a page out of LABEL_SELFREF's book and look at .list names and name_no_abbrev()s of intersecting/concurrent routes, but that's still not perfect; double-trumpets etc. could mean the appropriate intersecting route isn't found. Surely simpler to just mark off the few FPs that do occur.


Re @jteresco's comment:

Here's what I'm working with at the moment:

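void Waypoint::too_many_words()
{   // too many words in label
    size_t lowruns = 0;     // runs of 1+ lowercase letters
    size_t others;          // capitals, runs of numerals, and other non-lowercase characters
    const char* c = label.data() + (label[0] == '*');   // skip a leading '*'
    if (*c == 'T' && c[1] == 'o' && !islower(c[2])) c += 2;   // a leading "To" does not count
    // stop searching at an underscore, open parenthesis, or slash
    for (others = !islower(*c++); *c && *c != '_' && *c != '(' && *c != '/'; ++c)
    {   if  (islower(*c))   lowruns += !islower(c[-1]);   // count only the start of each lowercase run
        else if (isdigit(*c))   others  += !isdigit(c[-1]);   // count only the start of each numeral run
        else ++others;
        if (lowruns == 4) return Datacheck::add(route, label, "", "", "TMW4", "");
    }
    if (lowruns == 3 && others > 3) Datacheck::add(route, label, "", "", "TMW3", std::to_string(others));
}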

The TMW4 and TMW3 error codes are temporary placeholders, useful for filtering on datacheck.php or grepping datacheck.log.

Implementation could in theory be as simple as changing TMW4 to TOO_MANY_WORDS and commenting out that last line of code w/the TMW3. In practice, that makes a lot of the remaining code unnecessary, and it could be commented out too. :)

I think I'll throw an alpha up on lab2, and post in the forum, including links for each collab based on @michihdeu's post here. Contributors can check out the errors in their regions and comment on what they see, including any FPs. More adventurous souls can filter for the TMW3 flavor and do the same.


One final note: Eagle-eyed C++ hackers will see that, in addition to stopping the search of the label (for (others =...) at an underscore, we also stop at a slash. This keeps out a few labels like US1AltTrk/841AltTrk, US31AltBus/431Bus, N17BypSom/N17BypAge etc. that while not the prettiest, are still acceptable. It also keeps out a lot of labels that definitely have too many words, but at that point a separate datacheck targeting https://travelmapping.net/devel/manual/wayptlabels.php#dropnamed would be a better, clearer option.

jteresco commented 1 year ago

Even if this doesn't end up as a regular datacheck, the experiment should call some attention to some old labels that should be cleaned up. I think for new systems, the peer review process has probably caught most of these before systems went active, and will continue to do so for preview/devel/future systems.

yakra commented 1 year ago

Even if this doesn't end up as a regular datacheck, the experiment should call some attention to some old labels that should be cleaned up.

To that end, I've thought about temporarily taking out the && *c != '/' loop break, as a quick-n-dirty way to bring some slashed named waypoint labels to our attention while there's still no proper datacheck for that.
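For anyone who wants to try the effect of the slash break on individual labels, here is a standalone sketch of the counting loop posted earlier in this thread. The free function too_many_words_code, its stop_at_slash parameter, and the string return value are assumptions made for experimentation; the real check is a Waypoint method that calls Datacheck::add. The guards for empty labels are also additions.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical standalone port of the counting loop. Returns "TMW4", "TMW3",
// or an empty string when nothing would be flagged.
std::string too_many_words_code(const std::string& label, bool stop_at_slash = true)
{   auto lower = [](char ch) { return std::islower(static_cast<unsigned char>(ch)) != 0; };
    auto digit = [](char ch) { return std::isdigit(static_cast<unsigned char>(ch)) != 0; };
    std::size_t lowruns = 0;    // runs of lowercase letters
    std::size_t others;         // capitals, numeral runs, other characters
    const char* c = label.c_str();
    if (*c == '*') ++c;                                      // skip a leading '*'
    if (*c == 'T' && c[1] == 'o' && !lower(c[2])) c += 2;    // skip a leading "To"
    if (!*c) return "";                                      // added guard: nothing left to scan
    for (others = !lower(*c++);
         *c && *c != '_' && *c != '(' && (!stop_at_slash || *c != '/');
         ++c)
    {   if (lower(*c))      lowruns += !lower(c[-1]);
        else if (digit(*c)) others  += !digit(c[-1]);
        else ++others;
        if (lowruns == 4) return "TMW4";
    }
    if (lowruns == 3 && others > 3) return "TMW3";
    return "";
}
```

With the slash break in place, US1AltTrk/841AltTrk stops at 2 lowercase runs and is not flagged. Passing false for stop_at_slash lets the scan continue past the slash, where it reaches a 4th lowercase run and returns TMW4.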

yakra commented 1 year ago

Looking at how the McBug handles a couple individual labels before fixing it.


label: McMMcMRd

L_______
McMMcMRd
_^______
lowruns: 1, words: 1
for loop begins @ index 1.

L_______
McMMcMRd
__^>____
lowruns: 1, words: 1
Iteration 2
McM detected; c incremented per conditional.

L_______
McMMcMRd
____^___
lowruns: 2, words: 1
Iteration 3
c incremented again per for loop.
Word count is now off due to skipping preceding M.

<____L__
McMMcMRd
_____^__
lowruns: 2, words: 2
Iteration 4
McM not detected because last is still @ beginning.
Word count catches up to where it should be.

_____<L_
McMMcMRd
______^_
lowruns: 2, words: 3
Iteration 5
MRd not detected because c[2] is null terminator, not [A-Z0-9].
Something to fix? Not a big deal in English, but what about French?

______L_
McMMcMRd
_______^
lowruns: 3, words: 3
Final iteration (6); loop ends.

This is of course a clear FP that should be excluded. 3 words, 3 lowercase runs. We got the right result for the wrong reasons.


label: LakeCStRPRd

L__________
LakeCStRPRd
_^_________
lowruns: 1, words: 1
for loop begins @ index 1.

<___L______
LakeCStRPRd
____^______
lowruns: 1, words: 2
Iteration 4

____L______
LakeCStRPRd
_____^_>___
lowruns: 1, words: 2
Iteration 5
CSt detected; c += 2 per conditional.
lowruns not incremented. Oops.

____<___L__
LakeCStRPRd
________^__
lowruns: 1, words: 3
Iteration 6
c incremented again per for loop.
Word count is now off due to skipping preceding R.

________<L_
LakeCStRPRd
_________^_
lowruns: 1, words: 4
Iteration 7
PRd not detected because c[2] is null terminator, not [A-Z0-9].
Something to fix? Not a big deal in English, but what about French?
As a result, word count catches up to where it should be.

_________L_
LakeCStRPRd
__________^
lowruns: 2, words: 4
Final iteration (8); loop ends.

A true error. With the DC behaving as intended, 4 "words" will be detected: Lake, CSt, R & PRd. Those 3-letter ones are really 2 words smushed together, D'Escousse -> DEs style. Even with the bug fixed, the DC will still be imperfect. The word count is, again, right for the wrong reasons. This error is improperly excluded because lowruns failed to reach 3.


michihdeu commented 1 year ago

Will there be any diffs from labels starting lowercase?

Does it matter? Labels should not start with lowercase: #488

yakra commented 1 year ago

It doesn't matter, really. Not for the datacheck. It's just a question to satisfy my own curiosity during development. Having this on my mind, I did some grepping, and came up with the same stuff as in #488 plus 2 new ones:

FRA-BRE/frabred35/frabre.d006435.wpt:rueEgl http://www.openstreetmap.org/?lat=48.633568&lon=-2.107369
FRA-OCC/fraoccd31/fraocc.d005731lab.wpt:rueJeanIng http://www.openstreetmap.org/?lat=43.517420&lon=1.502059

I'd forgotten that issue existed; thanks for bringing it back to my attention.
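For reference, the kind of check the grepping amounts to can be sketched in C++ as below. This is only an assumption about how one might test a line, not the command actually used: label_starts_lowercase is a hypothetical helper, and it assumes each .wpt line begins with the primary label, optionally preceded by '+' or '*'.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true if the first label on a .wpt line starts with a
// lowercase letter. Assumes the line begins with the primary label.
bool label_starts_lowercase(const std::string& wpt_line)
{   std::size_t i = 0;
    // skip any leading '+' or '*' markers
    while (i < wpt_line.size() && (wpt_line[i] == '+' || wpt_line[i] == '*')) ++i;
    return i < wpt_line.size()
        && std::islower(static_cast<unsigned char>(wpt_line[i])) != 0;
}
```

Both lines quoted above (rueEgl and rueJeanIng) would match; a label such as LakeOntPkwySpr would not.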

yakra commented 1 year ago

Dropping this here for reference purposes, "saving my place" if you will. It's the 1st attempt at "McDonald/D'Escousse detection", with the bug (as explored above) in place, before implementing a fix.

void Waypoint::too_many_words()
{   // too many words in label
    size_t lowruns = 0;
    size_t words;
    const char* c = label.data() + (label[0] == '*');
    if (*c == 'T' && c[1] == 'o' && !islower(c[2])) c += 2;
    const char* last = c;
    for (words = !islower(*c++); *c && *c != '_' && *c != '(' && *c != '/'; ++c)
    {   if  (islower(*c))   lowruns += !islower(c[-1]);
        else if (isdigit(*c))
             {  if (!isdigit(c[-1]))
                {   ++words;
                    last = c;
                }
             }  //  v~~ These next 2 lines look like the bug. Don't increment c, or do set words/last/lowruns.
        else {  if  ( c == last+2 && (isupper(c[1]) || isdigit(c[1])) ) ++c;
                else if ( c == last+1 && islower(c[1]) && (isupper(c[2]) || isdigit(c[2])) ) c += 2;
                else {  ++words;
                        last = c;
                     }
             }
        if (lowruns == 4) return Datacheck::add(route, label, "", "", "TMW4", "");
    }
    if (lowruns == 3 && words > 3) Datacheck::add(route, label, "", "", "TMW3", std::to_string(words));
}
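To replay the two traces above outside the site update program, the snapshot can be ported to a free function that returns its counters. This is a sketch with the bug deliberately left in place; the WordCounts struct and the function name count_words_prefix_snapshot are hypothetical, and the empty-label guard is an addition.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

struct WordCounts { std::size_t lowruns; std::size_t words; };

// Hypothetical port of the pre-fix logic, bug included. Returns the counters
// as they stand when the scan ends (or when lowruns reaches 4).
WordCounts count_words_prefix_snapshot(const std::string& label)
{   auto lower = [](char ch) { return std::islower(static_cast<unsigned char>(ch)) != 0; };
    auto upper = [](char ch) { return std::isupper(static_cast<unsigned char>(ch)) != 0; };
    auto digit = [](char ch) { return std::isdigit(static_cast<unsigned char>(ch)) != 0; };
    WordCounts n = {0, 0};
    const char* c = label.c_str();
    if (*c == '*') ++c;                                      // skip a leading '*'
    if (*c == 'T' && c[1] == 'o' && !lower(c[2])) c += 2;    // skip a leading "To"
    if (!*c) return n;                                       // added guard: nothing to scan
    const char* last = c;
    for (n.words = !lower(*c++); *c && *c != '_' && *c != '(' && *c != '/'; ++c)
    {   if (lower(*c)) n.lowruns += !lower(c[-1]);
        else if (digit(*c))
        {   if (!digit(c[-1])) { ++n.words; last = c; }
        }
        // McS-style and DEs-style skips: c advances, but no counter is updated
        else if (c == last+2 && (upper(c[1]) || digit(c[1]))) ++c;
        else if (c == last+1 && lower(c[1]) && (upper(c[2]) || digit(c[2]))) c += 2;
        else { ++n.words; last = c; }
        if (n.lowruns == 4) break;
    }
    return n;
}
```

Running it on McMMcMRd ends with lowruns 3 and words 3, and on LakeCStRPRd with lowruns 2 and words 4, matching the final rows of the two traces.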