TravelMapping / Web

Web-facing tool/page development

Data error visualization (new definition of critical errors) #390

Open michihdeu opened 4 years ago

michihdeu commented 4 years ago

http://travelmapping.net/devel/datacheck.php?show=DUPLICATE_LABEL

I think that DUPLICATE LABEL errors are the most critical data errors because users cannot use these labels or their stats are falsified.

Most of the DL errors are hidden wps. They could be used by experts but not by normal users. And we should make life as easy as possible for normal users.

I've reported the relevant open DL issues today, and there are really just 3 wps.

The problem is that there is a huge number of data errors, and to see the most important ones, we use the red font color. But the mass of hidden wps is also in red and I don't wanna trigger our hwy data managers for these minor issues....

Suggestion: Can we show DL errors starting with "X plus numeral" in black instead of red to make the real red errors more visible?

https://github.com/TravelMapping/DataProcessing/issues/275 https://github.com/TravelMapping/DataProcessing/issues/278

Maybe it's BS and just the wrong direction to deal with this issue but... feel free to find a better way and... feel free to close this issue...

jteresco commented 4 years ago

Ideas: http://forum.travelmapping.net/index.php?topic=3448.0

yakra commented 4 years ago

> Most of the DL errors are hidden wps. They could be used by experts but not by normal users. And we should make life as easy as possible for normal users.

Having trouble following you; what do you mean by "normal users"? When I hear the phrase, I think, regular users of the site; travelers. But this post is about streamlining the datacheck page, which is used by contributors -- people who are by definition the "experts" who can find & use hidden WPs. Otherwise, I don't see how any of this would benefit regular travelers, not site contributors. That quibble aside, though...

> I've reported the relevant open DL issues today, and there are really just 3 wps.

Huh? When I filter out "X plus numeral" cases I see 15:

[yakra@noreaster /home/www/tm/logs]$ cat datacheck.log | grep DUPLICATE_LABEL | grep -v '^.*;[Xx][0-9].*;;;DUPLICATE_LABEL;$'
cod.n002;kan;;;DUPLICATE_LABEL;
cod.n004;ben;;;DUPLICATE_LABEL;
cod.n006gem;bod;;;DUPLICATE_LABEL;
cod.n027;s436;;;DUPLICATE_LABEL;
cod.tah8;ben;;;DUPLICATE_LABEL;
egy.m075;h21;;;DUPLICATE_LABEL;
egy.tah1;h01_e;;;DUPLICATE_LABEL;
gab.tah10;kan;;;DUPLICATE_LABEL;
irq.m005;h8;;;DUPLICATE_LABEL;
mar.tah1;n8;;;DUPLICATE_LABEL;
nga.e001;a1;;;DUPLICATE_LABEL;
nga.e001;a1;;;DUPLICATE_LABEL;
nga.e001;a1;;;DUPLICATE_LABEL;
nga.e001;a1;;;DUPLICATE_LABEL;
nga.e001;a1;;;DUPLICATE_LABEL;
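
The same filter can be sketched in Python. This is only a rough equivalent of the grep pipeline above, assuming each datacheck.log line has six semicolon-separated fields with the label in the second field and the error code in the fifth, as in the output shown. The function name and the `abc.rt1` sample line are made up for illustration.

```python
import re

# Assumed pattern for an auto-generated hidden label: "X" or "x" followed by
# a digit, mirroring the [Xx][0-9] part of the grep expression.
HIDDEN_X_LABEL = re.compile(r"[Xx][0-9]")

def visible_duplicate_labels(lines):
    """Return DUPLICATE_LABEL entries whose label is not 'X plus numeral'."""
    kept = []
    for line in lines:
        fields = line.strip().split(";")
        # Skip malformed lines and every other error code.
        if len(fields) < 5 or fields[4] != "DUPLICATE_LABEL":
            continue
        # Skip labels that look like X123456.
        if HIDDEN_X_LABEL.match(fields[1]):
            continue
        kept.append(line.strip())
    return kept

sample = [
    "cod.n002;kan;;;DUPLICATE_LABEL;",
    "abc.rt1;x123456;;;DUPLICATE_LABEL;",  # made-up hidden-point entry
    "or.or039kla;OR39;;;LABEL_SELFREF;",
]
print(visible_duplicate_labels(sample))
```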

> The problem is that there is a huge number of data errors, and to see the most important ones, we use the red font color. But the mass of hidden wps is also in red and I don't wanna trigger our hwy data managers for these minor issues....

Well, do we really use red to denote the most important ones? While datacheck.php does say "Errors shown in red should be fixed as soon as possible", in fact everything is red, with just 3 exceptions: https://github.com/TravelMapping/Web/blob/93b0b86d5d4cf770db719adeb398f70a039a8655/devel/datacheck.php#L60-L63 Thus I consider it more of a "the least important items are in black" scenario. These are cases that are likely to not have a fix available, and ultimately just get crossed off as FP as the system moves to active status.

> Suggestion: Can we show DL errors starting with "X plus numeral" in black instead of red to make the real red errors more visible?

I don't necessarily think this is a bad idea, I just want to make sure something like this would be well thought through. What about all the other red errors - Items that should be fixed, but don't cause .list parsing errors or errors in people's stats? Such as label errors?

I think what the issue boils down to is that for some of us, such as @michihdeu & myself, addressing datacheck errors is a high priority. We strive to keep our regions & routes free of them, and take care of them quickly after they show up. For others, addressing datacheck errors is simply not a priority. And at the end of the day, no matter how much prodding we provide or how much we try to streamline the process, it won't change that. The majority of the DUPLICATE_LABEL cases date to 2016... :(

https://github.com/TravelMapping/DataProcessing/issues/275

The only real relation to this issue is...

https://github.com/TravelMapping/DataProcessing/issues/278

That one's a proposal for a siteupdate speed increase when processing traveler list files. The only real relation to this issue is via a reference to #275. :P ...Or did you mean to reference https://github.com/TravelMapping/DataProcessing/issues/272#issuecomment-566727336, for its discussion on how duplicate labels get mangled by the .list line parser?

> Maybe it's BS and just the wrong direction to deal with this issue but... feel free to find a better way and... feel free to close this issue...

I'm not against your proposal per se... though skeptical. I think what this really comes down to is that some managers just don't care too much about cleaning out their datacheck items. Which we've complained about before. :)

Barring a surefire way to change that, the rest of us can meanwhile improve the signal-to-noise ratio for ourselves by using various tools at our disposal, for example

michihdeu commented 4 years ago

> Having trouble following you; what do you mean by "normal users"?

Travelers who do not use wpt editor (or HDX) where you see +X wps.

I wanna say that +X wps are only "known" by hwy data managers (or "wpt editor users"). They are aware of +X labels and they could use them in their list files. Normal users are not.

You mentioned on the forum that it is sometimes necessary to use X492394 labels in list files. I don't agree. If there is a point request for route A1 but we have no name for it, we could just name it A1_A, the next A1_B etc. But if it is a point in use (the point, not the name), it should be visible in HB. We might also introduce something like Y123456 to distinguish hidden and unnamed labels but I don't think that we really need or want it.

How many (and which) hidden wps are currently in use by travelers? I could make suggestions for visible wp labels.

> > I've reported the relevant open DL issues today, and there are really just 3 wps.
>
> Huh? When I filter out "X plus numeral" cases I see 15:

Sorry, I meant 3 wps in active and preview systems. Do you really care about data errors for devel systems? There is such a mess.... It's not worth looking at or caring about, since it's in devel by definition.

> Thus I consider it more of a "the least important items are in black" scenario.

wording... DL errors for hidden wps are "least important items" to me because they are totally irrelevant - with the exception mentioned above where they are really used in list files and backend usage for marking other data errors FP.

> What about all the other red errors - Items that should be fixed, but don't cause .list parsing errors or errors in people's stats? Such as label errors?

I think that errors which are relevant to normal users (travelers who use HB only but not wpt editor nor HDX) should be red. Everything which can cause a broken list file entry, falsify stats or complicate navigating through the routes like NMPs. When label names change, they are relevant but since we have alt labels - yes, they are also less important.

> I think what the issue boils down to is that for some of us, such as @michihdeu & myself, addressing datacheck errors is a high priority. We strive to keep our regions & routes free of them, and take care of them quickly after they show up. For others, addressing datacheck errors is simply not a priority.

Exactly! Because I don't wanna bother normal users.

I guess that due to the mass of data errors, some hwy data managers just think it's unimportant because there are also so many other errors....

When I open datacheck I'd like to see an empty list for active systems. If there are one or two errors, I could remember how long they have been there and trigger the hwy data manager on the forum and ask for a fix. How many years are the OR errors there?

It would be ridiculous to open a thread "DL error +X01" please rename it to +X99 or whatever.... and getting a long discussion that it's just a concurrent segment and it's called +X01 on the other routes.... no....

> And at the end of the day, no matter how much prodding we provide or how much we try to streamline the process, it won't change that.

If there are only very few red errors for active systems, we can keep an eye on them and report them on the forum. That's the difference. Maybe we could educate them to check by themselves (but I doubt it).

Again, in the end (when the exception with usage in list files would be eliminated), hidden wps are just relevant for marking other data errors FP.

I just mentioned https://github.com/TravelMapping/DataProcessing/issues/275 and https://github.com/TravelMapping/DataProcessing/issues/278 to get the reference there because there was some similar discussion (but not that similar, never mind)

yakra commented 4 years ago

I'll try to stay on topic & avoid replying to the earlier bits. :)

> I think that errors which are relevant to normal users (travelers who use HB only but not wpt editor nor HDX) should be red. Everything which can cause a broken list file entry, falsify stats or complicate navigating through the routes like NMPs. When label names change, they are relevant but since we have alt labels - yes, they are also less important.

With the current red/black dividing line at VISIBLE_DISTANCE, LONG_SEGMENT, and SHARP_ANGLE, it seems the criterion for black is roughly "A lot of these are likely to just be marked FP once the system goes active." This does seem a more useful way to sort & show info prioritizing errors.

To break it all down by error type (thinking of "navigating" primarily as using the "Intersecting/Concurrent Routes" feature):

| error code | proposed new color | broken list file entry | falsify stats | complicate navigating | comments |
| --- | --- | --- | --- | --- | --- |
| BAD_ANGLE | Red | no | yes | yes | A subset of DUPLICATE_COORDS. |
| BUS_WITH_I | Black | no | no | no | |
| DUPLICATE_COORDS | Red | no | yes | yes | Can falsify/complicate in true positive cases. |
| DUPLICATE_LABEL | Red (visible), Black (hidden) | yes | yes | no | Hidden points are arguably less important: rare potential for use, by power users only. |
| HIDDEN_JUNCTION | Red | no | yes | yes | Broken concurrencies can falsify stats. Even FPs can potentially complicate navigating. |
| HIDDEN_TERMINUS | Red | sort of | yes | yes | Prevents getting a proper list entry for fully clinched route. |
| INVALID_FINAL_CHAR | Black | no | no | no | |
| INVALID_FIRST_CHAR | Black | no | no | no | |
| LABEL_INVALID_CHAR | Red | sort of | no | no | "Breaks" lists inasmuch as we'd wanna discourage non-ascii characters. I guess make it red & encourage fixing these ASAP before anything gets used in a .list? |
| LABEL_LOOKS_HIDDEN | Black | no | no | no | |
| LABEL_PARENS | Black | no | no | no | |
| LABEL_SELFREF | Black | no | no | no | |
| LABEL_SLASHES | Black | no | no | no | |
| LABEL_UNDERSCORES | Black | no | no | no | |
| LACKS_GENERIC | Black | no | no | no | |
| LONG_SEGMENT | Black | no | no | no | |
| LONG_UNDERSCORE | Black | no | no | no | |
| MALFORMED_LAT, MALFORMED_LON, MALFORMED_URL | Red | sort of | yes | yes | Can break lists if a waypoint is OK in an earlier version of the file, gets used in a list, and then is edited to have a malformed URL in a later version of the file. |
| NONTERMINAL_UNDERSCORE | Black | no | no | no | |
| OUT_OF_BOUNDS | Red | no | yes | yes | |
| SHARP_ANGLE | Red | no | yes | no | This is the one currently black error type that would become red. |
| US_BANNER | Black | no | no | no | |
| VISIBLE_DISTANCE | Black | no | no | no | |
| VISIBLE_HIDDEN_COLOC | Black | no | no | sort of | "One-way navigation" is possible, though arguably the desired effect when FP. |
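
For illustration, the proposed split could be written down as a small lookup. This is a sketch in Python, not the actual datacheck.php logic; the names `RED_CODES`, `BLACK_CODES` and `proposed_color` are hypothetical, and the `hidden` argument assumes the caller already knows whether the point is hidden.

```python
# Colors as proposed in the table above (not the current datacheck.php behavior).
RED_CODES = {
    "BAD_ANGLE", "DUPLICATE_COORDS", "HIDDEN_JUNCTION", "HIDDEN_TERMINUS",
    "LABEL_INVALID_CHAR", "MALFORMED_LAT", "MALFORMED_LON", "MALFORMED_URL",
    "OUT_OF_BOUNDS", "SHARP_ANGLE",
}
BLACK_CODES = {
    "BUS_WITH_I", "INVALID_FINAL_CHAR", "INVALID_FIRST_CHAR",
    "LABEL_LOOKS_HIDDEN", "LABEL_PARENS", "LABEL_SELFREF", "LABEL_SLASHES",
    "LABEL_UNDERSCORES", "LACKS_GENERIC", "LONG_SEGMENT", "LONG_UNDERSCORE",
    "NONTERMINAL_UNDERSCORE", "US_BANNER", "VISIBLE_DISTANCE",
    "VISIBLE_HIDDEN_COLOC",
}

def proposed_color(code, hidden=False):
    """Return 'red' or 'black' for an error code under the proposal."""
    if code == "DUPLICATE_LABEL":
        # The one conditional row: red if visible, black if hidden.
        return "black" if hidden else "red"
    if code in RED_CODES:
        return "red"
    if code in BLACK_CODES:
        return "black"
    # Codes missing from the table are left undecided on purpose.
    raise ValueError(f"no proposed color for {code}")

print(proposed_color("SHARP_ANGLE"))
print(proposed_color("DUPLICATE_LABEL", hidden=True))
```

Raising on unlisted codes, rather than defaulting to one color, keeps any error type that is not in the table from being silently classified.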

> How many years are the OR errors there?

What do you mean by "OR errors"?

michihdeu commented 4 years ago

> To break it all down by error type

Your proposal should be fine. What will change (I think I've missed some very specific error types currently not on datacheck.php):

We could apply the very same rules to WPT editor. We could stick with indicating the red errors in red and indicate the black errors in a different color (e.g. orange). If so, I agree with SA being red. If not, I'm not sure........

> > How many years are the OR errors there?
>
> What do you mean by "OR errors"?

http://travelmapping.net/devel/datacheck.php?rg=OR

yakra commented 4 years ago

Oh, duh. :) I won't log back in to noreaster and run that shell script again just now, but the OR errors here date to between 2017-02-04 & 2019-03-11.

2019-03-11 or.or018;+x1(OR233);;;HIDDEN_JUNCTION;3
a1799ebaa91 (Jim Teresco 2019-03-11 13:39:27 -0400 48) +x1(OR233) http://www.openstreetmap.org/?lat=45.233891&lon=-123.064964

2017-02-04 or.or019;+x8(OR208);;;HIDDEN_JUNCTION;3
13073174191 (Jim Teresco 2017-02-04 16:07:47 -0500 38) +x8(OR208) http://www.openstreetmap.org/?lat=44.809944&lon=-119.907531

2017-02-04 or.or022;+x1(OR99EBus);;;HIDDEN_JUNCTION;3
13073174191 (Jim Teresco 2017-02-04 16:07:47 -0500  47) +x1(OR99EBus) http://www.openstreetmap.org/?lat=44.940182&lon=-123.042412

2018-11-26 or.or039kla;OR39;;;LABEL_SELFREF;
e5407b1af73 hwy_data/OR/usaor/or.or039kla.wpt    (Jim Teresco 2018-11-26 21:38:55 -0500 8) OR39 http://www.openstreetmap.org/?lat=42.206508&lon=-121.736744

2019-03-11 or.or099;+x39;I-5(188A);+x33(I-5);SHARP_ANGLE;148.81
a1799ebaa91 (Jim Teresco 2019-03-11 13:39:27 -0400 218) +x39 http://www.openstreetmap.org/?lat=43.997607&lon=-123.009818

2019-03-11 or.or207;+X751840;+X592203;+X989432;SHARP_ANGLE;174.50
a1799ebaa91 (Jim Teresco 2019-03-11 13:39:27 -0400 30) +X751840 http://www.openstreetmap.org/?lat=44.948763&lon=-119.702268
yakra commented 4 years ago
michihdeu commented 4 years ago

Hmmm... SHARP_ANGLE and BAD_ANGLE... yes, should be treated the same way (red). Do we still need to distinguish them when they are both of the same category? I don't get what it means.

But the categories (or the changes) are all fine to me 👍

yakra commented 4 years ago

BAD_ANGLE is when two successive points have duplicate coords, and thus the angle can't be calculated, because division by zero.
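
A minimal sketch of why the angle is undefined in that case, in Python. It treats coordinates as plain planar (x, y) pairs and measures the turn between the two segments meeting at the middle point; both choices are assumptions for illustration, and this is not siteupdate's actual formula.

```python
import math

def turn_angle(p1, p2, p3):
    """Angle in degrees between segments p1->p2 and p2->p3, or None if undefined."""
    ax, ay = p2[0] - p1[0], p2[1] - p1[1]
    bx, by = p3[0] - p2[0], p3[1] - p2[1]
    len_a = math.hypot(ax, ay)
    len_b = math.hypot(bx, by)
    if len_a == 0 or len_b == 0:
        # Two successive points share coords: the cosine below would
        # divide by zero, so no angle can be calculated.
        return None
    cos_theta = (ax * bx + ay * by) / (len_a * len_b)
    # Clamp against rounding error before calling acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))

print(turn_angle((0, 0), (1, 0), (2, 0)))  # straight line
print(turn_angle((0, 0), (0, 0), (1, 0)))  # duplicate coords
```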

michihdeu commented 4 years ago

ok, got it. Thx!

michihdeu commented 4 years ago

I think it's just this IF instruction:

https://github.com/TravelMapping/Web/blob/master/devel_new/datacheck.php#L60

I'll have a try....

michihdeu commented 4 years ago

Partial / Conditional: DUPLICATE_LABEL red if visible, black if hidden

I don't know how to implement this since the + is already removed beforehand and X is a valid character. Is there any flag (in DB?) that a wp is hidden?

In addition, I'm not familiar with the programming language. I don't know how the table is filled or how to deal with variables.

michihdeu commented 4 years ago

The new DISCONNECTED_ROUTE data check is missing on the list.

Since the 6-field multi-region user list file entries are affected, I think it is a critical error and should be output in red. No additional change to datacheck.php required.

yakra commented 4 years ago

> Partial / Conditional: DUPLICATE_LABEL red if visible, black if hidden
>
> I don't know how to implement this since the + is already removed beforehand and X is a valid character. Is there any flag (in DB?) that a wp is hidden?

Yikes! How did I not think of this earlier?...

There is no flag in the DB to indicate a hidden point. We have to ignore leading +s when checking for duplicates of course, because .list processing ignores them. Not a big deal in & of itself; I could easily have siteupdate retain the + after making the comparison.
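
Roughly, the idea of comparing without the + while retaining it for display might look like this in Python. It is a sketch under stated assumptions, not siteupdate's code: the function names are hypothetical, only leading +s are normalized, and AltLabels are not handled.

```python
from collections import defaultdict

def duplicate_labels(labels):
    """Group labels that collide once leading '+' characters are ignored."""
    groups = defaultdict(list)
    for label in labels:
        key = label.lstrip("+")    # compare without the hidden-point marker
        groups[key].append(label)  # but keep the original spelling
    return {key: found for key, found in groups.items() if len(found) > 1}

def is_hidden(label):
    """A label is treated as hidden if it still carries its leading '+'."""
    return label.startswith("+")

dupes = duplicate_labels(["+X01", "A1", "X01", "B2", "A1"])
print(dupes)
```

Because the original spelling is kept in each group, a later step can still tell whether any of the colliding labels was hidden.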

...but AltLabels get a bit tricky...

Since the goal here is to determine whether a point is visible/hidden overall...

The part of me that likes precision bristles at this; it's a bit klugey. I can see someone seeing +NoPlusAlt listed on datacheck.php, searching for that string in the .wpt file & not finding it, and getting confused or shrugging and moving on.


A workaround?

Re LABEL_INVALID_CHAR cases, I wrote:

> What would be the more useful format for flagging these?...
>
> 1. Listing the primary label under Waypoints, with the relevant label under Info? me.us001;NH/ME;;;LABEL_INVALID_CHAR;+Foo#Bar
> 2. Or, listing the relevant label under Waypoints, with nothing under Info? me.us001;+Foo#Bar;;;LABEL_INVALID_CHAR;

@jteresco responded:

> I prefer the offending alt label in the second field. Keep it simple. ... Option 2, the one that doesn't have the primary label included at all.

...so I went with that option.

Using an option like the 1st one for DUPLICATE_LABEL cases can distinguish visible from hidden points, while avoiding the pitfall of listing a string on the datacheck page that can't be found in the .wpt file.

A side benefit is being able to look up point names (AltLabels are not in the DB) for HB links on datacheck.php.
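
A sketch of what such an entry could look like, in Python. It assumes the primary label goes in the second field with its leading + retained for hidden points, and the duplicated label goes in the last (Info) field. The function names, the `xx.rt1` route and the labels are all made up for illustration.

```python
def duplicate_label_entry(route, primary_label, duplicated_label):
    """Build a datacheck line: primary label under Waypoints, duplicate under Info."""
    return f"{route};{primary_label};;;DUPLICATE_LABEL;{duplicated_label}"

def entry_is_hidden(entry):
    """Decide visible vs. hidden from the primary label in the second field."""
    return entry.split(";")[1].startswith("+")

hidden_entry = duplicate_label_entry("xx.rt1", "+X123456", "X123456")
visible_entry = duplicate_label_entry("xx.rt1", "MainSt", "OldMainSt")
print(hidden_entry, entry_is_hidden(hidden_entry))
print(visible_entry, entry_is_hidden(visible_entry))
```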


I think this all might be too much detail though, too much of a chase for perfection. Users still can, and do, use hidden points in their travels, so hidden points should still be fixed. IMO it's wrongheaded to deprioritize them.

@michihdeu wrote:

> I wanna say that +X wps are only "known" by hwy data managers (or "wpt editor users"). They are aware of +X labels and they could use them in their list files. Normal users are not.

A few counterexamples to this argument: bogdymol.list#L1023 johninkingwood.list#L2545 rlee.list#L1682