concurrency detection bug

TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project


concurrency detection bug #645

Closed yakra closed 1 month ago

yakra commented 2 months ago

Uncovered some concurrency problem while trying to expand the HIDDEN_JUNCTION datacheck to catch cases that'd produce a segment name mismatch error (#91, #178, #603) in datacheck mode before making it to the graph generation process.

RailwayData (eeea42f): ME DE +DIV_Por1 Por is not flagged as concurrent with ME DE Por +DIV_Por2.
Meanwhile, PA US19TrkPit I-376(69B)_S I-376(69A) is concurrent with PA US19TrkPit I-376(69A) I-376(69B)_N as expected (7a955f2).

The fact that the DIV points are hidden does not affect this.

The first thing that comes to mind is that ME DE is not concurrent with any other route, whereas PA US19TrkPit has 4 others in the mix. IIRC this fact caused wacky hijinks in my first epic concurrency debugathon back in 2018. OK yeah, that's it. Thank God, I didn't wanna do another one. https://github.com/TravelMapping/DataProcessing/blob/e280bd58d47f89cb4b64d7450dc666bfe1ad038e/siteupdate/cplusplus/tasks/concurrency_detection.cpp#L11

In this case p[1] is not colocated. It's a single point on a single route that's doing a 180° turn. Something like this would be overkill & less efficient; I should be able to move an if and bung in an else block.
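The U-turn case can be illustrated with a minimal sketch. This is not the code in concurrency_detection.cpp and not the eventual fix; `Segment`, `find_concurrencies`, and the integer location ids are hypothetical names made up for illustration. It assumes colocated points (such as +DIV_Por1 and +DIV_Por2) share one location id, and it groups segments by their unordered endpoint pair, so two segments of a single route that run out and back over the same track end up in one group even when no other route is involved.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a route segment; loc1/loc2 are ids of
// physical locations, shared by all colocated waypoints.
struct Segment {
    std::string route;
    int loc1;
    int loc2;
};

// Group segment indices by unordered endpoint pair and keep only the
// groups with more than one member, i.e. the concurrencies.
std::map<std::pair<int, int>, std::vector<std::size_t>>
find_concurrencies(const std::vector<Segment>& segs)
{
    std::map<std::pair<int, int>, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < segs.size(); ++i) {
        auto key = std::minmax(segs[i].loc1, segs[i].loc2);
        groups[{key.first, key.second}].push_back(i);
    }
    for (auto it = groups.begin(); it != groups.end();) {
        if (it->second.size() < 2) it = groups.erase(it);
        else ++it;
    }
    return groups;
}
```

With location 0 standing for the colocated DIV points and location 1 for Por, the segments (0,1) and (1,0) of the same route come back as one concurrency, with no second route needed.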

yakra commented 2 months ago

Fix is implemented. 3 new lines of code, not counting a comment and one that's just a }. Looks like these cases are quite common in RailwayData; 47 of them. Still checking my work.

yakra commented 2 months ago

checklist

[x] stats CSVs
[x] userlogs Overall mileages in systems & regions have gone down due to more multiplexes detected. Makes sense. Do spot checks of some changed stats. Do regional & overall a/p miles check out too?
- [x] bartpetat on belic in NLD: IC35
- [x] bejacob on GlaDis (Glacier Discovery) (up 0.66 via augment)
- [x] communityrailpartnerships on gbngw in ENG: LooeVlyLn
- [x] communityrailpartnerships on gbnle (ENG only): BitLn
- [x] communityrailpartnerships on gbnnr (ENG only): multiple routes
- [x] hotdogPi on Downeaster, ME & ConnectedRoute (augmented 0.68 connected, 0.69 chopped)
- [x] M3200 on FarNorLn (Far North Line) (up 1.6 via augment)
- [x] M3200 on Downeaster, ME & ConnectedRoute (augmented 0.68 connected, 0.69 chopped)
- [x] M3200 on Silver Star, FL & ConnectedRoute (up 14.15 via augment)
- [x] M3200 on GlaDis (Glacier Discovery) (up 0.66 via augment)
- [x] michih on deuhhs in DEU-HH: S1
- [x] michih on deuulrs in DEU-BW: RS21
- [x] michih on deuhalt (DEU-ST only): 5Hal
- [x] mojavenc has unchanged travels on belic in NLD. Yep, that's fine. :)
- [x] neptun on czedukt in CZE: multiple routes
- [x] neptun on czeep (CZE only): multiple routes
- [x] oscar on WinChu (Winnipeg-Churchill): MB & ConnectedRoute (up 0.74 via augment)
- [x] scenicrailbritain.log on gbngw in ENG: LooeVlyLn
- [x] scenicrailbritain.log on gbnle (ENG only): BitLn
- [x] scenicrailbritain.log on gbnnr (ENG only): GloLn
- [x] scenicrailbritain.log on gbnsr in SCT: FarNorLn
- [x] selectric on usaamtk in ME: Downeaster
[x] concurrencies.log
[ ] DB
- [x] clinched Line counts mismatch 212997 <-> 213005
- [x] overallMileageByRegion
- [x] systemMileageByRegion
- [x] clinchedOverallMileageByRegion
- [x] clinchedSystemMileageByRegion
- [x] clinchedConnectedRoutes
- [x] clinchedRoutes
- [ ] graphs
[ ] graphs should have fewer segments. Same # vertices & travelers, right?

yakra commented 2 months ago

stats CSVs

#diff size for each file not including final TOTAL row
files=$( \
for f in $(tail -n +3 d4a3_conc.l0g | head -n 34 \
           | sed -r -e "s~$e\[[0-9]+m~~" -e 's~.*_d4a3/stats/(.+) and .*~\1~'); do
  printf "%02i $f\n" $(diff <(head -n -1 _d4a3/stats/$f) <(head -n -1 _conc/stats/$f) | wc -l);
done)

#get deltas from the last lines of those without diffs, formatted to paste into the smbr spreadsheet tab
for f in $(echo "$files" | grep ^00 | cut -f2 -d' '); do
  # get column numbers
  cols=$(diff <(tail -n 1 _d4a3/stats/$f | sed -e 's~,~\n\n~g' -e 's~TOTAL~\n0~') \
              <(tail -n 1 _conc/stats/$f | sed -e 's~,~\n\n~g' -e 's~TOTAL~\n0~') \
         | egrep '([0-9]+)c\1' | sed -r 's~([0-9]+)c\1~\1/2~' | bc | tail -n +2) # skip trav total column
  for col in $cols; do
    rg=`head -n 1 _d4a3/stats/$f | cut -f$col -d,`
    d=`paste -d- <(tail -n 1 _d4a3/stats/$f | cut -f$col -d,) \
                 <(tail -n 1 _conc/stats/$f | cut -f$col -d,) | bc`
    echo $f | sed "s~-all.csv~\t$rg\t$d~"
  done
done

#do those with traveler diffs
for f in $(echo "$files" | grep -v ^00 | cut -f2 -d' '); do
  #get row numbers
  rows=$(diff _d4a3/stats/$f _conc/stats/$f \
         | egrep '([0-9]+,[0-9]+)c\1|([0-9]+)c\2' \
         | sed -e 's~c.*~~' -e 's~,~ ~')
  for row in $rows; do
    l1=`tail -n +$row _d4a3/stats/$f | head -n 1`
    l2=`tail -n +$row _conc/stats/$f | head -n 1`
    t=`echo $l1 | cut -f1 -d,`
    # get column numbers
    cols=$(diff <(echo; echo $l1 | sed -e 's~,~\n\n~g') \
                <(echo; echo $l2 | sed -e 's~,~\n\n~g') \
           | egrep '([0-9]+)c\1' | sed -r 's~([0-9]+)c\1~\1/2~' | bc)
    for col in $cols; do
      rg=`head -n 1 _d4a3/stats/$f | cut -f$col -d,`
      d=`paste -d- <(echo $l1 | cut -f$col -d,) <(echo $l2 | cut -f$col -d,) | bc`
      echo $f | sed "s~[-.].*~\t$rg\t$t\t$d~"
    done
  done
done

grep & compare...

`| grep Total.*TOTAL`: per-system & site-wide totals all match routedatastats.log

#set $table to the result of that last big shell script block, and then...
users=`echo "$table" | cut -f3 | sort | uniq`
for u in $users; do clear; echo "$table" | grep $u; pluma _{d4a3,conc}/logs/users/$u.log; read OK; done
# ...until getting to "TOTAL". This all matches userlogs. Then...
echo "$table" | grep -v Total | grep TOTAL

[x] allbyregionactivepreview: per-region totals all match routedatastats.log
[x] individual systems: system-region totals all match routedatastats.log

yakra commented 2 months ago

routedatastats.log Breakdown by region

[x] AK down 0.66
[x] AUS-QLD down 0.24 (rounding vs 0.25/0.25)
[x] AUT down 0.73
[x] BEL down 0.09
[x] BRA-RJ down 0.22
[x] CHN-TJ down 1.37 (rounding vs 1.36)
[x] CZE down 0.91 (rounding vs 0.90)
[x] DEU-BW down 1.45
[x] DEU-HB down 0.26
[x] DEU-HH down 0.16
[x] DEU-NW down 0.75 (rounding vs 0.76/0.75)
[x] DEU-ST down 0.38 (rounding vs 0.39)
[x] ENG down 3.26
[x] ESP-VC down 0.29
[x] FL down 14.14 (rounding vs 14.15)
[x] FRA-GES down 0.14
[x] FRA-IDF down 0.21
[x] GRC down 0.11
[x] IRL down 2.92 (rounding vs 2.91/2.93)
[x] ITA-PIE down 0.18 (rounding vs 0.19)
[x] MB down 0.74
[x] ME down 0.69
[x] NIR down 0.16
[x] NLD down 1.25 (rounding vs 1.26/1.25)
[x] NOR down 1.39
[x] POL down 0.61 (rounding vs 0.61/0.62)
[x] SCT down 1.6 (rounding vs 1.59)
[x] VT down 1.37 (rounding vs 1.37/1.36)
[x] WLS down 0.14
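The "rounding vs" notes are consistent with a delta taken between two figures that were each already rounded to two decimals, which can land 0.01 away from the true delta rounded once. A small sketch of that effect, under that assumption; `hundredths`, `delta_of_rounded`, and `rounded_delta` are hypothetical helper names, and the mileages in the usage note are made-up values, not taken from the logs.

```cpp
#include <cmath>

// Round a mileage to integer hundredths, as a 2-decimal log would show it.
long long hundredths(double miles)
{
    return std::llround(miles * 100.0);
}

// Delta between two already-rounded figures (what diffing two logs gives).
long long delta_of_rounded(double before, double after)
{
    return hundredths(before) - hundredths(after);
}

// Delta computed at full precision, then rounded once.
long long rounded_delta(double before, double after)
{
    return hundredths(before - after);
}
```

For made-up mileages 10.124 before and 9.876 after, `delta_of_rounded` gives 24 hundredths (10.12 - 9.88) while `rounded_delta` gives 25 (0.248 rounded), the same kind of 0.01 gap noted in the list above.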

yakra commented 2 months ago

routedatastats.log Breakdown by system

[x] ausqrt down .25
[x] autsst down .40
[x] autsw down .33
[x] belic down 1.26 (rounding vs 1.26/1.25)
- [x] NLD down 1.26 (rounding vs 1.26/1.25)
[x] belcm down .09
[x] bravltrio down .23 (rounding vs 0.22)
[x] canvia down .74
- [x] MB down .73 (rounding vs 0.74)
[x] chncrh down 1.36
- [x] CHN-TJ down 1.37 (rounding vs 1.36)
[x] czedukt down .65 (rounding vs 0.65/0.66)
- [x] CZE down .65 (rounding vs 0.65/0.66)
[x] czeep down .25
[x] deuhbs down .26
- [x] DEU-HB down .26
[x] deuhhs down .16
- [x] DEU-HH down .16
[x] deurrs down .58 (rounding vs 0.59/0.58)
[x] deuulrs down 1.45
- [x] DEU-BW down 1.45
[x] deurru down .16 (rounding vs 0.17/0.17)
[x] deuhalt down .38 (rounding vs 0.39)
[x] espave down .19 (rounding vs 0.18)
- [x] ESP-VC down .19 (rounding vs 0.18)
[x] espcmu down .11
- [x] ESP-VC down .11
[x] eurapm down .08
- [x] FRA-IDF down .09 (rounding vs 0.08)
[x] eurhr down 1.23
- [x] ENG down 0.92 (rounding vs 0.93)
- [x] NIR down 0.16
- [x] WLS down 0.14
[x] fratgv down .14
- [x] FRA-GES down .14
[x] fratra down .13
- [x] FRA-IDF down .13
[x] gbngw down .15
- [x] ENG down .15
[x] gbnle down .70
[x] gbnnr down 1.49 (rounding vs 1.48/1.49)
[x] gbnsr down 1.59
- [x] SCT down 1.59
[x] grcta down .10 (rounding vs 0.10/0.11)
[x] irlie down 2.91 (rounding vs 2.91/2.93)
[x] itatfm down .18 (rounding vs 0.18/0.19)
- [x] ITA-PIE down .18 (rounding vs 0.18/0.19)
[x] norf down 1.39
[x] polksl down .61 (rounding vs 0.62/0.61)
- [x] POL down .62 (rounding vs 0.62/0.61)
[x] usaamtk down 16.20 (rounding vs 16.21)
- [x] FL down 14.15
- [x] ME down 0.69
- [x] VT down 1.37 (rounding vs 1.37/1.36)
[x] usaarr down .66

yakra commented 2 months ago

LOL clinched table

#find new concurrencies in concurrencies.log & search for augments based on them
cl=$( \
fgrep -f \
  <(diff <(grep -v augment _d4a3/logs/concurrencies.log) \
         <(grep -v augment _conc/logs/concurrencies.log) \
    | grep '^>' | sed -r 's~^> New concurrency (\[.+\])(\[.+\]) \(2\)$~\1 based on \2\n\2 based on \1~') \
  _conc/logs/concurrencies.log \
| sed -r 's~Concurrency augment for traveler (.+) based on .+~\1~')

#convert DB lines to the same style
db=$( \
for e in $(diff <(tail -n +354277 _d4a3/TravelMapping.sql | head -n 212997 | sed s/^,// | sort) \
                <(tail -n +354277 _conc/TravelMapping.sql | head -n 213005 | sed s/^,// | sort) \
           | grep '^>' | cut -f2 -d" "); do
  t=`echo "$e" | cut -f4 -d"'"`
  id=`echo "$e" | cut -f2 -d"'"`
  s_line=`tail -n +186429 _d4a3/TravelMapping.sql | head -n 167848 | egrep "^,?\('$id'"`
  r=`echo "$s_line" | cut -f8 -d"'"`
  r=`tail -n +1494 _d4a3/TravelMapping.sql | head -n 5695 | grep "$r'"`
  r=`(echo "$r" | cut -f4 -d"'"; echo "$r" | cut -f6 -d"'") | tr '\n' ' '`
  i=`echo "$s_line" | cut -f4 -d"'"`
  w_pair=`tail -n +12887 _d4a3/TravelMapping.sql | head -n 173542 | grep -A 1 "^,\?('$i'" | tr -d '\n'`
  w=`echo "$w_pair" | cut -f4 -d"'"`
  p=`echo "$w_pair" | cut -f14 -d"'"`
  echo "$t: [$r$w $p]"
done)

#now sort & diff them
diff <(echo "$cl" | sort) <(echo "$db" | sort)

yakra commented 2 months ago

graphs=$(diff -qr _d4a3/graphs/ _conc/graphs/ | cut -f3 -d/ | sed -e 's~\.tmg.*~~' -e 's~-simple$\|-traveled$~~' | uniq)
allgraphs=$(diff -qr _d4a3/graphs/ _conc/graphs/ | cut -f3 -d/ | sed -e 's~\.tmg.*~~')

Simple graphs

vertices: all same number confirmed

for g in $graphs; do
diff <(head -n 2 _d4a3/graphs/$g-simple.tmg | tail -n 1 | cut -f1 -d' ') \
     <(head -n 2 _conc/graphs/$g-simple.tmg | tail -n 1 | cut -f1 -d' ')
done

edges (67)
- region (29): all down by number of new concurrencies in that region per concurrencies.log
```
for g in $graphs; do
printf "%4i $g\n" \
     $(paste -d- <(head -n 2 _d4a3/graphs/$g-simple.tmg | tail -n 1 | cut -f2 -d' ') \
                 <(head -n 2 _conc/graphs/$g-simple.tmg | tail -n 1 | cut -f2 -d' ') \
       | bc)
done | grep -e -region
```
- country (10): all match totals of constituent regions. Same script, but with `| grep -e -country` instead
- continent (5): all match totals of constituent regions. Same script, but with `| grep -e -continent` instead
- tm-master: down 47. Matches the total of all continents and the total of new concurrencies in concurrencies.log
- siena100-area down 1 for VT EA Rut
- multiregion (6): all match totals of constituent regions
- multisystem (15): all match totals of constituent systems
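The edge checks above boil down to comparing the counts on the second line of each TMG file, which the shell scripts read as space-separated vertex and edge counts. A minimal sketch of the same comparison; `TmgCounts`, `parse_counts`, and `edges_removed` are hypothetical names, and any fields after the first two are simply ignored.

```cpp
#include <sstream>
#include <string>

// Vertex and edge counts from the 2nd line of a TMG file.
struct TmgCounts {
    long vertices = 0;
    long edges = 0;
};

// Parse "V E" from a header line; anything after the 2nd field is ignored.
TmgCounts parse_counts(const std::string& header_line)
{
    TmgCounts c;
    std::istringstream in(header_line);
    in >> c.vertices >> c.edges;
    return c;
}

// Number of edges the new graph lost relative to the old one. For simple
// graphs this should equal the number of newly detected concurrencies.
long edges_removed(const std::string& old_header, const std::string& new_header)
{
    return parse_counts(old_header).edges - parse_counts(new_header).edges;
}
```

For example, with made-up headers "100 120" and "100 118", the vertex counts match and `edges_removed` returns 2.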

yakra commented 2 months ago

Collapsed graph vertices

Number of vertices differ in some collapsed graphs. Makes sense though. The difference is in whether or not a vertex's incident edges are collapsed around it, causing it to go from a bona fide visible vertex to one of the intermediate_points along a collapsed edge.

e=`echo -en '\e'`
for g in $graphs; do
  oldv=`head -n 2 _d4a3/graphs/$g.tmg | tail -n 1 | cut -f1 -d' '`
  newv=`head -n 2 _conc/graphs/$g.tmg | tail -n 1 | cut -f1 -d' '`
  d=`diff <(tail -n +3 _d4a3/graphs/$g.tmg | head -n $oldv) \
          <(tail -n +3 _conc/graphs/$g.tmg | head -n $newv) | grep '^[<>]'`
  if [ `echo "$d" | wc -w` -gt 0 ]; then echo -e "$g\n$d"; fi
done \
| sed -e "s~^<.*~$e[31m&$e[0m~" \
      -e "s~^>.*~$e[32m&$e[0m~"

Each line item diff falls into 1 of 2 categories:

1. One less vertex. Adjacent to a U-turn. 8 of these worldwide. Before: 2 segments not flagged as concurrent cause 2 edges & a hidden junction. Vertex unhidden. After: Segments flagged as concurrent get 1 edge. Vertex has 2 incident edges, stays hidden, gets collapsed.
2. One more vertex. Is a hidden U-turn. 2 of these, both in CZE. Before: 2 incident edges for an undetected concurrency collapse around the U-turn to 1 edge starting/ending @ same point. After: 1 incident edge. Vertex unhidden. No collapse occurs.
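Both categories follow from one rule implied by the descriptions above: a hidden vertex is collapsed into an edge's intermediate points only when it has exactly 2 incident edges. A minimal sketch of that rule; `VertexInfo` and `collapses` are hypothetical names, not taken from the siteupdate code.

```cpp
// Hypothetical summary of a vertex for the collapse decision.
struct VertexInfo {
    bool hidden;         // hidden waypoint, e.g. +DIV_Por1
    int incident_edges;  // counted after concurrent segments share one edge
};

// True if the vertex disappears into the intermediate points of a
// collapsed edge; false if it remains a visible vertex in the graph.
bool collapses(const VertexInfo& v)
{
    return v.hidden && v.incident_edges == 2;
}
```

In the first category the hidden vertex next to the U-turn drops from 3 incident edges to 2, so it starts collapsing. In the second, the hidden U-turn itself drops from 2 to 1, so it stops collapsing and becomes visible.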

yakra commented 1 month ago

Each edge line in a TMG file starts with 2 indices into the vertex array. Meaning whenever a vertex is added or removed, all the others after it have different indices, and all references to them by edge lines change. Thus TMGs aren't diffable out of the box. The solution to this is to replace the numbers on the edge lines with a unique stable identifier. What better than the vertex labels? I won't be able to load these into the HDX, but so what; I only care about diffing them.

for g in `ls | grep -v 'simple\|traveled'`; do
  echo $g
  v_count=`tail -n +2 $g | head -n 1 | cut -f1 -d' '`
  e_count=`tail -n +2 $g | head -n 1 | cut -f2 -d' '`
  head -n `expr $v_count + 2` $g > ../schmaphs/$g
  beg=`expr $v_count + 3`
  end=`expr $e_count - 1 + $beg`
  e=1
  for lnum in $(seq $beg $end); do
    line=`tail -n +$lnum $g | head -n 1`
    i1=`echo "$line" | cut -f1 -d' '`
    i2=`echo "$line" | cut -f2 -d' '`
    v1=`expr $i1 + 3`
    v2=`expr $i2 + 3`
    v1=`tail -n +$v1 $g | head -n 1 | cut -f1 -d' '`
    v2=`tail -n +$v2 $g | head -n 1 | cut -f1 -d' '`
    echo $line | sed "s~^$i1 $i2 ~$v1 $v2 ~" >> ../schmaphs/$g
    echo -en "$e/$e_count\r"
    e=`expr $e + 1`
  done
done

LOL, this took 3 hours and 8 seconds to run, and in the end didn't even work right. While waiting for it to finish, I rewrote it in C++ so the 2nd batch would go faster.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>   // std::string, std::getline
#include <utility>  // std::move
#include <vector>
using namespace std;

int main(int argc, char *argv[])
{   cout << argv[1] << endl;
    ifstream ifile(argv[1]);
    ofstream ofile(string("../schmaphs/")+argv[1]);
    vector<string> L;
    string line;
    while (getline(ifile, line)) L.push_back(move(line));
    ifile.close();
    char* endptr;
    size_t v_count = strtoul(&L[1][0], &endptr, 10);
    size_t e_count = strtoul(endptr+1, &endptr, 10);
    ofile << L[0] << '\n' << L[1] << '\n';
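    size_t ebeg = v_count + 2;  // index in L of the first edge line
    size_t eend = e_count + ebeg;  // one past the last edge line
    size_t lnum = 2;
    for (; lnum < ebeg; lnum++)  // vertex lines are copied unchanged
        ofile << L[lnum] << '\n';
    for (; lnum < eend && lnum < L.size(); lnum++)
    {   string& l = L[lnum];
        cout << lnum - ebeg + 1 << '/' << e_count << '\r';  // progress: edge number / total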
        char *p1, *p2;
        size_t i1 = strtoul(&l[0], &p1, 10) + 2;
        size_t i2 = strtoul(p1+1 , &p2, 10) + 2;
        ofile << L[i1].substr(0, L[i1].find(' ')) << ' ';
        ofile << L[i2].substr(0, L[i2].find(' ')) << p2 << '\n';
    }
}

This took 4.6 seconds. LOL.

yakra commented 1 month ago

Collapsed graph edges

Looking at the diffs between graph edges shows that diffs don't just occur one-by-one. They come in clusters, in one of three patterns. These patterns can be mixed & matched in a graph, and occur however many times.

Scenario 1 above, when the graph loses a vertex.

< Vertex adjacent to the U-turn point goes from incorrectly being considered a hidden junction to getting collapsed
< Simple edge from the U-turn point to the deleted adjacent point
< Simple edge connecting the same points in the opposite direction
< Edge connecting the deleted point to some other point farther down the line
> Collapsed edge connecting the U-turn to the faraway point

Happens when the U-turn vertex is visible & the adjacent vertex is collapsible.

Scenario 2 above, when the graph gains a vertex.

> U-turn vertex goes from being incorrectly collapsed to visible per having 1 incident edge
> Simple edge from the U-turn to the adjacent point
< Collapsed edge connecting the adjacent point to itself via the U-turn as an intermediate

Happens when the U-turn vertex is hidden & the adjacent vertex is visible or a junction.

Most commonly, a newly detected concurrency won't add or remove any vertices.

< Simple edge from the U-turn point to the adjacent point
< Simple edge connecting the same points in the opposite direction
> A new simple edge from the U-turn to the adjacent point again. Only this time, label has a doubled route name, e.g. "DE,DE"

Happens when the U-turn & adjacent vertices are both visible.
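The doubled name in this third pattern is what you would expect if an edge label is built by joining the route name of every segment sharing the edge, with the U-turn route contributing one segment in each direction. A minimal sketch under that assumption; `edge_label` is a hypothetical helper, not the graph generator's actual code.

```cpp
#include <string>
#include <vector>

// Join the route names of all segments sharing one edge into a
// comma-separated label.
std::string edge_label(const std::vector<std::string>& routes)
{
    std::string label;
    for (const std::string& r : routes) {
        if (!label.empty()) label += ',';
        label += r;
    }
    return label;
}
```

Two unflagged segments of DE give two edges labeled "DE" each; once they are flagged as concurrent, the single shared edge gets `edge_label({"DE", "DE"})`, i.e. "DE,DE".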

A 4th possibility, which I didn't observe in the wild and had to create "in the lab": 1 vertex added & 1 removed, for the same total.

< Shaping point adjacent to the U-turn point goes from incorrectly being considered a hidden junction to getting collapsed
> U-turn vertex goes from being incorrectly collapsed to visible per having 1 incident edge
< Edge connecting the deleted point to some other point farther down the line
< Collapsed edge connecting the adjacent point to itself via the U-turn as an intermediate
> Collapsed edge connecting the U-turn to the faraway point

Happens when the U-turn & adjacent vertices are both hidden.

yakra commented 1 month ago

Traveled graphs

If collapsed graphs make sense, traveled graphs will be fine too. The decision whether or not to collapse has nothing to do with the new concurrencies at this point; no new info to be gained. Nonetheless, this...

trav=$( \
for g in $graphs; do \
  d=$(diff <(tail -n +2 _conc/graphs/$g.tmg | head -n 1) \
           <(tail -n +2 _conc/graphs/$g-traveled.tmg | head -n 1 | sed -r 's~(.+ .+) .*~\1~') \
      | grep '^[<>]')
  if [ `echo "$d" | wc -w` -gt 0 ]; then echo "$g"; fi; done)

e=`echo -en '\e'`
for g in $trav; do
  cv=$(tail -n +2 _conc/graphs/$g.tmg | head -n 1 | cut -f1 -d' ')
  tv=$(tail -n +2 _conc/graphs/$g-traveled.tmg | head -n 1 | cut -f1 -d' ')
  echo "$e[1m$g$e[0m"
  d=$(diff <(tail -n +3 _conc/graphs/$g.tmg | head -n $cv) \
           <(tail -n +3 _conc/graphs/$g-traveled.tmg | head -n $tv) | grep '^[<>]')
  for l in $(echo "$d" | tr ' ' '%'); do
    echo "$l" | tr '%' ' '
    fgrep -ir -f <(echo "$l" | cut -f2 -d@ | cut -f1 -d% | sed 's~^+~~') ~/tm/UserData/rlist_files
  done
done

...gives strong (though not foolproof) evidence that all uncollapsed vertices are used in .lists. Only 2 points gave no results, and I tracked those down manually:

- Oce/OttQue@+DIV_Cha_W: in use as +DIV_Cha on the other route, the one that didn't get the pick for the hidden vertex label
- AirTrn@+SKIP_EmpOnly: used via an AltRouteName, TA IIRC