TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project
4 stars 6 forks source link

concurrency detection bug #645

Open yakra opened 3 days ago

yakra commented 3 days ago

Uncovered some concurrency problem while trying to expand the HIDDEN_JUNCTION datacheck to catch cases that'd produce a segment name mismatch error (#91, #178, #603) in datacheck mode before making it to the graph generation process.

The fact that the DIV points are hidden does not affect this.

The first thing that comes to mind is that ME DE is not concurrent with any other route, whereas PA US19TrkPit has 4 others in the mix. IIRC this fact caused wacky hijinks in my first epic concurrency debugathon back in 2018. OK yeah, that's it. Thank God, I didn't wanna do another one. https://github.com/TravelMapping/DataProcessing/blob/e280bd58d47f89cb4b64d7450dc666bfe1ad038e/siteupdate/cplusplus/tasks/concurrency_detection.cpp#L11

In this case p[1] is not colocated. It's a single point on a single route that's doing a 180° turn. Something like this would be overkill & less efficient; I should be able to move an if and bung in an else block.

yakra commented 3 days ago

Fix is implemented. 3 new lines of code, not counting a comment and one that's just a }. Looks like these cases are quite common in RailwayData; 47 of them. Still checking my work.

yakra commented 3 days ago

checklist

yakra commented 3 days ago

stats CSVs

#diff size for each file not including final TOTAL row
files=$( \
for f in $(tail -n +3 d4a3_conc.l0g | head -n 34 \
           | sed -r -e "s~$e\[[0-9]+m~~" -e 's~.*_d4a3/stats/(.+) and .*~\1~'); do
  printf "%02i $f\n" $(diff <(head -n -1 _d4a3/stats/$f) <(head -n -1 _conc/stats/$f) | wc -l);
done)

#get deltas from the last lines of those without diffs, formatted to paste into the smbr spreadsheet tab
for f in $(echo "$files" | grep ^00 | cut -f2 -d' '); do
  # get column numbers
  cols=$(diff <(tail -n 1 _d4a3/stats/$f | sed -e 's~,~\n\n~g' -e 's~TOTAL~\n0~') \
              <(tail -n 1 _conc/stats/$f | sed -e 's~,~\n\n~g' -e 's~TOTAL~\n0~') \
         | egrep '([0-9]+)c\1' | sed -r 's~([0-9]+)c\1~\1/2~' | bc | tail -n +2) # skip trav total column
  for col in $cols; do
    rg=`head -n 1 _d4a3/stats/$f | cut -f$col -d,`
    d=`paste -d- <(tail -n 1 _d4a3/stats/$f | cut -f$col -d,) \
                 <(tail -n 1 _conc/stats/$f | cut -f$col -d,) | bc`
    echo $f | sed "s~-all.csv~\t$rg\t$d~"
  done
done

#do those with traveler diffs
for f in $(echo "$files" | grep -v ^00 | cut -f2 -d' '); do
  #get row numbers
  rows=$(diff _d4a3/stats/$f _conc/stats/$f \
         | egrep '([0-9]+,[0-9]+)c\1|([0-9]+)c\2' \
         | sed -e 's~c.*~~' -e 's~,~ ~')
  for row in $rows; do
    l1=`tail -n +$row _d4a3/stats/$f | head -n 1`
    l2=`tail -n +$row _conc/stats/$f | head -n 1`
    t=`echo $l1 | cut -f1 -d,`
    # get column numbers
    cols=$(diff <(echo; echo $l1 | sed -e 's~,~\n\n~g') \
                <(echo; echo $l2 | sed -e 's~,~\n\n~g') \
           | egrep '([0-9]+)c\1' | sed -r 's~([0-9]+)c\1~\1/2~' | bc)
    for col in $cols; do
      rg=`head -n 1 _d4a3/stats/$f | cut -f$col -d,`
      d=`paste -d- <(echo $l1 | cut -f$col -d,) <(echo $l2 | cut -f$col -d,) | bc`
      echo $f | sed "s~[-.].*~\t$rg\t$t\t$d~"
    done
  done
done

grep & compare...

yakra commented 3 days ago

routedatastats.log Breakdown by region

yakra commented 3 days ago

routedatastats.log Breakdown by system

yakra commented 2 days ago

LOL clinched table

#find new concurrencies in concurrencies.log & & search for augments based on them
cl=$( \
fgrep -f \
  <(diff <(grep -v augment _d4a3/logs/concurrencies.log) \
         <(grep -v augment _conc/logs/concurrencies.log) \
    | grep '^>' | sed -r 's~^> New concurrency (\[.+\])(\[.+\]) \(2\)$~\1 based on \2\n\2 based on \1~') \
  _conc/logs/concurrencies.log \
| sed -r 's~Concurrency augment for traveler (.+) based on .+~\1~')

#convert DB lines to the same style
db=$( \
for e in $(diff <(tail -n +354277 _d4a3/TravelMapping.sql | head -n 212997 | sed s/^,// | sort) \
                <(tail -n +354277 _conc/TravelMapping.sql | head -n 213005 | sed s/^,// | sort) \
           | grep '^>' | cut -f2 -d" "); do
  t=`echo "$e" | cut -f4 -d"'"`
  id=`echo "$e" | cut -f2 -d"'"`
  s_line=`tail -n +186429 _d4a3/TravelMapping.sql | head -n 167848 | egrep "^,?\('$id'"`
  r=`echo "$s_line" | cut -f8 -d"'"`
  r=`tail -n +1494 _d4a3/TravelMapping.sql | head -n 5695 | grep "$r'"`
  r=`(echo "$r" | cut -f4 -d"'"; echo "$r" | cut -f6 -d"'") | tr '\n' ' '`
  i=`echo "$s_line" | cut -f4 -d"'"`
  w_pair=`tail -n +12887 _d4a3/TravelMapping.sql | head -n 173542 | grep -A 1 "^,\?('$i'" | tr -d '\n'`
  w=`echo "$w_pair" | cut -f4 -d"'"`
  p=`echo "$w_pair" | cut -f14 -d"'"`
  echo "$t: [$r$w $p]"
done)

#now sort & diff them
diff <(echo "$cl" | sort) <(echo "$db" | sort)
yakra commented 1 day ago
graphs=$(diff -qr _d4a3/graphs/ _conc/graphs/ | cut -f3 -d/ | sed -e 's~\.tmg.*~~' -e 's~-simple$\|-traveled$~~' | uniq)
allgraphs=$(diff -qr _d4a3/graphs/ _conc/graphs/ | cut -f3 -d/ | sed -e 's~\.tmg.*~~')

Simple graphs

yakra commented 1 day ago

Collapsed graph vertices

Number of vertices differ in some collapsed graphs. Makes sense though. The difference is in whether or not a vertex's incident edges are collapsed around it, causing it to go from a bona fide visible vertex to one of the intermediate_points along a collapsed edge.

e=`echo -en '\e'`
for g in $graphs; do
  oldv=`head -n 2 _d4a3/graphs/$g.tmg | tail -n 1 | cut -f1 -d' '`
  newv=`head -n 2 _conc/graphs/$g.tmg | tail -n 1 | cut -f1 -d' '`
  d=`diff <(tail -n +3 _d4a3/graphs/$g.tmg | head -n $oldv) \
          <(tail -n +3 _conc/graphs/$g.tmg | head -n $newv) | grep '^[<>]'`
  if [ `echo "$d" | wc -w` -gt 0 ]; then echo -e "$g\n$d"; fi
done \
| sed -e "s~^<.*~$e[31m&$e[0m~" \
      -e "s~^>.*~$e[32m&$e[0m~"

Each line item diff falls into 1 of 2 categories:

  1. One less vertex. Adjacent to a U-turn. 8 of these worldwide. Before: 2 segments not flagged as concurrent cause 2 edges & a hidden junction. Vertex unhidden. After: Segments flagged as concurrent get 1 edge. Vertex has 2 incident edges, stays hidden, gets collapsed.
  2. One more vertex. Is a hidden U-turn. 2 of these, both in CZE. Before: 2 incident edges for an undetected concurrency collapse around the U-turn to 1 edge starting/ending @ same point. After: 1 incident edge. Vertex unhidden. No collapse occurs.
yakra commented 19 hours ago

Each edge line in a TMG file starts with 2 indices into the vertex array. Meaning whenever a vertex is added or removed, all the others after it have different indices, and all references to them by edge lines change. Thus TMGs aren't diffable out of the box. The solution to this is to replace the numbers on the edge lines with a unique stable identifier. What better then the vertex labels? I won't be able to load these into the HDX, but so what; I only care about diffing them.

for g in `ls | grep -v 'simple\|traveled'`; do
  echo $g
  v_count=`tail -n +2 $g | head -n 1 | cut -f1 -d' '`
  e_count=`tail -n +2 $g | head -n 1 | cut -f2 -d' '`
  head -n `expr $v_count + 2` $g > ../schmaphs/$g
  beg=`expr $v_count + 3`
  end=`expr $e_count - 1 + $beg`
  e=1
  for lnum in $(seq $beg $end); do
    line=`tail -n +$lnum $g | head -n 1`
    i1=`echo "$line" | cut -f1 -d' '`
    i2=`echo "$line" | cut -f2 -d' '`
    v1=`expr $i1 + 3`
    v2=`expr $i2 + 3`
    v1=`tail -n +$v1 $g | head -n 1 | cut -f1 -d' '`
    v2=`tail -n +$v2 $g | head -n 1 | cut -f1 -d' '`
    echo $line | sed "s~^$i1 $i2 ~$v1 $v2 ~" >> ../schmaphs/$g
    echo -en "$e/$e_count\r"
    e=`expr $e + 1`
  done
done

LOL, this took 3 hours and 8 seconds to run, and in the end didn't even work right. While waiting for it to finish, I rewrote it in C++ so the 2nd batch would go faster.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <vector>
using namespace std;

int main(int argc, char *argv[])
{   cout << argv[1] << endl;
    ifstream ifile(argv[1]);
    ofstream ofile(string("../schmaphs/")+argv[1]);
    vector<string> L;
    string line;
    while (getline(ifile, line)) L.push_back(move(line));
    ifile.close();
    char* endptr;
    size_t v_count = strtoul(&L[1][0], &endptr, 10);
    size_t e_count = strtoul(endptr+1, &endptr, 10);
    ofile << L[0] << '\n' << L[1] << '\n';
    size_t ebeg = v_count + 2;
    size_t eend = e_count + ebeg;
    size_t lnum = 2;
    for (; lnum < ebeg; lnum++)
        ofile << L[lnum] << '\n';
    for (; lnum < L.size(); lnum++)
    {   string& l = L[lnum];
        cout << &l - &L[0] << '/' << e_count << '\r';
        char *p1, *p2;
        size_t i1 = strtoul(&l[0], &p1, 10) + 2;
        size_t i2 = strtoul(p1+1 , &p2, 10) + 2;
        ofile << L[i1].substr(0, L[i1].find(' ')) << ' ';
        ofile << L[i2].substr(0, L[i2].find(' ')) << p2 << '\n';
    }
}

This took 4.6 seconds. LOL.

yakra commented 13 hours ago

Collapsed graph edges

Looking at the diffs between graph edges shows that diffs don't just occur one-by-one. They come in clusters, in one of three patterns. These patterns can be mixed & matched in a graph, and occur however many times.

  1. Scenario 1 above, when the graph loses a vertex.
    < Vertex adjacent to the U-turn point goes from being incorrectly being considered a hidden junction to getting collapsed
    < Simple edge from the U-turn point to the deleted adjacent point
    < Simple edge connecting the same points in the opposite direction
    < Edge connecting the deleted point to some other point farther down the line
    > Collapsed edge connecting the U-turn to the faraway point

    Happens when the U-turn vertex is visible & the adjacent vertex is collapsible.

  2. Scenario 2 above, when the graph gains a vertex.
    > U-turn vertex goes from being incorrectly collapsed to visible per having 1 incident edge
    > Simple edge from the U-turn to the adjacent point
    < Collapsed edge connecting the adjacent point to itself via the U-turn as an intermediate

    Happens when the U-turn vertex is hidden & the adjacent vertex is visible or a junction.

  3. Most commonly, a newly detected concurrency won't add or remove any vertices.
    < Simple edge from the U-turn point to the adjacent point
    < Simple edge connecting the same points in the opposite direction
    > A new simple edge from the U-turn to the adjacent point again. Only this time, label has a doubled route name, e.g. "DE,DE"

    Happens when the U-turn & adjacent vertices are both visible.

  4. A 4th possibility, I didn't observe in the wild and had to create "in the lab". 1 vertex added & 1 removed, for the same total.
    < Shaping point adjacent to the U-turn point goes from being incorrectly being considered a hidden junction to getting collapsed
    > U-turn vertex goes from being incorrectly collapsed to visible per having 1 incident edge
    < Edge connecting the deleted point to some other point farther down the line
    < Collapsed edge connecting the adjacent point to itself via the U-turn as an intermediate
    > Collapsed edge connecting the U-turn to the faraway point

    Happens when the U-turn & adjacent vertices are both hidden.

yakra commented 9 hours ago

Traveled graphs

If collapsed graphs make sense, traveled graphs will be fine too. The decision whether or not to collapse has nothing to do with the new concurrencies at this point; no new more info to be gained. Nonetheless, this...

trav=$( \
for g in $graphs; do \
  d=$(diff <(tail -n +2 _conc/graphs/$g.tmg | head -n 1) \
           <(tail -n +2 _conc/graphs/$g-traveled.tmg | head -n 1 | sed -r 's~(.+ .+) .*~\1~') \
      | grep '^[<>]')
  if [ `echo "$d" | wc -w` -gt 0 ]; then echo "$g"; fi; done)

e=`echo -en '\e'`
for g in $trav; do
  cv=$(tail -n +2 _conc/graphs/$g.tmg | head -n 1 | cut -f1 -d' ')
  tv=$(tail -n +2 _conc/graphs/$g-traveled.tmg | head -n 1 | cut -f1 -d' ')
  echo "$e[1m$g$e[0m"
  d=$(diff <(tail -n +3 _conc/graphs/$g.tmg | head -n $cv) \
           <(tail -n +3 _conc/graphs/$g-traveled.tmg | head -n $tv) | grep '^[<>]')
  for l in $(echo "$d" | tr ' ' '%'); do
    echo "$l" | tr '%' ' '
    fgrep -ir -f <(echo "$l" | cut -f2 -d@ | cut -f1 -d% | sed 's~^+~~') ~/tm/UserData/rlist_files
  done
done

...gives strong (though not foolproof) evidence that all uncollapsed vertices are used in .lists. Only 2 points gave no results, which were found manually