TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project
4 stars 6 forks source link

Graph generation: how many/which users have clinched each segment #196

Closed jteresco closed 5 years ago

jteresco commented 5 years ago

This could be a high-priority project for me in the next week or so if it's not any harder than I'm thinking. I'd like my students to be able to use it very soon.

I'd like to enable more algorithms and variations on algorithms to use the graphs. One additional bit of data that could be very interesting to work with could be the number of users or better yet a set of users who have clinched the highway segment(s) represented by each edge. This could allow all kinds of new problems to be solved interesting academically, to TM project users, or hopefully both.

From a first quick look at this, it looks like it's not hard to remember the user list for each edge, as the edges are initially created from segments, segments have a list of who clinched them, and in almost all cases, a collapsed edge should have the same list of users who have clinched each original edge. So then when the graphs are being written, the extra information could be written with each edge entry. The simplest way would be to list, maybe comma-separated, the users, but that's a lot of extra text in the files. A better way might be to assign each user a number 0 to n-1. Then the clinched info in the graphs is anonymous unless you have the mapping of users to numbers. But then each number really just represents if the user with that number has clinched the segment/edge in question. One bit of information. This means that same info could be stored in a much more compact form: a hex number with n bits, so ceiling(n/4) hex digits, where each bit corresponds to whether a specific user has clinched that segment.

jteresco commented 5 years ago

I should bring in @yakra's recent site update changes first and go from that code.

jteresco commented 5 years ago

This would be a new version number in the header of TMG files, and a new graph type.

I'd only do this for collapsed graphs. I am rarely using the simple format graphs anymore.

yakra commented 5 years ago

I'd only do this for collapsed graphs. I am rarely using the simple format graphs anymore.

https://github.com/TravelMapping/DataProcessing/issues/67#issuecomment-389742060

@yakra wrote:

Collapsed graphs could be tricky: http://tm.teresco.org/hb/index.php?units=miles&u=terescoj&r=nh.us004 Zoom in on NH120_N & NH120_S

@jteresco wrote:

Good point. Each segment would have to be all or nothing.

@yakra wrote:

Not a big deal, AIUI now. A hidden vertex can become a visible one, as already done with VISIBLE_HIDDEN_COLOC or HIDDEN_JUNCTION cases.

Thankfully, cases of this happening in the real world should be relatively rare, and we won't be saddled with bloated & ugly graphs.

How do we do this efficiently? Spitballing...

yakra commented 5 years ago

One bit of information. This means that same info could be stored in a much more compact form: a hex number with n bits, so ceiling(n/4) hex digits, where each bit corresponds to whether a specific user has clinched that segment.

On the downside:


A better way might be to assign each user a number 0 to n-1. Then the clinched info in the graphs is anonymous unless you have the mapping of users to numbers.

Should be easy to include. In a traditional TMG file, our number of lines is known, 2+v+e. Add one more line after all this simply listing the corresponding user IDs in order.


It makes sense to me to include the hex string before the shaping points' coords, as there are an indeterminate number of shaping points before end of line. For example: 383842 383839 SH43 1596dee5feeef29cb1e4220eeb77425e791cf180b2d0445faeb5972236fb -39.322691 174.412524 -39.31532 174.455826 -39.294067 174.482262

yakra commented 5 years ago

OK, I've got a draft implemented in C++.

MEX-QROO-region.tmg:

TMG 2.0 collapsed
3 2
MEX180D@MEX180 21.097814 -86.975434
MEX180D@MEX307 21.017955 -86.854477
MEX180D@YUC/QROO 20.878541 -87.608242
0 2 MEX180D 000000000000000001004000000000000000200000000000020000010000 20.97242 -87.208314 20.950859 -87.293158 20.897907 -87.441688
0 1 MEX180D 000000000000000001004000000000000000000000000000000000000000
1 1995hoo 25or6to4 7_8 Aseluit Based8 Beerman ChrisZwolle DJCane JamesMD Qiviuq ThelwallViaduct Ugnaught2 a3 aaroads afarina along angelsfreeek atsiatas barefoot_driver baugh17 bejacob benjamingrove47 bhemphill bickendan bm7 bobcobb bogdymol boris bubaclex bulldog1979 cabiness42 capn chaddean charliezeb chesspi cinx cl94 clark1mt clong codyg1985 compdude787 cougar1989 crosboro7 cyhighways dantheman dave1693 dcharlie dcm55343 dctransit deathtopumpkins dfilpus dgolub dharwood dlsterner dnthrox dognm16 dough4872 dougtone drebbin37 drfrankenstein dsaneus dscarroll dsm4cy duke87 dylucktrocket eidderf electronym epzik8 es92 eth eucitizen extremebandman finnedude foresthills93 formulanone froggie georgiaroadgeek gimmicks gman2337 golubcar gpw griffith happy5214 harvey highplainstrvlr hoopitypoop hoss6884 htm_duke hurricanehink ibexto imgoph inagaddadavida intelati49 intheam iowahighways jackgaynor jayhawkco jbum86MASTER jeffm jerseyman4 jimvette jjbers johninkingwood jonfmorse jordancarrpeterson jpinyan jstalica julmac justjake jweeks jwood keithcavey kjslaughter klundgren kramer ks0stm kurzov lakewy17 lamsalfl lkefct lowenbrau mapcat mapmikey mariethefoxy markkos1992 master_son mc49399 mefailenglish meoberto mepaulrye michih mikeandkristie mirage45331 mojavenc morriswa moshitea mr_matt_k mvak36 mwasleski myselfalso navigator neilbert neroute2 new_friends_gr ngrier niels njroadhorse noelbotevera norheim nrm1221 ntallyn ohroadscholar okroads opspe oscar osu97gp osu_lsu paddy pahighways panda80 paulthemapguy peperodriguez2710 philimon philman pianocello130 pnrrth polabare presidentman quidditch33 radison ran4sh rebelgtp regoarrarr rgentry rickmastfan67 riehlj riiga rlee roadgeek99 roadgeek_adam roadguy2 robringstad roukan route56 rschen7754 sammi sbeaver44 scheidler scott_stieglitz sercamaro sfisher si404 sipes23 skatcher skoutr1 slorydn1 sneezy snowedin snowmatt8 sounderbruce spinoza squatchis ssgtcrusty superblu2007 synkdorbit tag42481 tarheel61581 tbwillie tckma terescoj terminal_moraine the_spui_ninja thefxexpert thehighwayman3561 thing342 tikester tjj25 tomhart twinsfan87 ua747sp usaidhello valedc03ls vcap36 vdeane verruckte_dan verumsolum vespertine vinoman64 voyager105 w9tam wadsteckel whopperman wphiii xcvxcvxcv xorodejon yakra yipcoyote 

To break down the hex string:

Parsing the first one above, for example:

NH-region.tmg:

TMG 2.0 collapsed
2044 2318
...
NH120/US4@+x4plex 43.64186 -72.252298        #line 1407, vertex 1404, visible
...
1295 1404 US4,NH120 200400004200400110010000000200900004028000000000200020000006
...
1404 1194 US4,NH120 200400004200400110010000000200900004028000000000200000000006

There's a diff in the character representing travelers 208-211. Twos bit stores traveler 209: terescoj.

Notes:

yakra commented 5 years ago

significant performance hit while Augmenting travelers for detected concurrent segments.

This was due to a bug in the reworked HighwaySegment::add_clinched_by. Fixed.

yakra commented 5 years ago

Grumble. Hex codes are looking more deterministic, but I can tell the final character frequently gets mangled. With 238 list files, we should only have the ones bit or twos bit set here, and thus have nothing > 3. But I'm seeing lots of 6s I'm not credited for NH NH121A HamRd FreRd, but I am credited for NH NH121A OdeRd HamRd... Mumble. Edge compression, calculating codes, assigning traveler numbers... there's a lot of moving parts here...

Buggy code at: https://github.com/yakra/DataProcessing/tree/tmg2_cpp Diffs from the current master C++: https://github.com/yakra/DataProcessing/compare/tmg2_cpp

yakra commented 5 years ago
std::string HighwaySegment::clinchedby_code(std::list<TravelerList*> *traveler_lists)
{   std::string code;
    std::vector<unsigned char> clinch_vector(ceil(traveler_lists->size()/4)*4, 0);
    for (TravelerList* t : clinched_by)
        clinch_vector[t->traveler_num] = 1;
    for (unsigned short t = 0; t < traveler_lists->size(); t += 4)
    {   unsigned char nibble = 0;
        nibble |= clinch_vector[t];
        nibble |= clinch_vector[t+1]*2;
        nibble |= clinch_vector[t+2]*4;
        nibble |= clinch_vector[t+3]*8;
        if (nibble < 10) nibble += '0';
        else nibble += 55;
        code.push_back(nibble);
    }
    return code;
}

->

std::string HighwaySegment::clinchedby_code(std::list<TravelerList*> *traveler_lists)
{   std::string code;
    std::vector<unsigned char> clinch_vector(ceil(traveler_lists->size()/4)*4, 0);
    for (TravelerList* t : clinched_by)
        clinch_vector.at(t->traveler_num) = 1;
    for (unsigned short t = 0; t < traveler_lists->size(); t += 4)
    {   unsigned char nibble = 0;
        nibble |= clinch_vector.at(t);
        nibble |= clinch_vector.at(t+1)*2;
        nibble |= clinch_vector.at(t+2)*4;
        nibble |= clinch_vector.at(t+3)*8;
        if (nibble < 10) nibble += '0';
        else nibble += 55;
        code.push_back(nibble);
    }
    return code;
}

...and now I can see stuff breaking:

[83.2] Writing master TM collapsed graph file, tm-master.tmg.
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 236) >= this->size() (which is 236)

OK, now I'm getting somewhere...

Oddly, in the same session, we have processed 238 traveler list files....

yakra commented 5 years ago

traveler_lists->size() -> double(traveler_lists->size())

jteresco commented 5 years ago

Wow, thanks @yakra for jumping on this! It would be really great to get a preliminary set of graphs, even if not perfect, generated this week so I could share the files and file format with my students who might want to work with this data for an upcoming project.

yakra commented 5 years ago

Just committed that last bugfix; graphs are looking good so far.

Next up, uploading the latest code to noreaster, compiling, and producing some graphs there.

yakra commented 5 years ago

preliminary graphs: http://yakra.teresco.org/tmtools_demos/tmg_2.0a/ source code: https://github.com/yakra/DataProcessing/tree/tmg2_cpp/siteupdate/cplusplus

jteresco commented 5 years ago

This is fantastic.

A couple quick requests if you have time to make it happen.

yakra commented 5 years ago

let's only generate the collapsed format for these

Already done; I never changed over the simple format. The new format did replace the existing collapsed format, though, Are you thinking, retain the existing collapsed format, and add the new "traveled" one as well, making three versions of each graph instead of two?

The tricky bit:

let's give the format a new name, like "traveled"

NH-region-traveled.tmg, containing TMG 2.0 traveled in the header? Do the new graphs look OK otherwise?

jteresco commented 5 years ago

Yes, I'm thinking 3 for each instead of 2. I don't want to just replace the old ones since there's a lot of code, HDX and some of my Java code for class stuff, that I don't want to have to rewrite to ignore the new information.

I'll worry later about getting the graphs download page to find the new stuff.

I hope to look at the new graphs and try some things out with them in the next day or two.

jteresco commented 5 years ago

Quick thought that you are free to ignore, @yakra. Many of our graphs will have lots of segments with very few travelers, meaning many 0 digits. I wonder about an alternate "sparse" format for the codes in those cases, where we instead have a list of actual traveler numbers, maybe again in hex to save a few characters. Plus side: save a little space. Negative side: complication of our format and the need to calculate both possibilities on graph generation and the ability to decode both possibilities on any code that loads in the graphs.

yakra commented 5 years ago
  • We'd need two sets of collapsed edges, one collapsed around these vertices and one not.
    • This would take a bit more brainstorming, but is not impossible.

On the plus side, I recently upgraded BiggaTomato to 8 GB, so I've less to worry about if we're introducing more data.

But, on to more brainstorming... A HighwayGraphEdgeInfo and a HighwayGraphCollapsedEdgeInfo are largely the same data structure, just that a HighwayGraphCollapsedEdgeInfo has the added intermediate_points list. Which can of course be empty. Indeed, a collapsed edge with no intermediate points is written the same as its corresponding simple edge once it makes it into the .tmg.

An idea to save RAM, that could also simplify (?) things if we start to get into 3+ graph data sets: Get rid of the simple edge class, and just use collapsed edge objects.

Yes, I'm thinking 3 for each instead of 2. I don't want to just replace the old ones since there's a lot of code, HDX and some of my Java code for class stuff, that I don't want to have to rewrite to ignore the new information.

I left the original collapsed_tmg_line function in; it's just unused at the moment. Just have to sort out how to work with the 3 datasets.

Quick thought that you are free to ignore, @yakra. Many of our graphs will have lots of segments with very few travelers, meaning many 0 digits. I wonder about an alternate "sparse" format for the codes in those cases,

One thought I had was run-length encoding for zeros. Follow a 0 by a character (0-F?) encoding how many (additional) zeros there are.

where we instead have a list of actual traveler numbers, maybe again in hex to save a few characters. Plus side: save a little space. Negative side: complication of our format and the need to calculate both possibilities on graph generation and the ability to decode both possibilities on any code that loads in the graphs.

Thinking about this some more, I bet the RLE scheme might save more space, and remove the need to determine which encoding scheme to read/write.

jteresco commented 5 years ago

For the moment, it would be really great and get me through the initial usage this semester if you could generate the set again but with "traveled" instead of "collapsed" in both the filenames and the first line of each file. Then we can think about how best to proceed longer-term.

jteresco commented 5 years ago

Also, don't feel any obligation to generate new graphs based on my previous comment. The ones there can easily be used by my students. I might just rename them and change the one word at the top and go with them.

I can't wait to have time to enhance HDX to support the new format and experiment with ways to visualize this info.

yakra commented 5 years ago

run-length 0 encoding

First draft: http://yakra.teresco.org/tmtools_demos/tmg_2.0a/ (same link as above) RLE encoded: http://yakra.teresco.org/tmtools_demos/tmg_2.0a_rle/

When a 0 is encountered in a clinchedby_code string, the following character encodes the number of additional 0s to follow. Thus 1000 998 TCHPEI,PE1 20B2021001403200101404200C0B207 46.190675 -63.37085 expands to 1000 998 TCHPEI,PE1 20000000000002000101400002010040000020C000000000000200000000 46.190675 -63.37085 (same example, with color-coding for a visual aid, on the forum)

Run lengths of up to 16 (encoded as 0F) are supported. So right now, with 238 travelers, a segment clinched by nobody 000000000000000000000000000000000000000000000000000000000000 compresses to 0F0F0F0B.

Sure, we could try to encoder longer strings of zeros, but that would probably involve the use of a delimiter (unless we move beyond standard hex digits). • This would use more space for strings of 1-16 zeros, • and yield no savings for strings of 17-32 zeros (0F00 -> 010| ... 0F0F -> 01F|). • We'd only save space on strings > 32 zeros long. Untraveled (by anyone) segments aside, how many of cases these are there likely to be? They could possibly even outweighed by the shorter strings requiring a longer encoding. Space savings could be minimal, maybe even slightly increased. And then, things start to get a little more computationally complex too. KISS principle.

Overall graph size is reduced by ~35%. The only file to get larger (due to more 0->00 conversions than strings of 3+ zeros) was usai-system.tmg, which grew from 2188732 to 2235543 B.

jteresco commented 5 years ago

I have what I need for now - I grabbed the graphs in tmg_2.0a, renamed the collapsed files and replaced "collapsed" with "traveled". Many, many thanks for getting this off the ground so quickly.

jteresco commented 5 years ago

Regarding RLE vs. the hex string, I can see advantages to either. The hex string is a little simpler to deal with, but 35% savings in graph size is something to consider.

On the RLE side, I agree limiting to a single hex digit per "run" makes a lot of sense to me.

As far as usability for my purposes, I think the hex string will be easier to explain and a tad easier to use.

I'm going to build a Java program today to read in these graphs as a starting point for my students' projects.

yakra commented 5 years ago

As far as usability for my purposes, I think the hex string will be easier to explain and a tad easier to use.

OK, let's go with that then. If the students are just starting to learn about this, let's not go confusing matters with extra info or procedures.

I'm going to build a Java program today to read in these graphs as a starting point for my students' projects.

I can't wait to have time to enhance HDX to support the new format and experiment with ways to visualize this info.

Same here; it'll be fun to see the results.

I have what I need for now - I grabbed the graphs in tmg_2.0a, renamed the collapsed files and replaced "collapsed" with "traveled".

OK, good to hear. I'll save that change for when I make the changes for 3 graph sets, and avoid cluttering the commit history with short-lived edits that get undone in the next version.

Odds & ends

yakra commented 5 years ago

Just dumping this here in case I want to refer to it in the future.

std::string HighwaySegment::clinchedby_code(std::list<TravelerList*> *traveler_lists)
{   // Return a hexadecimal string encoding which travelers have clinched this segment, for use in "traveled" graph files
    // Each character stores info for traveler #n thru traveler #n+3
    // The first character stores traveler 0 thru traveler 3,
    // The second character stores traveler 4 thru traveler 7, etc.
    // For each character, the low-order bit stores traveler n, and the high bit traveler n+3.
    std::string code;
    std::vector<unsigned char> clinch_vector(ceil(double(traveler_lists->size())/4)*4, 0);
    for (TravelerList* t : clinched_by)
        clinch_vector.at(t->traveler_num) = 1;

    bool prev_zero = 0;
    for (unsigned short t = 0; t < traveler_lists->size(); t += 4)
    {   unsigned char nibble = 0;
        nibble |= clinch_vector.at(t);
        nibble |= clinch_vector.at(t+1)*2;
        nibble |= clinch_vector.at(t+2)*4;
        nibble |= clinch_vector.at(t+3)*8;
        if (nibble < 10) nibble += '0';
        else nibble += 55;
        // run-length encoding for zeros
        if (nibble == '0')
             {  if (code.empty() || code.back() == 'F')
                prev_zero = 0;
            if (prev_zero)
                if (code.back() == '9') code.back() = 'A';
                else code.back()++;
            else {  prev_zero = 1;
                code += "00";
                 }
             }
        else {  prev_zero = 0;
            code.push_back(nibble);
             }
    }
    return code;
}

More efficient to move the "int"->printable char conversion into the final else block, but that's all academic if this won't be implemented anyway.

yakra commented 5 years ago

Note to self, remember to:

yakra commented 5 years ago

Progress report:

siteupdate.cpp compiles, but will not run properly yet, as I still need to address the items in the list above.

yakra commented 5 years ago

The new code is mostly in place. There are several bugs to sort out, and the first one is a doozy. Before I get into detail on what's happening, I'll just note what I've been using for testing: https://github.com/TravelMapping/HighwayData/tree/23c9c806d0432a796bf5fcfd90b3f7d5b4cce9d4 https://github.com/TravelMapping/UserData/tree/2be0b9f0e13c14d05b399cf5c76f9c3e0097258f

yakra commented 5 years ago

~Current code is at https://github.com/TravelMapping/DataProcessing/tree/tmg2trav_bugs~ <--branch deleted or for a more permanent link, ~here~ here. <--link still works tmg2_cpp branch, also deleted, was at a6b53ef1ebe13ee34a657c256e583b9a85e76088

I think I'm writing, and maybe also reading, thru a rogue pointer. The symptoms are varied, and... interesting. I'll skip the detailed description unless anyone really wants to know, but suffice it to say that things are pretty broken and unusable right now. I have some ideas where to start looking, but fixing this should take a bit of time & a good amount of new debug code.

It may be more productive in the near-term to try an implementation in siteupdate.py I may bounce back & forth between that & debugging this...

yakra commented 5 years ago

https://github.com/TravelMapping/DataProcessing/blob/727c00078d76c15b8952dadfa9bd9a1e12f4a16a/siteupdate/cplusplus/classes/GraphGeneration/HGCollapsedEdge.cpp#L38-L47

Cue the "One cannot simply" meme. We'll need separate construction processes for collapsed and traveled edges. When creating a new traveled edge, we need to look at vertex->incident_t_edges instead, because incident collapsed edges & incident traveled edges can differ.

I don't believe this bug is responsible for the full set of symptoms I was seeing; I think it should only result in garbage graph data. Still needs to be fixed though.

How to most efficiently do this will require a bit of thought, for C++ and Python alike.


Meanwhile, I've been working on the Python implementation. Taking a more staged, step-by-step approach.

yakra commented 5 years ago

Hidden vertices not collapsed in traveled graphs

These have exactly two incident edges, whose sets of clinched_by travelers are not identical. (HighwayData @ a113955992c373c22e7d5e053b4c26912fe0ee97) (UserData @ bab03f1477df35fcbe6c2684423b0cf967a3e6e6)

routebr>@<brwaypoint segment 1 cl.
by
only on
segment 1
segment 2 cl.
by
only on
segment 2
ak.ak002br>@<br+SuzAve AK AK2 HagAve +SuzAve 6 bhemphill AK AK2 +SuzAve +X235384 5  
ca.i605br>@<br+10A CA I-605 10 +10A 37 johninkingwood CA I-605 +10A 11 36  
co.co145br>@<br+X05 CO CO145 ManAve +X05 10   CO CO145 +X05 CR578 11 oscar
ct.ct012br>@<br+GunRd CT CT12 CT184 +GunRd 17 froggie CT CT12 +GunRd CryLakRd 16  
eng.a0057br>@<br+X07 ENG A57 +X06 +X07 0   ENG A57 +X07 +X08 1 si404
eng.a0551br>@<br+X02 ENG A551 A554_E +X02 0   ENG A551 +X02 A553 1 si404
eng.a3051br>@<br+X01 ENG A3051 A334 +X01 2 si404 ENG A3051 +X01 SwaLn 1  
ky.ky2373br>@<br+End KY KY2373 RigRd +End 2   KY KY2373 +End KY3076 3 ohroadscholar
mi.us023br>@<br+MI108 MI US23 HisMillCrkSP +MI108 14   MI US23 +MI108 I-75(338) 16 afarina bhemphill
nh.nh120br>@<br+x4plex NH US4 NH120_N +NH120 16 terescoj NH US4 +NH120 NH120_S 15  
nj.i080br>@<br+47 NJ I-80 47A +47 83 jayhawkco mr_matt_k NJ I-80 +47 47B 81  
nj.i676br>@<br+5A NJ I-676 5 +5A 69   NJ I-676 +5A 5B 71 justjake rlee
nl.nl510br>@<br+X442854 NL NL510 +X797607 +X442854 1 oscar NL NL510 +X442854 +X721066 0  
nl.nl510br>@<br+X721066 NL NL510 +X442854 +X721066 0   NL NL510 +X721066 +X614767 1 oscar
sc.i077br>@<br+9 SC I-77 9A +9 67 rlee SC I-77 +9 9B 66  
svk.d001br>@<br+1A SVK E58 1(D1) +1A(D1) 10 cinx SVK E58 +1A(D1) 4(D1) 9  
tn.us011br>@<br+I-40(380) TN US11 MonRd +I-40(380) 13 Ugnaught2 TN US11 +I-40(380) BucDr 13 jonfmorse
wa.wa020br>@<br+x05 WA WA20 MainSt_Cou +x05 15 chaddean WA WA20 +x05 LibRd 15 mapcat
wa.wa542br>@<br+x110 WA WA542 +x108 +x110 8 terescoj WA WA542 +x110 +x114 7  
yt.yt010br>@<br+DolVarCrk YT YT10 +X195719 +DolVarCrk 1 oscar YT YT10 +DolVarCrk +X553466 0  

I'll give these a look over, and for the ones in my regions, see about either unhiding them or collapsing into visible points.

yakra commented 5 years ago

Python progress report:

yakra commented 5 years ago

subgraphs & traveler_num implemented. I'm finding creating the hex codes a bit more awkward in Python. Giving my latest attempt a try.

yakra commented 5 years ago

Hex codes are in place. The first couple I checked out look OK, but doing a little more QA is still a good idea. Afterwards...

Once the Python changes get merged into master (or I suppose even before), I'll start the C++ implementation over, mirroring the changes commit-by-commit.

yakra commented 5 years ago

Did more QA. Looks good.

Time trials on BiggaTomato: Graph generation altogether takes ~ 48-50% more time.

Potential speed improvements:

Subgraph generation still goes by blazingly fast. This when computing the clinchedby_code for each edge as it's written. I haven't checked on how fast it's going now, or how fast it was going before, but a possible speed enhancement would be to compute & store a clinchedby_code once for each HighwaySegment ... This code can be retrieved (rather than recalculated) and written when its corresponding edge is written.

This conflicts with #199. We can have one enhancement or the other, but not both. I'll still code this anyway, in a separate branch, because I'm curious to see the results.

yakra commented 5 years ago

Storing & retrieving clinchedby_codes yields roughly 10% time savings, or 18.2s on noreaster.

Another way of looking at it is that full TMG 2.0 traveled implementation added 66s of processing time vice 15c7bd610c526a2cc79b4639f9134b9582de8ad7, so we've managed to trim back > 27% of that. But still... outside this context, 18.2s savings altogether is a bit underwhelming.

Edit: Add in > 28 MB of RAM for the strings, and I'm less than wild about this. Permanent links, for when I delete the branch: https://github.com/TravelMapping/DataProcessing/commit/472832867bc5ea1e741a905185e18157f5823c9b https://github.com/yakra/DataProcessing/commit/769b7eb9bf3d9b15aaaf5f377c822da87c4abe29

I'll have a crack at #199, and see what filesize savings and processing time penalty it yields. Edit: #199 takes only 4.7s (3%) more on BiggaTomato. Let's go that route...

yakra commented 5 years ago

The C++ redo is going well. Just squished an obscure bug that resulted in traveled edges with missing intermediate points. Same stage of development as siteupdate.py was in https://github.com/TravelMapping/DataProcessing/issues/196#issuecomment-476916798:

* Raw collapsed or traveled TMGs are not directly DIFFable, because vertices, edges and intermediate points are not ordered deterministically. This required an intermediate step, a separate program to sort & reorder TMGs into a "canonical" file before DIFFing.

yakra commented 5 years ago

C++:

yakra commented 5 years ago

HidClPtsLog

Dead-end branch, to be deleted. Temporary log file used to make sure I was on the right track during Python development. It can be automatically merged back to into yakra/tmg2_Python, but won't work without small modifications, as the is_hidden variable no longer exists.

tmg2_
Python
HidCl
PtsLog
commit
fbd4d60c93156a78af725ab4f82ade943f03269a Merge pull request #71 from yakra/tmg2_Python
71bbd81407d5b00b2cd76d6f2c9fdd42d3162b35 comment: region -> segment
6c30ad4249a1173df236f1a724eea6befaee401d hidden_clinch_points.log
46698286184fa2dcf24316caf7e27c6d876f7246 HGEdge: store canonical HighwaySegment instead of region

Rather than have this as its own discrete loop, it could be integrated into edge compression. How much extra time does it take up?

yakra commented 5 years ago

I might have left this open till both #201 & #209 got merged, IIRC. @jteresco, OK to close this?

jteresco commented 5 years ago

Yes, I think so. Closing.