TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project
4 stars 6 forks source link

traveled graphs: record only relevant travelers #199

Closed yakra closed 5 years ago

yakra commented 5 years ago

A side topic to #196:

@jteresco noted that many of our graphs will have lots of segments with very few travelers, meaning many 0 digits, and floated the idea of saving space in traveled graph files. https://github.com/TravelMapping/DataProcessing/issues/196#issuecomment-472569800

I considered run-length encoding of 0s, ultimately dropping the idea on simplicity grounds. https://github.com/TravelMapping/DataProcessing/issues/196#issuecomment-473155091

An idea that will help out in subgraphs is to only record travelers who've clinched at least one segment contained in the graph. For example:

Advantages: Requires no changes to the TMG 2.0 traveled "specification", keeping the hex clinchedby_code encoded simply & intuitively.

Limitations: The more travelers we have in a dataset, the smaller the savings, all the way up to tm-master-traveled.tmg, with no savings at all (assuming each traveler has at least one segment traveled).

Disadvantages: Using the full traveler list for all graphs, we can potentially save time (how much?) by generating clinchedby_codes during graph setup, and looking them up, instead of regenerating, for all subgraphs. With subgraphs using different traveler lists and thus different clinchedby_codes for each segment, this is no longer an option.

Implementation: We'd have to reassign traveler numbers for each graph.

yakra commented 5 years ago

On BiggaTomato, this takes only 4.7s (3%) longer than the code in #201.

Some stats:

Some examples:

MEX-QROO-region-traveled.tmg has 5 travelers:

TMG 2.0 traveled
3 2
MEX180D@MEX307 21.017955 -86.854477
MEX180D@YUC/QROO 20.878541 -87.608242
MEX180D@MEX180 21.097814 -86.975434
2 0 MEX180D 30
2 1 MEX180D F1 20.97242 -87.208314 20.950859 -87.293158 20.897907 -87.441688
epzik8 griffith ngrier sneezy ua747sp 

MEX-CAM-region-traveled.tmg has 1 traveler:

TMG 2.0 traveled
3 2
MEX180D@MEX180_W 19.509962 -90.701966
MEX180D@MEX180_E 19.797536 -90.527494
MEX180D@CAM60 19.636525 -90.671883
2 1 MEX180D 1 19.705627 -90.627036
0 2 MEX180D 1
ua747sp 

GNQ-region-traveled.tmg has no travelers:

TMG 2.0 traveled
6 4
N5Med/N5@GAB/GNQ 1.002709 10.779648
N5Med/N5Kom@GNQ/GAB 1.002709 10.762997
N5@Efo 1.009875 10.902643
N5/N5Kom@GAB/GNQ 1.002709 10.725231
N5@Med 1.017255 10.797157
N5Sam/N5@GNQ/GAB 1.00288 10.907192
2 5 N5 
3 1 N5Kom 
4 2 N5  1.025665 10.859213
0 4 N5 

In this last example, note the absence of travelers listed at the end, and the extra space between the label and the shaping points, where the clinchedby_code would otherwise go. Something to watch out for when writing a routine that reads these files. @jteresco, thoughts?

jteresco commented 5 years ago

Does it make sense to record a 0 there when there are no travelers, just to simplify reading?

yakra commented 5 years ago

Bear with me; thinking this thru out loud...


To just give a straight answer to your question:

Does it make sense to record a 0 there when there are no travelers, just to simplify reading?

Yes.

This would simplify reading, and as such, makes sense. It's effectively just another normal case of extra bits at the end of the array that aren't used by any traveler.

yakra commented 5 years ago

I almost went & implemented the single-0 hex code for zero-traveler graphs, but had one thought...

Normally, a single 0 would mean one to four travelers. The single line at the end of the file lists who they are, and thus gives us the actual number of travelers: ABW-region-traveled.tmg: bejacob dharwood froggie kjslaughter AZE-region-traveled.tmg: ua747sp

Right now, I'm not writing a terminating line feed after the list of traveler names. Meaning, in cases where there are no travelers, there's one single line feed after the list of edges. Thus many text editors would interpret the final edge as being the final line. Seems a Good Thing would be to add in that one final line feed; that way, all traveled graphs would still contain the "line" listing all the travelers, even if it's a blank one listing nobody. @jteresco, thoughts as to implementation, for HDX and/or whatever else?

jteresco commented 5 years ago

I'd like to keep at least the blank line there. There's also minimal cost and a small potential benefit for those parsing the file to start the list of travelers with a traveler count. I think there's also a small potential benefit for that number of users and maybe even the user list to come before the edges that contain that info. The main downside to me is that it is a more significant deviation from the previous tmg file formats.

yakra commented 5 years ago

I'd like to keep at least the blank line there.

Agreed. Let's definitely do that. (Here, I'm interpreting the word "there" to mean "in the file", though not necessarily any specific place therein yet.)

There's also minimal cost and a small potential benefit for those parsing the file to start the list of travelers with a traveler count.

Not strictly necessary as I see it; when we encounter a string of n hex characters, we know we need an array of 4*n travelers. But I agree; minimal cost and a small potential benefit. Not knowing the number of travelers in advance, we'd have to do things a bit differently when parsing the first line, complicates things a bit. Having the number of travelers would simplify.

I think there's also a small potential benefit for that number of users and maybe even the user list to come before the edges that contain that info. The main downside to me is that it is a more significant deviation from the previous tmg file formats.

How about recording the number of travelers in the second header line as noted above, after the number of vertices & edges? (Minor drawback here is slightly different files from what you & your students already have. Existing code to read traveled TMGs may need to be changed.) The user list itself could stay at the end. IMO this makes parsing the file more convenient, while minimizing the deviation from the previous tmg file formats. The lines for vertices and edges would still be in the same place relative to the beginning of the file.

Shall we still keep the single-0 hex code for zero-traveler graphs, to simplify reading the edge lines? Otherwise, we'd have to account for two spaces between the edge label and shaping points (or a trailing space on lines with no shapers), which could be an annoyance.

yakra commented 5 years ago

The 3 steps I see doing to get this going:

~Definitely do~ DONE:

Most likely do:

To do or not to do?:


Another idea that just occurred to me: Bypass writing a traveled graph altogether in cases where the graph has no travelers. I'm not sure how much I like this, though...

It's probably best for consistency & simplicity's sake to include all three formats for every graph, even in cases where there are no travelers.

yakra commented 5 years ago

single 0 hex codes and # of travelers in header line implemented. Python: https://github.com/yakra/DataProcessing/pull/84 C++: 652efe3135ff6e65520a05c371d1a4dcce339c90