traveled graphs: record only relevant travelers

TravelMapping / DataProcessing

Data Processing Scripts and Programs for Travel Mapping Project


traveled graphs: record only relevant travelers #199

Closed yakra closed 5 years ago

yakra commented 5 years ago

A side topic to #196:

@jteresco noted that many of our graphs will have lots of segments with very few travelers, meaning many 0 digits, and floated the idea of saving space in traveled graph files. https://github.com/TravelMapping/DataProcessing/issues/196#issuecomment-472569800

I considered run-length encoding of 0s, ultimately dropping the idea on simplicity grounds. https://github.com/TravelMapping/DataProcessing/issues/196#issuecomment-473155091

An idea that will help out in subgraphs is to only record travelers who've clinched at least one segment contained in the graph. For example:

- In TX, right now there are only 129 travelers, meaning we can use 33 characters per clinchedby_code, instead of 60.
- In PE, right now there are only 17 travelers, meaning we can use 5 characters per clinchedby_code, instead of 60.
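
The character counts above follow from each hex character holding one bit for each of four travelers. A minimal sketch of that arithmetic (the function name `code_length` is made up for illustration and is not taken from the siteupdate code):

```python
def code_length(num_travelers: int) -> int:
    """Hex characters needed to give every traveler one bit (4 travelers per character)."""
    return (num_travelers + 3) // 4  # ceiling division by 4


# Traveler counts quoted in the comment above.
for region, travelers in (("TX", 129), ("PE", 17)):
    print(region, travelers, "travelers ->", code_length(travelers), "characters")
```

By the same arithmetic, a 60-character code corresponds to a full traveler list of somewhere between 237 and 240 users.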

Advantages: Requires no changes to the TMG 2.0 traveled "specification", keeping the hex clinchedby_code encoded simply & intuitively.

Limitations: The more travelers we have in a dataset, the smaller the savings, all the way up to tm-master-traveled.tmg, with no savings at all (assuming each traveler has at least one segment traveled).

Disadvantages: Using the full traveler list for all graphs, we can potentially save time (how much?) by generating clinchedby_codes during graph setup, and looking them up, instead of regenerating, for all subgraphs. With subgraphs using different traveler lists and thus different clinchedby_codes for each segment, this is no longer an option.

Implementation: We'd have to reassign traveler numbers for each graph.

- Python, with graphs generated one at a time, is no big deal. Just reuse TravelerList.traveler_num.
- C++, with multi-threaded subgraph generation, is more complicated.
  - Either create a TravelerList->traveler_num hash table for each graph, or
  - Better yet (faster/less complicated), store TravelerList::traveler_num as an array of numthreads elements, and reassign for each graph.
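
A rough Python sketch of the per-graph renumbering idea, not the actual siteupdate code. Everything here is an assumption made for illustration: the function names, the edge labels, who clinched which edge, the alphabetical ordering of travelers, and the bit layout (traveler `t` stored in hex digit `t // 4` with bit value `1 << (t % 4)`).

```python
HEX_DIGITS = "0123456789ABCDEF"


def renumber_travelers(edges):
    """edges maps an edge label to the set of traveler names that clinched it.

    Returns the relevant travelers (sorted) and a name -> number table
    that is valid for this one graph only.
    """
    relevant = sorted(set().union(*edges.values()))
    return relevant, {name: num for num, name in enumerate(relevant)}


def clinchedby_code(clinchers, traveler_num, num_travelers):
    """Build the hex code for one edge under the assumed bit layout."""
    digits = [0] * ((num_travelers + 3) // 4)
    for name in clinchers:
        t = traveler_num[name]
        digits[t // 4] |= 1 << (t % 4)
    return "".join(HEX_DIGITS[d] for d in digits)


# Hypothetical graph: two edges, five travelers overall.
edges = {
    "edge_a": {"epzik8", "griffith"},
    "edge_b": {"epzik8", "griffith", "ngrier", "sneezy", "ua747sp"},
}
names, nums = renumber_travelers(edges)
codes = {label: clinchedby_code(who, nums, len(names)) for label, who in edges.items()}
print(names)
print(codes)
```

Because the numbering table is rebuilt for every graph, codes from one graph cannot be reused in another, which is the lookup-versus-regenerate tradeoff noted under Disadvantages.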

yakra commented 5 years ago

On BiggaTomato, this takes only 4.7s (3%) longer than the code in #201.

Some stats:

all traveled graphs: 239.9 MB -> 193.3 MB (80.6%)
tm-master-traveled.tmg stays at 55.8 MB
all subgraphs: 184.2 MB -> 138.2 MB (75.0%)
area graphs: 2.7 MB -> 1.9 MB (70%)
region graphs: 54.6 MB -> 34.0 MB (62.3%)
system graphs: 12.5 MB -> 11.0 MB (88.0%)
- cantch: 294 kB -> 209 kB (71.0%)
- eure: 3.7 MB -> 2.5 MB (68%)
- usai: 2.2 MB -> 2.1 MB (95%)
- usaus: 6.4 MB -> 6.1 MB (95%)
multisystem graphs: 11.3 MB -> 10.2 MB (90.3%)
- EUR-freeways: 2.4 MB -> 1.6 MB (67%)
- USA-national: 9.0 MB -> 8.7 MB (97%)
multiregion graphs: 6.5 MB -> 3.8 MB (58%)
- british-isles: 3.7 MB -> 2.0 MB (54%)
- CAN-maritimes: 753 kB -> 426 kB (56.6%)
- USA-newengland: 1.4 MB -> 1.1 MB (79%)
- west-indies: 577 kB -> 298 kB (51.6%)
country graphs: 41.3 MB -> 33.1 MB (80.1%)
continent graphs: 55.3 MB -> 44.3 MB (80.1%)

Some examples:

MEX-QROO-region-traveled.tmg has 5 travelers:

TMG 2.0 traveled
3 2
MEX180D@MEX307 21.017955 -86.854477
MEX180D@YUC/QROO 20.878541 -87.608242
MEX180D@MEX180 21.097814 -86.975434
2 0 MEX180D 30
2 1 MEX180D F1 20.97242 -87.208314 20.950859 -87.293158 20.897907 -87.441688
epzik8 griffith ngrier sneezy ua747sp

MEX-CAM-region-traveled.tmg has 1 traveler:

TMG 2.0 traveled
3 2
MEX180D@MEX180_W 19.509962 -90.701966
MEX180D@MEX180_E 19.797536 -90.527494
MEX180D@CAM60 19.636525 -90.671883
2 1 MEX180D 1 19.705627 -90.627036
0 2 MEX180D 1
ua747sp

GNQ-region-traveled.tmg has no travelers:

TMG 2.0 traveled
6 4
N5Med/N5@GAB/GNQ 1.002709 10.779648
N5Med/N5Kom@GNQ/GAB 1.002709 10.762997
N5@Efo 1.009875 10.902643
N5/N5Kom@GAB/GNQ 1.002709 10.725231
N5@Med 1.017255 10.797157
N5Sam/N5@GNQ/GAB 1.00288 10.907192
2 5 N5 
3 1 N5Kom 
4 2 N5  1.025665 10.859213
0 4 N5

In this last example, note the absence of travelers listed at the end, and the extra space between the label and the shaping points, where the clinchedby_code would otherwise go. Something to watch out for when writing a routine that reads these files. @jteresco, thoughts?

jteresco commented 5 years ago

Does it make sense to record a 0 there when there are no travelers, just to simplify reading?

yakra commented 5 years ago

Bear with me; thinking this thru out loud...

Simply having two spaces between the label & shapers can be less than straightforward, depending on the specifics of the language one is working with. For example, take
```
4 2 N5  1.025665 10.859213
```
- In Python, .split(' ')ing the line (splitting on each single space) yields 6 fields, the 4th of which has zero length. SFSG...
- In C++, we can't use strtok, as that will result in 5 fields. We'll have to find a different solution.
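
A quick Python check of how that edge line splits. Note that `str.split` with no argument collapses runs of whitespace, which is roughly what strtok does with consecutive delimiters, while an explicit single-space separator preserves the empty field:

```python
line = "4 2 N5  1.025665 10.859213"  # edge line with an empty clinchedby_code

by_space = line.split(" ")  # split on every single space; keeps the empty field
collapsed = line.split()    # split on runs of whitespace; the empty field is lost

print(len(by_space), by_space)
print(len(collapsed), collapsed)
```

With the collapsing form, the first shaping point coordinate lands where the hex code is expected, so a reader cannot tell the two cases apart by field position alone.
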

A single 0 at first made me bristle a bit, as that to me suggests 1-4 travelers. But is it really so bad? We're already dealing with cases where there are fewer travelers than the array can store (1/4 in MEX-CAM; 5/8 in MEX-QROO).

Another option would be to store the number of travelers in the header:
```
TMG 2.0 traveled
6 4 0
```
- Advantages: Marginally less storage space in zero-traveler graphs. We know how many travelers there are before reading the list in the last line; this could potentially simplify things down the road.
- Disadvantages:
  - It's still a tiny bit more complicated to code: If num_travelers == 0, don't attempt to read a hex code.
  - This means a change to the TMG 2.0 traveled "specification", making the files that the students have obsolete, as well as, potentially, code to read & process the files.

To just give a straight answer to your question:

> Does it make sense to record a 0 there when there are no travelers, just to simplify reading?

Yes.

This would simplify reading, and as such, makes sense. It's effectively just another normal case of extra bits at the end of the array that aren't used by any traveler.

yakra commented 5 years ago

I almost went & implemented the single-0 hex code for zero-traveler graphs, but had one thought...

Normally, a single 0 would mean one to four travelers. The single line at the end of the file lists who they are, and thus gives us the actual number of travelers:

- ABW-region-traveled.tmg: bejacob dharwood froggie kjslaughter
- AZE-region-traveled.tmg: ua747sp

Right now, I'm not writing a terminating line feed after the list of traveler names. Meaning, in cases where there are no travelers, there's one single line feed after the list of edges. Thus many text editors would interpret the final edge as being the final line. Seems a Good Thing would be to add in that one final line feed; that way, all traveled graphs would still contain the "line" listing all the travelers, even if it's a blank one listing nobody. @jteresco, thoughts as to implementation, for HDX and/or whatever else?

jteresco commented 5 years ago

I'd like to keep at least the blank line there. There's also minimal cost and a small potential benefit for those parsing the file to start the list of travelers with a traveler count. I think there's also a small potential benefit for that number of users and maybe even the user list to come before the edges that contain that info. The main downside to me is that it is a more significant deviation from the previous tmg file formats.

yakra commented 5 years ago

> I'd like to keep at least the blank line there.

Agreed. Let's definitely do that. (Here, I'm interpreting the word "there" to mean "in the file", though not necessarily any specific place therein yet.)

> There's also minimal cost and a small potential benefit for those parsing the file to start the list of travelers with a traveler count.

Not strictly necessary as I see it; when we encounter a string of n hex characters, we know we need an array of 4*n travelers. But I agree; minimal cost and a small potential benefit. Not knowing the number of travelers in advance, we'd have to do things a bit differently when parsing the first edge line, which complicates things a bit. Having the number of travelers would simplify.

> I think there's also a small potential benefit for that number of users and maybe even the user list to come before the edges that contain that info. The main downside to me is that it is a more significant deviation from the previous tmg file formats.

How about recording the number of travelers in the second header line as noted above, after the number of vertices & edges? (Minor drawback here is slightly different files from what you & your students already have. Existing code to read traveled TMGs may need to be changed.) The user list itself could stay at the end. IMO this makes parsing the file more convenient, while minimizing the deviation from the previous tmg file formats. The lines for vertices and edges would still be in the same place relative to the beginning of the file.

Shall we still keep the single-0 hex code for zero-traveler graphs, to simplify reading the edge lines? Otherwise, we'd have to account for two spaces between the edge label and shaping points (or a trailing space on lines with no shapers), which could be an annoyance.

yakra commented 5 years ago

The 3 steps I see doing to get this going:

~Definitely do~ DONE:

Add a terminating line feed, thereby guaranteeing a blank line (listing zero traveler names) for zero-traveler graphs.
- 48ed3df2ab3e5276997a819bab2a4019e9034175 #82 #83
- 12d192e119f720ca349a7fd1467b882e73912712

Most likely do:

Use a single-0 hex code for each edge in zero-traveler graphs. Arguably a more clear way to say "The hex code here actually stores no data; just ignore it" than having two whitespace characters between the edge name & shaping points.

To do or not to do?:

Add the number of travelers to the second header line.
- Pro: Knowing the traveler count in advance simplifies setup of the traveler array (or however we choose to implement things) when parsing the first hex code.
- Con: Slight difference to how things are done in the files we already have out there.

I don't see any drawback to keeping the user list at the end of the file. Users would simply be anonymous while parsing the edges, and then have their names assigned when finished.

Another idea that just occurred to me: Bypass writing a traveled graph altogether in cases where the graph has no travelers. I'm not sure how much I like this, though...

- Slight pain to implement in Python.
- A little bit more of a pain to implement in C++.
- Potential to confuse students & other users when looking for a certain graph in the HDX or download page -- "Why is there no GNQ-region-traveled.tmg?"
- Could needlessly complicate setting up the revamped graph download page.

It's probably best for consistency & simplicity's sake to include all three formats for every graph, even in cases where there are no travelers.

yakra commented 5 years ago

Single 0 hex codes and # of travelers in header line implemented.
- Python: https://github.com/yakra/DataProcessing/pull/84
- C++: 652efe3135ff6e65520a05c371d1a4dcce339c90
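
For anyone writing a reader, here is a minimal sketch of parsing the format as settled on in this thread: traveler count as a third number in the second header line, a hex code on every edge line (a single 0 in zero-traveler graphs), and the traveler names on the final line. It is not the HDX or siteupdate code. The function name, the returned tuple layout, the bit layout (traveler `t` in hex digit `t // 4`, bit value `1 << (t % 4)`) and the assumption that labels contain no spaces are all mine.

```python
def read_traveled_tmg(text):
    """Parse a traveled TMG string into (vertices, edges, traveler_names)."""
    lines = text.split("\n")
    if lines[0] != "TMG 2.0 traveled":
        raise ValueError("not a TMG 2.0 traveled file")
    num_v, num_e, num_t = (int(field) for field in lines[1].split())
    expected_len = max(1, (num_t + 3) // 4)  # single 0 when there are no travelers

    vertices = []
    for line in lines[2:2 + num_v]:
        label, lat, lng = line.split()
        vertices.append((label, float(lat), float(lng)))

    raw_edges = []
    for line in lines[2 + num_v:2 + num_v + num_e]:
        fields = line.split()
        code = fields[3]
        if len(code) != expected_len:
            raise ValueError("unexpected clinchedby_code length: " + line)
        # Travelers stay anonymous (numbers only) until the last line is read.
        clinchers = [t for t in range(num_t)
                     if int(code[t // 4], 16) >> (t % 4) & 1]
        shaping = [float(x) for x in fields[4:]]
        raw_edges.append((int(fields[0]), int(fields[1]), fields[2], clinchers, shaping))

    last = 2 + num_v + num_e
    names = lines[last].split() if last < len(lines) else []
    if len(names) != num_t:
        raise ValueError("traveler count does not match header")
    edges = [(v1, v2, label, [names[t] for t in who], shaping)
             for v1, v2, label, who, shaping in raw_edges]
    return vertices, edges, names


# The MEX-QROO example from above, with the traveler count added to the header.
SAMPLE = """TMG 2.0 traveled
3 2 5
MEX180D@MEX307 21.017955 -86.854477
MEX180D@YUC/QROO 20.878541 -87.608242
MEX180D@MEX180 21.097814 -86.975434
2 0 MEX180D 30
2 1 MEX180D F1 20.97242 -87.208314 20.950859 -87.293158 20.897907 -87.441688
epzik8 griffith ngrier sneezy ua747sp
"""
vertices, edges, names = read_traveled_tmg(SAMPLE)
for v1, v2, label, clinched_by, shaping in edges:
    print(v1, v2, label, clinched_by, len(shaping) // 2, "shaping points")
```

Since every edge line now carries a code, plain whitespace splitting is enough, and the zero-traveler case needs no special branch beyond the length check.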