Closed: Robinlovelace closed this issue 2 years ago.
Proposed input data for destinations below, open data from OSM obtained using reproducible script.
Any comments at this early stage welcome @dabreegster.
See #8
Based on reproducible example shown below
I understand, thanks! (Earlier on Slack, I was confused, but you're basically trying to specify different subpoints and weights for origins and destinations.) The CLI, example data, and format all look reasonable -- I'll see if I can implement this tomorrow.
A few questions based on the new dataset...
1) In your first command, I assume not specifying origin subpoints means to randomly pick points?
2) `od_schools.csv` has per-mode columns, but not an `all` column like we've been doing. I can make that field optional and just sum everything up, at the cost of a little more complexity in odjitter. I'm not a huge fan of accepting a variety of input formats, because then you have to clearly document all of them, but this seems like a small case to support. Any thoughts? If people providing the input can easily sum the column on their end, I'd prefer that. (I could run your repro and add the column if so, as some R practice!)
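Summing the per-mode columns caller-side is a one-liner in any data language. A minimal Python sketch (the header names here are assumptions; only the row values come from the example dataset quoted later in this thread):

```python
import csv
import io

# Assumed header names -- the real od_schools.csv may differ.
raw = """geo_code1,geo_code2,walk,bike,other,car
S02001616,S02001616,232,8,70,0
"""

modes = ["walk", "bike", "other", "car"]
rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    # Sum the per-mode counts into an explicit 'all' column.
    row["all"] = str(sum(int(row[m]) for m in modes))

print(rows[0]["all"])  # 310
```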
A naive approach that I think should at least produce an output (but errors when I try it) is as follows:
That's because of a `TODO` in the code -- we only scrape subpoints from line-string input. I'll add support for geojson containing points. (And we should support all geojson types.)
Started https://github.com/dabreegster/odjitter/tree/weighted_subp. A few things left to do. First I want to agree on the interface, though. I think I have a strong preference for having a simple API / surface area, instead of taking a bunch of variations of input. Especially if the caller is using Python or R or some other language where it's trivial to do data transformation, I think doing the "reshaping" work there is simpler.
So specifically:
1) Separate `--subpoints-origins-path` and `--subpoints-destinations-path` flags. If the caller wants to use the same subpoints for both, they repeat the path -- no extra flags or handling for that. There could be cases where scraping and storing the subpoints once for both groups could be advantageous performance-wise, but I'll add that optimization only when it's really needed.
2) The CSV input requires some column for "all trips." For the test dataset you added, I propose we add the column there and always require it as input.
Thanks for the questions, on the case, replies coming up...
First, re the later replies:

> Separate `--subpoints-origins-path` and `--subpoints-destinations-path` flags. If the caller wants to use the same subpoints for both, they repeat the path -- no extra flags or handling for that. There could be cases where scraping and storing the subpoints once for both groups could be advantageous performance-wise, but I'll add that optimization only when it's really needed.
Fully agreed, more specific is better and bindings can do the shortcuts.
> 2. The CSV input requires some column for "all trips." For the test dataset you added, I propose we add the column there and always require it as input.
Yes, that was deliberate, but instead of changing the data I suggest setting the 'all' one to the number of cars, for example. I will change the description of the CLI call above, setting `all-key`. Many OD datasets lack an `all` column, so it's good to have test data that covers those cases in the wild.
> I suggest setting the 'all' one to number of cars

The first row is `S02001616,S02001616,232,8,70,0` -- cars is 0, walk 232, bike 8, other 70. Would that make sense there?
I'm sure many datasets don't have an `all` column, but I'm proposing we require it for input into this tool. I don't think odjitter should be robust to a bunch of different types of input, when that data cleanup / transformation is going to happen anyway in a language better suited for it.
> I'm sure many datasets don't have an `all` column, but I'm proposing we require it for input into this tool. I don't think odjitter should be robust to a bunch of different types of input, when that data cleanup / transformation is going to happen anyway in a language better suited for it.
Fine by me. In cases where the user is interested in bike trips, for example, and `--all-key` is set to `bike`, there would be no need to have an `all` column in the CSV dataset, right?
I guess not. But all the numeric columns will be scaled by `bike / max_per_od`, and that might be nonsensical.
So for the unit test of the school data, shall we use the car or bike column? Or add an `all` column?
> I guess not. But all the numeric columns will be scaled by `bike / max_per_od`, and that might be nonsensical.
I don't think so. I think each numeric output would be scaled by the number of disaggregated OD pairs per original OD pair, and that number would be determined by the bike count.
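That scaling could be sketched like this (assumed semantics based on the discussion; `disaggregate` is a hypothetical helper, not odjitter's API):

```python
import math

def disaggregate(row, key="bike", max_per_od=5):
    # The number of sub-OD-pairs is driven by the thresholded column only.
    n_pairs = max(1, math.ceil(row[key] / max_per_od))
    # Every numeric column is divided evenly across the sub-pairs,
    # so totals are preserved rather than scaled by bike / max_per_od.
    return [{k: v / n_pairs for k, v in row.items()} for _ in range(n_pairs)]

row = {"walk": 232, "bike": 8, "other": 70, "car": 0}
sub = disaggregate(row, key="bike", max_per_od=5)
print(len(sub), sub[0]["walk"])  # 2 116.0
```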
> So for the unit test of the school data, shall we use the car or bike column? Or add all?
I vote car, but any should work. It depends which one is the focus. For current work looking at baseline cycle networks, bike would make sense, but alas the flows are vanishingly small in many areas, including in this synthetic test dataset.
So say we set all-key to "bike", and for some desire line, bike has 0, but car and walk have some trips. What should we do -- zero out car and walk too? (This is not hypothetical, it's happening in the test I'm setting up :) )
> So say we set all-key to "bike", and for some desire line, bike has 0, but car and walk have some trips. What should we do -- zero out car and walk too? (This is not hypothetical, it's happening in the test I'm setting up :) )
It's great that the example data has this edge case. In that case there will be no disaggregation: the `max-per-od` argument only kicks in to split up lines into more than 1 sub-line when there are more than that value in the `all-key` column. Maybe that argument should be `threshold-key`, as it's the key that, when values go above a certain threshold value, triggers disaggregation.
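The threshold behaviour described here could be sketched as follows (assumed semantics, not odjitter's implementation): when the thresholded value is at or below `max-per-od`, the row stays whole, so the other modes' counts survive intact rather than being zeroed.

```python
import math

def split_count(value, max_per_od):
    # At or below the threshold: no disaggregation, keep the row whole.
    if value <= max_per_od:
        return 1
    return math.ceil(value / max_per_od)

# bike == 0 for this desire line: one sub-line, car/walk untouched.
print(split_count(0, 10))   # 1
print(split_count(25, 10))  # 3
```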
Done, and with weights, in #14
Another sanity check just implemented:
Thanks @dabreegster, I go to bed happy : )
In many OD datasets the locations of destinations (e.g. workplaces, shops, schools) are different from the locations of the origins (e.g. residential buildings). Some destinations attract more trips than others, so weighting values are probably also needed.
Based on input data in #8, I imagine this could work something like this:
A naive approach that I think should at least produce an output (but errors when I try it) is as follows:
Illustration of what the output could look like (with `--max-per-od 1000` in this case):