Closed Jake-Moss closed 7 months ago
I think a summary of my reading today would be useful. As I see it we have 3 real approaches to this.
1. Manually write `RowGroup`s through the use of the `NewRowGroup` function. This, however, does require manual sizing of the `RowGroup`s and is really getting into the weeds of the Parquet format. It would require careful reading of the Parquet format spec and some statistical analysis of our data in order to not completely destroy the format. It would also mean dropping the `Table` format in favour of `RecordBatch`es. This isn't a real issue and arguably makes more sense given our data.
2. Compute results in parallel and have a worker thread consuming a queue of results to write. We'd have the same benefits of multithreaded IO and limiting memory usage.
3. Store everything keyed by `origin id`s, then read in a particular index. How this would be implemented I'm not sure, but I don't think it would read the whole dataset and then keep only the requested part.

Overall, option 3 isn't great; it's just here for comparison. It would be much more appealing if we needed to stream this data over a (fast) network or operate directly on the resulting table. In terms of implementation ease, option 1 would by far be the quickest and easiest; however, it would come with the memory issues. I think option 2 is overall the best bet, however it does have more moving parts.
Optimistically I think option 1 could be done in a day, with option 2 taking maybe (?) 2-3 days, accounting for the idiosyncrasies of Cython, C++, and Arrow.
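The worker-thread shape of option 2 can be sketched with Python's standard library alone. Everything here is illustrative (the `write_batch` stand-in, the batch contents, the queue size), not the actual implementation:

```python
import queue
import threading

# Sketch of option 2: computation pushes finished batches onto a queue,
# and a single worker thread consumes the queue and "writes" each batch.
results = []

def write_batch(batch):
    # Hypothetical stand-in for the real Parquet write.
    results.append(batch)

def writer(q):
    while True:
        batch = q.get()
        if batch is None:  # sentinel: no more work
            break
        write_batch(batch)

q = queue.Queue(maxsize=4)  # a bounded queue is what limits memory usage
t = threading.Thread(target=writer, args=(q,))
t.start()

for origin in range(10):  # stand-in for the per-origin path computation
    q.put({"origin": origin})
q.put(None)
t.join()
```

The bounded queue is the key design point: if the writer falls behind, `q.put` blocks the producers, so memory stays capped even when computation outpaces IO.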
EXCELLENT summary, @Jake-Moss. I am inclined to say that option 2 is indeed the ideal, but there is no absolute need for the saving of path files to be incredibly fast when performing assignment.
Maybe writing choice sets to be used later would be a welcome option, @jamiecook .
I've got an idea for using option 2 in a much simpler and easier-to-reason-about manner that I'm working on now: we can just batch the computation. Instead of using a complicated worker-thread approach to save while generating, I think we could just compute a portion of the results, switch to dumping them, then switch back to computation. It wouldn't be as fancy or fast, but it would be much easier to implement, particularly as a first attempt.
Nice summary @Jake-Moss. Batching sounds reasonable; another approach could potentially be to not worry about the number of files and related RowGroups while processing data, but instead to add a post-processing step at the end.
I've just pushed the first form of real saving, test compatibility will come in a few moments.
I've managed to implement all the desired features with one real catch: writing to disk requires the GIL. It is, however, multi-threaded on PyArrow's end.
I attempted to keep all the disk writing and data transformation in Cython, interoperating with the PyArrow Cython and C++ APIs; however, I've found that the Cython API is incredibly undocumented and fragile. Unfortunately, working with the C++ API has its own caveats and difficulties. I wasn't able to decipher it; perhaps someone with more C++ experience might have been able to, but I'm not sure the investment would be worth the rewards, and I don't expect having to reacquire the GIL after each batch to have a severe impact on runtime. In my explorations I've found issues with `cimport`s, and `const` attributes where Cython gets confused.

Due to the inflexibilities of the API, or rather my inability to bend it to my will, I've changed the data model slightly. Instead of having a row per origin id, which has a `MapArray` of destination ids to lists, we now have a row per OD pair with the list as a top-level value. I think this is both more flexible from a user perspective and easier to reason about. AFAIK this is just an overall better way to store things.
A small implementation detail to be aware of is that the batching for the disk writing requires that a whole partition be written at once. That is, all OD pairs with a specific origin must be computed and dumped to disk at once; otherwise the previous results will be overwritten. This isn't really an issue and is handled already, but it's worth noting that the batching isn't as granular as it previously was. It also means that attempting to append to an existing dataset using an existing origin value will overwrite the stored results.
The current implementation allows for `None`.

For the modifications to CI, see https://arrow.apache.org/docs/python/integration/extending.html#building-extensions-against-pypi-wheels
I've done some profiling on the frequency-calculation code I pushed yesterday, and I believe it's fast enough not to be worth optimising further. On the QLD sample data it makes up about 0.1% of the runtime, as reported by perf sampling at 2000 Hz. Path finding is still the most time-consuming part.
Regarding the CI, I attempted to fix it locally but had no luck. It's all build issues related to gcc failing to find `libarrow`. PyArrow is meant to ship with this, but with an ABI tag. The `python -c "import pyarrow; pyarrow.create_library_symlinks()"` call is supposed to fix this, but it's not. A recommendation from an existing issue is to build against the conda version of pyarrow, which is not useful. https://github.com/apache/arrow/issues/22388
Despite CI failures, all tests pass locally with 78.80% coverage
Closes #510, closes #488
Merging this to make a clean slate for new API and assignment changes
Migrated from https://github.com/Jake-Moss/aequilibrae/pull/1
I don't think the approach I currently have is the way to go, but it functions as a POC. I'll revamp it tomorrow, utilising the table more now that I have a better understanding.
Ideally we'd be able to
Currently, the approach I have isn't able to be partitioned, although I think it does satisfy the other two.
To allow partitioning I think I'll have to move the origin index up to the main table; this would also let me drop the `column index` field and the struct entirely. But I think this would force the use of string keys for the columns in the table.