Closed Jake-Moss closed 7 months ago
I think a summary of my reading today would be useful. As I see it we have 3 real approaches to this.
1. Manually write `RowGroup`s through the use of the `NewRowGroup` function. This, however, does require manual sizing of the `RowGroup`s and is really getting into the weeds of the Parquet format. It would require careful reading of the Parquet format spec and some statistical analysis of our data in order to not completely destroy the format. It would also mean dropping the `Table` format in favour of `RecordBatch`es. This isn't a real issue and arguably makes more sense given our data.
2. Compute results in parallel and have a worker thread consuming a queue of results to write. We'd have the same benefits of multithreaded IO and limiting memory usage.
3. Store everything keyed by `origin id`s, then read in a particular index. How this would be implemented I'm not sure, but I don't think it would read the whole dataset and then keep only the requested part.

Overall, option 3 isn't great; it's just here for comparison. It would be much more appealing if we needed to stream this data over a (fast) network or operate directly on the resulting table. In terms of implementation ease, option 1 would by far be the quickest and easiest; however, it would come with the memory issues. I think option 2 is overall the best bet, however it does have more moving parts.
Optimistically I think option 1 could be done in a day, with option 2 taking maybe (?) 2-3 days, accounting for the idiosyncrasies of Cython, C++, and Arrow.
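The worker-thread shape of option 2 can be sketched with Python's standard library alone. Everything here is illustrative (the `write_batch` stand-in, the batch contents, the queue size), not the actual implementation:

```python
import queue
import threading

# Sketch of option 2: computation pushes finished batches onto a queue,
# and a single worker thread consumes the queue and "writes" each batch.
results = []

def write_batch(batch):
    # Hypothetical stand-in for the real Parquet write.
    results.append(batch)

def writer(q):
    while True:
        batch = q.get()
        if batch is None:  # sentinel: no more work
            break
        write_batch(batch)

q = queue.Queue(maxsize=4)  # a bounded queue is what limits memory usage
t = threading.Thread(target=writer, args=(q,))
t.start()

for origin in range(10):  # stand-in for the per-origin path computation
    q.put({"origin": origin})
q.put(None)
t.join()
```

The bounded queue is the key design point: if the writer falls behind, `q.put` blocks the producers, so memory stays capped even when computation outpaces IO.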
EXCELLENT summary, @Jake-Moss. I am inclined to say that option 2 is indeed the ideal, but there is no absolute need for the saving of path files to be incredibly fast when performing assignment.
Maybe writing choice sets to be used later would be a welcome option, @jamiecook .
I've got an idea for using option 2 in a much simpler and easier-to-reason-about manner that I'm working on now: we can just batch the computation. Instead of using a complicated worker-thread approach to save while generating, I think we could just compute a portion of the results, switch to dumping them, then switch back to computation. It wouldn't be as fancy or fast, but it would be much easier to implement, particularly as a first attempt.
Nice summary @Jake-Moss. Batching sounds reasonable; another approach could potentially be to not worry about the number of files and related RowGroups while processing data, but instead to add a post-processing step at the end.
I've just pushed the first form of real saving, test compatibility will come in a few moments.
I've managed to implement all the desired features with one real catch: writing to disk requires the GIL. It is, however, multi-threaded on PyArrow's end.
I attempted to keep all the disk writing and data transformation in Cython, interoperating with the PyArrow Cython and C++ APIs; however, I've found that the Cython API is incredibly undocumented and fragile. Unfortunately, working with the C++ API has its own caveats and difficulties. I wasn't able to decipher it; perhaps someone with more C++ experience might have been able to, but I'm not sure the investment would be worth the rewards, and I don't expect having to reacquire the GIL after each batch to have a severe impact on runtime. In my explorations I've found issues with `cimport`s, and `const` attributes where Cython gets confused.

Due to the inflexibilities of the API, or rather my inability to bend it to my will, I've changed the data model slightly. Instead of having a row per origin id, which has a `MapArray` of destination ids to lists, we now have a row per OD pair with the list as a top-level value. I think this is both more flexible from a user perspective and easier to reason about. AFAIK this is just an overall better way to store things.
A small implementation detail to be aware of is that the batching for the disk writing requires that a whole partition be written at once. That is, all OD pairs with a specific origin must be computed and dumped to disk at once; otherwise the previous results will be overwritten. This isn't really an issue and is handled already, but it's worth noting that the batching isn't as granular as it previously was. It also means that attempting to append to an existing dataset using an existing origin value will overwrite the stored results.
The current implementation allows for `None`.

For the modifications to CI, see https://arrow.apache.org/docs/python/integration/extending.html#building-extensions-against-pypi-wheels
I've done some profiling on the frequency-calculation code I pushed yesterday, and I believe it's fast enough not to be worth optimising further. On the QLD sample data it makes up about 0.1% of the runtime, as reported by perf sampling at 2000 Hz. Path finding is still the most time-consuming part.
Regarding the CI, I attempted to fix it locally but had no luck. It's all build issues related to gcc failing to find `libarrow`. PyArrow is meant to ship with this, but with an ABI tag. The `python -c "import pyarrow; pyarrow.create_library_symlinks()"` call is supposed to fix this, but it's not. A recommendation from an existing issue is to build against the conda version of pyarrow, which is not useful. https://github.com/apache/arrow/issues/22388
Despite CI failures, all tests pass locally with 78.80% coverage
Closes #510, closes #488
Merging this to make a clean slate for new API and assignment changes
Migrated from https://github.com/Jake-Moss/aequilibrae/pull/1
I don't think the approach I currently have is the way to go, but it functions as a POC. I'll revamp it tomorrow, utilising the table more now that I have a better understanding.
Ideally we'd be able to
Currently, the approach I have isn't able to be partitioned, although I think it does satisfy the other two.
To allow partitioning I think I'll have to move the origin index up to the main table; this would also let me drop the `column index` field and the struct entirely. But I think this would force the use of string keys for the columns in the table.