jcrist opened this issue 9 years ago
Validating that input blocks don't overlap with existing blocks sounds like a clear win.
Appending to existing partitions might be useful but might also be more than we want to maintain within castra itself. @esc and I considered buffering blocks and appending to existing blocks when we first built castra, but decided against it in order to keep castra very simple. Our plan was to add this stuff on top of castra in external code. This came out of dealing with the bcolz codebase which, while much more fully featured than castra, is also more expensive to maintain. It may be that it's time to revisit this decision; I just wanted to share the historical reasons for how we've tried to keep the core simple.
I like the `extend_sequence` idea. It matches how I tend to use castra today, e.g. `for df in dfs: mycastra.extend(df)`, and aligns well with the idea that buffering logic can exist external to the existing model.
Appending onto existing blocks sounds like it might be tricky. I understand that you've been diving into bloscpack to do this. I suspect that this would marry castra and bloscpack more tightly than they are currently coupled. That tight coupling concerns me, especially if we want to switch to other compression libraries in the future; the concern is motivated a bit by bloscpack not releasing the GIL, see https://github.com/Blosc/python-blosc/issues/101. I would be -0.5 on any change that removed the option of switching libraries going forward.
My thoughts:
Castras should have the following invariants:

- The index is monotonically increasing
- Partitions don't overlap

Currently neither is enforced, so you can end up with partitions like `[[1, 2, 3, 3], [3, 3, 4, 5, 6], ...]`.
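As a rough illustration of why that matters (hypothetical values and lookup logic, not castra's actual code), a boundary-based lookup can silently miss rows once a value spans two blocks:

```python
import numpy as np

# Hypothetical index values; the value 3 spills across both partitions.
partitions = [np.array([1, 2, 3, 3]), np.array([3, 3, 4, 5, 6])]

# A lookup that routes a query to a single block via partition end-boundaries...
ends = np.array([p[-1] for p in partitions])   # array([3, 6])
block = np.searchsorted(ends, 3)               # 0 -> only the first block is consulted

# ...finds 2 of the 4 rows whose index equals 3.
found = (partitions[block] == 3).sum()
total = sum((p == 3).sum() for p in partitions)
print(found, "of", total)                      # 2 of 4
```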
Additionally, having the index be a time series partitioned by some period is a common pattern. We should try to make this as easy as possible for users, while also ensuring the two invariants above.
In my mind, the following use case should work:
```python
# Create a castra partitioned by day:
c = Castra('filepath', template=temp, partitionby='d')

# Add some existing data
c.extend_sequence(some_iterator)
c.close()

# Get new data at a later time, and add it, while keeping the partitioning scheme
c = Castra('filepath')
c.extend(df)
```
I really want to support this functionality, as it's something I would expect from a tool like this. Saying "this castra is partitioned by day" means to me that both `extend` and `extend_sequence` should respect that. `extend` will modify the disk (slower), but ensures the partitioning scheme is kept. `extend_sequence` will buffer during the function call, but after completion will drop the buffer (no internal state). Having both options available seems like a good idea to me.

I'm not sure that castras should manage partition sizes; this may be a Pandora's box (although if you have an implementation that does this well, that could be a good counterargument).
All use cases that I've come across would be satisfied by moving the `partitionby` keyword argument to `extend_sequence`:
```python
# Create a castra
c = Castra('filepath', template=temp)

# Add some existing data, partitioned by day
c.extend_sequence(some_iterator, partitionby='d')
c.close()
```
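Roughly what I have in mind for the partitioning inside `extend_sequence` (a sketch only; it assumes a DatetimeIndex and buffers everything before writing, whereas a real version would flush as each period completes):

```python
import pandas as pd

def extend_sequence(castra, dfs, partitionby='d'):
    # Buffer the incoming frames in memory, then write one block per period.
    buffered = pd.concat(list(dfs))
    for _, part in buffered.groupby(pd.Grouper(freq=partitionby)):
        if len(part):
            castra.extend(part)   # each non-empty group becomes one partition
```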
Direct use of `extend` is up to the user to coordinate:
```python
c = Castra('filepath')
c.extend(df)  # user manages partition size directly
```
This keeps a lot of logic out of the actual castra object and yet satisfies most use cases I can think of. It's also something that I think can be done very cheaply.
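For comparison, the user-managed version is just a pandas groupby in front of `extend` (a sketch, again assuming a DatetimeIndex):

```python
import pandas as pd

# User-side partitioning: one extend call per day of data.
for _, day_chunk in df.groupby(pd.Grouper(freq='d')):
    if len(day_chunk):
        c.extend(day_chunk)
```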
If you have a castra that already exists on disk up to May 15, and you have a dataframe from May 16 to June 16, what does `c.extend_sequence([df], partitionby='M')` do? What does `c.extend(df)` do?
Or a simpler case, suppose you have a castra that has an index up to May 16, 0:00:00, and you have a dataframe with a few more datapoints at that same time. How can you add that dataframe to the castra, without modifying the existing partitions?
In the first case I would expect `extend_sequence` to add two partitions and `extend` to add one partition. In the second case I would expect castra to throw an error.
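To make the first case concrete (the dates are assumed purely for illustration), grouping that range by month splits it into two chunks, hence two partitions from `extend_sequence`:

```python
import pandas as pd

# New data covering May 16 through June 16, one row per day.
idx = pd.date_range('2015-05-16', '2015-06-16', freq='d')
df = pd.DataFrame({'x': range(len(idx))}, index=idx)

# Monthly grouping yields two chunks: May 16-31 (16 rows) and June 1-16 (16 rows).
for month, chunk in df.groupby(pd.Grouper(freq='M')):
    print(month.date(), len(chunk))
```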
If there is an application where the second case ends up being really important (e.g. log files that come in slightly out of order) then that sounds like a motivating use case. Do you have such a case?
I don't, and I don't think castra should handle "out-of-order" data specifically. I do think it should handle overlapping boundaries, though (end of the castra's index == start of the next frame).
I do think that adding periodic new data to an existing castra is something that should work, and should be easy to do. Sometimes these datasets overlap. My main use case is also covered by `extend_sequence`, but this seemed like a good thing to add, especially if it's moderately cheap.
I just tried using `df.to_castra(..., sorted_index_column='pickup_datetime')` on the nyctaxi dataset and got this error:

```
ValueError: Index of new dataframe less than known data
```

So it looks like we are erring at least in the case of `<`. This should probably be changed to `<=`.
I've been working on a refactor of Castra - before I spend any more time on this, I should probably get some feedback. Here's the plan:
Issues I'm attempting to solve:

- Partitions like `[[1, 2, 3, 3], [3, 3, 4, 5, 6], ...]` were possible (and happened to me)

The plan:
- Add `partitionby=None` to the `__init__` signature. This will live in `meta`. If `None`, no repartitioning is done by Castra. It can also be a time period (things you can pass to `resample` in pandas).
- `extend` checks the current partitions for equality overlap (even if `partitionby=None`). There are 3 cases that can happen here: the new data starts after the existing index (write new partitions), it starts exactly at the last known index value (append to the existing final partition), or it starts before it (raise an error). A rough sketch follows below.
- If `partitionby != None`, then data is partitioned by Castra into blocks. `extend` should still take large dataframes (calling `extend` on a row is a bad idea), but will group them into partitions based on the rule passed to `partitionby`. Using the functionality provided by bloscpack, the on-disk partitions can be appended to with little overhead. This makes writes slightly slower in the cases where an append happens, but has no penalty on reads.
- Add an `extend_sequence` function. This takes an iterable of dataframes (can be a generator), and does the partitioning in memory instead of on disk. This will be faster than calling `extend` in a loop (no on-disk appends), but will result in the same disk file format.

This method means that the disk will match what's in memory after calls to `extend` or `extend_sequence` complete, will allow castra to do partitioning for the user, and will ensure that the partitions are valid. I have a crude version of this working now, and have found writes to be only slightly penalized when appends happen (no penalty if they don't), and no penalty for reading from disk.
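For what it's worth, here is a rough sketch of the `extend` path described above (method names like `last_index`, `append_to_last_partition`, and `write_new_partition` are placeholders, not castra's real internals, and a DatetimeIndex is assumed when `partitionby` is set):

```python
import pandas as pd

def extend(castra, df, partitionby=None):
    # Sketch of the proposed logic, not the actual implementation.
    last = castra.last_index()                     # placeholder: last known index value
    if last is not None and df.index[0] < last:
        raise ValueError("Index of new dataframe less than known data")

    if partitionby is None:
        groups = [(None, df)]
    else:
        groups = df.groupby(pd.Grouper(freq=partitionby))

    for _, part in groups:
        if not len(part):
            continue
        if last is not None and part.index[0] == last:
            # Equality overlap: append to the existing final block via bloscpack.
            castra.append_to_last_partition(part)  # placeholder
        else:
            castra.write_new_partition(part)       # placeholder
        last = part.index[-1]
```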