Open goodboy opened 4 years ago
Maybe also a small demonstration of the client behaviors:
isvariablelength=False
aka not an append but writing an explicit `Nanoseconds`:
[nav] In [32]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3, 140000)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'), ('Nanoseconds', 'i4')]),'Monkey/1Sec/TICK', isvariablelength=False),
Out[32]: ({'responses': None},)
[nav] In [33]: client.query(pymarketstore.Params('Monkey', '1Sec', 'TICK')).first().df()
Out[33]:
Bid Nanoseconds
Epoch
2016-01-01 10:00:00+00:00 3.0 140000
[nav] In [34]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3, 140001)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'), ('Nanoseconds', 'i4')]),'Monkey/1Sec/TICK', isvariablelength=False),
Out[34]: ({'responses': None},)
[nav] In [35]: client.query(pymarketstore.Params('Monkey', '1Sec', 'TICK')).first().df()
Out[35]:
Bid Nanoseconds
Epoch
2016-01-01 10:00:00+00:00 3.0 140001
So if you never wrote `Nanoseconds` and `isvariablelength=False`, then you don't get it magically created:
[ins] In [45]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'),]),'Monkey_NO_NANO/1Sec/TICK', isvariablelength=False),
Out[45]: ({'responses': None},)
[ins] In [46]: client.query(pymarketstore.Params('Monkey_NO_NANO', '1Sec', 'TICK')).first().df()
Out[46]:
Bid
Epoch
2016-01-01 10:00:00+00:00 3.0
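A side note on the `Epoch` values in these writes: in Python 3, `pd.Timestamp(...).value/10**9` is a float that numpy then casts to `'i8'`. A minimal stdlib-only sketch of splitting a nanosecond timestamp into the `Epoch` and `Nanoseconds` fields with integer arithmetic might look like this; `split_timestamp_ns` is a hypothetical helper name, not part of pymarketstore.

```python
from datetime import datetime, timezone

NS_PER_SEC = 10**9


def split_timestamp_ns(ts_ns):
    """Split integer nanoseconds since the Unix epoch into (Epoch, Nanoseconds)."""
    return divmod(ts_ns, NS_PER_SEC)


# 2016-01-01 10:00:00 UTC plus 140000 ns, built without pandas.
base = int(datetime(2016, 1, 1, 10, tzinfo=timezone.utc).timestamp())
epoch, nanos = split_timestamp_ns(base * NS_PER_SEC + 140000)
print(epoch, nanos)
```

With pandas, the same integer input would be `pd.Timestamp(...).value`, which is already nanoseconds since the epoch.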
isvariablelength=True
aka an append:
[nav] In [37]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3, 140000)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'), ('Nanoseconds', 'i4')]),'APPEND/1Sec/TICK', isvariablelength=True)
Out[37]: {'responses': None}
[ins] In [39]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3, 140001)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'), ('Nanoseconds', 'i4')]),'APPEND/1Sec/TICK', isvariablelength=True),
Out[39]: ({'responses': None},)
[ins] In [40]: client.query(pymarketstore.Params('APPEND', '1Sec', 'TICK')).first().df()
Out[40]:
Bid Nanoseconds
Epoch
2016-01-01 10:00:00+00:00 3.0 140000
2016-01-01 10:00:00+00:00 3.0 140001
But wait, let's continue with that and find our magic `Nanoseconds` created for us always:
[nav] In [41]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'),]),'APPEND/1Sec/TICK', isvariablelength=True),
Out[41]: ({'responses': None},)
[nav] In [42]: client.query(pymarketstore.Params('APPEND', '1Sec', 'TICK')).first().df()
Out[42]:
Bid Nanoseconds
Epoch
2016-01-01 10:00:00+00:00 3.0 0
2016-01-01 10:00:00+00:00 3.0 140000
2016-01-01 10:00:00+00:00 3.0 140001
^ That doesn't happen on a bucket that was first written with `isvariablelength=False` (`Monkey` above), even if this write passes `isvariablelength=True`:
[ins] In [44]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'),]),'Monkey/1Sec/TICK', isvariablelength=True),
Out[44]:
({'responses': [{'error': 'unable to match data columns ([{Epoch INT64} {Bid FLOAT32}]) to bucket columns ([{Epoch INT64} {Bid FLOAT32} {Nanoseconds INT32}])',
'version': '34352c9738c9164d7c65264a532d99341c57fae2'}]},)
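To keep the transcripts above straight, here is a toy in-memory model that reproduces only what was observed from the client. It is not marketstore's implementation. `ToyBucket` and its methods are hypothetical names, and the rule that a variable-length write pulls `Nanoseconds` out of the regular columns is an assumption chosen because it fits all four sessions.

```python
class ToyBucket:
    """In-memory stand-in for one bucket; models the observations only."""

    def __init__(self):
        self.columns = None   # column names, fixed by the first write
        self.variable = None  # record type, fixed by the first write
        self.rows = []

    def write(self, row, isvariablelength=False):
        row = dict(row)
        nanos = 0
        if isvariablelength:
            # Assumption: variable-length writes treat Nanoseconds as an
            # intra-second offset, not a regular column (0 when omitted).
            nanos = row.pop("Nanoseconds", 0)
        cols = list(row)
        if self.columns is None:
            self.columns, self.variable = cols, isvariablelength
        elif cols != self.columns:
            msg = (f"unable to match data columns ({cols}) "
                   f"to bucket columns ({self.columns})")
            return {"responses": [{"error": msg}]}
        if self.variable:
            # Append: several records may share one Epoch.
            self.rows.append({**row, "Nanoseconds": nanos})
        else:
            # Fixed: one record per Epoch, a later write replaces an earlier one.
            self.rows = [r for r in self.rows if r["Epoch"] != row["Epoch"]]
            self.rows.append(row)
        return {"responses": None}

    def read(self):
        # Assumption: rows come back ordered by (Epoch, Nanoseconds).
        return sorted(self.rows, key=lambda r: (r["Epoch"], r.get("Nanoseconds", 0)))


E = 1451642400  # 2016-01-01 10:00:00 UTC
bucket = ToyBucket()
bucket.write({"Epoch": E, "Bid": 3.0, "Nanoseconds": 140000}, isvariablelength=True)
bucket.write({"Epoch": E, "Bid": 3.0}, isvariablelength=True)
print([r["Nanoseconds"] for r in bucket.read()])
```

Combinations that were never tried against the real server (for example a fixed-length write into a variable-length bucket) fall out of this model arbitrarily and should not be read as predictions.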
I'm looking at the relevant server code sections:
You know what's handy? Putting the name `Append` somewhere in the func name :wink:
Secondly, it's probably worth mentioning that there's some kind of relationship with the `Nanoseconds` column. I got real confused when using the client to do stuff and got weird, different numbers back depending on whether I wrote the `Nanoseconds` field and used `isvariablelength` (look, the unit tests and me are the same :smiling_face_with_three_hearts:).
That is, if `isvariablelength` is set:
- when writing a `ColumnSeries` to disk, the `'Nanoseconds'` column is removed before row conversion, which appears to be to avoid a mismatch error when comparing column field names against the previous `tbi` (time bucket info) :thinking:
- in the conversion to a `RowSeries` (https://github.com/alpacahq/marketstore/blob/7537537da8c02d71179bb46d57e19ba7a7f86d01/utils/io/columnseries.go#L235) the `rowType` isn't passed through, so the `DataShape{"Nanoseconds", INT32}` is never appended to the `dataShapes` array, which makes one wonder: "why aren't we passing it through when, per the `NewRowSeries` comment, we need it when we read?"
- `ColumnSeries.GetTime()` is called just before all this and its result is later passed to `Writer.WriteRecords()`.
Ok so let's stop and think here.
We're removing `Nanoseconds` because, before writing to disk, we convert `ColumnSeries` -> `RowSeries` without passing through the `rowType` flag, which would make `NewRowSeries` add the `'Nanoseconds'` `DataShape`, which we apparently need.
But really it's because we already got the `Nanoseconds` out and are passing it as a `[]time.Time` to `Writer.WriteRecords()`?
Uh, ok so I guess the "read" means when `GetTime()` gets called, or?
Again, the comment says we need this `Nanoseconds` field for "reading", and `GetTime()` seems to need it for generating a `[]time.Time` output if there is a `Nanoseconds` column. Well that's good because (as mentioned in the last bullet ^) we are calling it and then handing the result to `Writer.WriteRecords()`.
Let's note that the `ColumnSeriesMap.FilterColumns()` method requires `Nanoseconds` as part of the index.
Ok where are we again?
- `ColumnSeriesMap` needs `Nanoseconds` for the index but `ColumnSeries` doesn't
- we take each `ColumnSeries` in the `ColumnSeriesMap` and remove the `Nanoseconds` fields, because when converting to a `RowSeries` we don't pass through the `rowType` flag which would add that `DataShape` for `Nanoseconds`
- we call `times := cs.GetTime()` and eventually pass that to the writer routine, whilst documenting in that method that we need `Nanoseconds` for reading :exploding_head:
So this all seems pretty circular.
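The three steps recapped above can be pictured with a small Python stand-in for the Go code. This is a sketch of the flow as I read it, not a port. A `ColumnSeries` is modelled as a plain dict of column lists, timestamps as integer nanoseconds, and `get_time` and `prepare_write` are hypothetical names.

```python
NS_PER_SEC = 10**9


def get_time(cs):
    """Combine Epoch seconds and optional Nanoseconds into integer ns timestamps."""
    epochs = cs["Epoch"]
    nanos = cs.get("Nanoseconds", [0] * len(epochs))
    return [e * NS_PER_SEC + n for e, n in zip(epochs, nanos)]


def prepare_write(cs, is_variable_length):
    # Step 1: timestamps are taken while the Nanoseconds column still exists.
    times = get_time(cs)
    # Step 2: for variable-length writes the column is dropped before the
    # column-to-row conversion, so the rows carry no Nanoseconds field.
    cols = {name: vals for name, vals in cs.items()
            if not (is_variable_length and name == "Nanoseconds")}
    rows = [dict(zip(cols, vals)) for vals in zip(*cols.values())]
    # Step 3: rows and timestamps travel to the writer separately.
    return times, rows


cs = {"Epoch": [1451642400, 1451642400],
      "Bid": [3.0, 3.0],
      "Nanoseconds": [140000, 140001]}
times, rows = prepare_write(cs, is_variable_length=True)
print(times)
print(rows)
```

Seen this way the sub-second information is not lost when the column is dropped, it just moves from the row data into the separately passed timestamps.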
Alright let's go back to what we were doing. Right, `WriteCSM()`, we're writing our `ColumnSeriesMap` to disk! https://github.com/alpacahq/marketstore/blob/88008f2a76ec926e9efd2093cb59462d5e8274ba/executor/writer.go#L282
Ok so if a `tbi` doesn't exist and `isVariableLength` is set, we're gonna pass `recordType=io.VARIABLE` to `io.NewTimeBucketInfo()`.
So we have a `ColumnSeries` with no `Nanoseconds` `DataShape` (if `isVariableLength` is set) and we're making a new `TimeBucketInfo` with a "variable length" record type, meaning this stuff gets set: https://github.com/alpacahq/marketstore/blob/4811cc6a14a917e97261ce19b223d9b15325037e/utils/io/metadata.go#L85-L87
Cool, let's go back to `WriteCSM()`... https://github.com/alpacahq/marketstore/blob/88008f2a76ec926e9efd2093cb59462d5e8274ba/executor/writer.go#L314-L335
`GetDataShapesWithEpoch()` calls `TimeBucketInfo.GetDataShapes()`, which builds a `DataShapeVector` that we're gonna compare against the same returned from the `ColumnSeries` that we just converted to a `RowSeries`, which is what we're actually going to write to disk.
So if the columns in the `TimeBucketInfo` and the `ColumnSeries` match, we're golden and ready to write to disk the new `RowSeries` we just rendered.
Ok so now `Writer.Write()` gets called with the `RowSeries` and sends a command to another channel to write the data to disk.
So everything should be fine? `Nanoseconds` is written to disk when `isVariableLength` is set, but that's because it always is, even if `isVariableLength` is false?
That seems to fit with the testing comments minus some mysterious precision problem.
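The column comparison in `WriteCSM()` is the step that produced the error in the `Monkey` session earlier. A rough Python picture of that check follows, with shapes as `(name, type)` pairs. `check_columns` and `format_shapes` are hypothetical names, and treating the comparison as an ordered equality is my assumption; only the message format is copied from the server response.

```python
def format_shapes(shapes):
    """Render shapes like the server message does: [{Epoch INT64} {Bid FLOAT32}]."""
    return "[" + " ".join(f"{{{name} {typ}}}" for name, typ in shapes) + "]"


def check_columns(data_shapes, bucket_shapes):
    """Return None when the shapes match, else an error string."""
    if list(data_shapes) == list(bucket_shapes):
        return None
    return (f"unable to match data columns ({format_shapes(data_shapes)}) "
            f"to bucket columns ({format_shapes(bucket_shapes)})")


bucket = [("Epoch", "INT64"), ("Bid", "FLOAT32"), ("Nanoseconds", "INT32")]
data = [("Epoch", "INT64"), ("Bid", "FLOAT32")]
print(check_columns(data, bucket))
```

This lines up with the transcript: `Monkey` was created with `Nanoseconds` as a regular column, so a later write without it cannot match.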
But then I found this rewritebuffer.go and started getting worried: https://github.com/alpacahq/marketstore/blob/88008f2a76ec926e9efd2093cb59462d5e8274ba/executor/rewritebuffer.go#L12-L27
Oh man there's more `isVariableLength` stuff :crying_cat_face:
It turns out that's used when reading back data for queries... that explains that test that doesn't work.
So as far as I can tell (which is really really questionable) it looks like `Nanoseconds` values written by the client are always written by `marketstore` to disk regardless of `isvariablelength` (still unclear why that is), and when you read back those same records the re-write buffer is calculating its own `Nanoseconds` (if it needs to?), but iff `isvariablelength=True` do you always read back a `Nanoseconds` field regardless of whether you wrote one in the first place?
Summary
- `isvariablelength` should be documented as an append operation, and then maybe even make that a separate `Client.append()` method?
- `Nanoseconds` are always written as a field if you use `isvariablelength=True` (despite the comments and server code making it super confusing.. :cry:)
- `Nanoseconds` values are written by some client tests storing tick data
- `Nanoseconds` for orders tracking
PS
Sorry about the long write-up, but I tend to want to get to know the projects I'm eyeing up seriously for production use :+1:
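Going back to the `Client.append()` idea from the summary, the smallest version I can picture is a thin wrapper over the existing `write()` call. `append` here is a hypothetical helper, not an existing pymarketstore method, and `RecordingClient` is only a stand-in so the sketch runs without a server. The one thing taken from the sessions above is the `write(records, tbk, isvariablelength=...)` call shape.

```python
def append(client, records, tbk):
    """Hypothetical helper: write `records` to bucket `tbk` as an append."""
    return client.write(records, tbk, isvariablelength=True)


class RecordingClient:
    """Stand-in client for trying the helper without a running server."""

    def __init__(self):
        self.calls = []

    def write(self, records, tbk, isvariablelength=False):
        # Record what would have been sent instead of contacting a server.
        self.calls.append((tbk, isvariablelength, len(records)))
        return {"responses": None}


fake = RecordingClient()
print(append(fake, [(1451642400, 3.0, 140000)], "APPEND/1Sec/TICK"))
```

Having the word "append" in the method name would make the fixed versus variable-length distinction visible at the call site instead of hiding it behind a boolean flag.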