alpacahq / marketstore

DataFrame Server for Financial Timeseries Data
Apache License 2.0
1.88k stars, 231 forks

What the heck is "isvariablelength" for? Nanoseconds wat? #325

Open goodboy opened 4 years ago

goodboy commented 4 years ago

I'm looking at the relevant server code sections:

You know what's handy, putting the name Append somewhere in the func name :wink:

Secondly,

Also it's probably worth mentioning that there's some kind of relationship with the Nanoseconds column. I got real confused when using the client to do stuff and got weird different numbers back depending on whether I wrote the Nanoseconds field and used isvariablelength (look, the unit tests and me are the same :smiling_face_with_three_hearts:).

That is, if isvariablelength is set:

Ok so let's stop and think here.

We're removing Nanoseconds because, before writing to disk, we convert ColumnSeries -> RowSeries without passing through the rowType flag, which would make NewRowSeries add the 'Nanoseconds' DataShape which we apparently need:

because the read() function for variable types inserts a 32-bit nanoseconds column.

But really it's because we already got the Nanoseconds out and are passing it as a []time.Time to Writer.WriteRecords()?

Uh, ok so I guess because the read() means when GetTime() gets called, or?

Again, comment says we need this Nanoseconds field for "reading" and GetTime() seems to need it for generating a []time.Time output, if there is a Nanoseconds column. Well that's good because (as mentioned in last bullet ^) we are calling it then handing it to Writer.WriteRecords().
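To make the "Epoch plus Nanoseconds" idea concrete, here is a rough Python sketch of how a per-row timestamp could be rebuilt from the two columns. It is not the Go code behind GetTime(); the helper name `to_epoch_ns` and the zero fallback for a missing Nanoseconds column are assumptions made for illustration only.

```python
from typing import List, Optional

NANOS_PER_SECOND = 10**9


def to_epoch_ns(epochs: List[int], nanos: Optional[List[int]] = None) -> List[int]:
    """Combine epoch seconds with optional sub-second nanosecond offsets.

    Hypothetical helper: when no Nanoseconds column is given, every row
    falls back to a zero offset (an assumption, not confirmed server behavior).
    """
    if nanos is None:
        nanos = [0] * len(epochs)
    if len(nanos) != len(epochs):
        raise ValueError("Epoch and Nanoseconds columns must be the same length")
    # Integer nanoseconds since the Unix epoch, one value per row.
    return [sec * NANOS_PER_SECOND + ns for sec, ns in zip(epochs, nanos)]


epoch = 1451642400  # 2016-01-01 10:00:00 UTC, in seconds
print(to_epoch_ns([epoch, epoch], [140000, 140001]))
print(to_epoch_ns([epoch]))
```

Keeping the result as a plain integer avoids the microsecond limit of Python's `datetime`, which would silently drop the last three digits of a nanosecond offset.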

Let's note that the ColumnSeriesMap.FilterColumns() method requires Nanoseconds as part of the index.

Ok where are we again?

:exploding_head:. So this all seems pretty circular.

Alright let's go back to what we were doing. Right, WriteCSM(), we're writing our ColumnSeriesMap to disk! https://github.com/alpacahq/marketstore/blob/88008f2a76ec926e9efd2093cb59462d5e8274ba/executor/writer.go#L282

Ok so if a tbi (TimeBucketInfo) doesn't exist and isVariableLength is set, we're gonna pass recordType=io.VARIABLE to io.NewTimeBucketInfo().

So we have a ColumnSeries with no Nanoseconds DataShape (if isVariableLength is set) and we're making a new TimeBucketInfo with a "variable length" record type, meaning this stuff gets set: https://github.com/alpacahq/marketstore/blob/4811cc6a14a917e97261ce19b223d9b15325037e/utils/io/metadata.go#L85-L87

Cool, let's go back to WriteCSM()...

https://github.com/alpacahq/marketstore/blob/88008f2a76ec926e9efd2093cb59462d5e8274ba/executor/writer.go#L314-L335

So if the columns in the TimeBucketInfo and the ColumnSeries match, we're golden and ready to write to disk the new RowSeries we just rendered.
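A minimal Python sketch of that column check, just to pin down what "match" means here. This is not marketstore's Go implementation: the `check_columns` name, the (name, type) tuple representation, and the order-insensitive comparison are all assumptions. The error text only loosely mimics the message the client gets back.

```python
from typing import List, Tuple

# A column described as (name, type name), e.g. ("Bid", "FLOAT32").
DataShape = Tuple[str, str]


def check_columns(data: List[DataShape], bucket: List[DataShape]) -> None:
    """Raise if the incoming columns differ from the bucket's columns.

    Assumption: column order does not matter, only the set of
    (name, type) pairs has to agree.
    """
    if sorted(data) != sorted(bucket):
        raise ValueError(
            f"unable to match data columns ({data}) to bucket columns ({bucket})"
        )


bucket_cols = [("Epoch", "INT64"), ("Bid", "FLOAT32"), ("Nanoseconds", "INT32")]

# Same columns: passes silently.
check_columns(bucket_cols, bucket_cols)

# Nanoseconds missing from the data: rejected.
try:
    check_columns([("Epoch", "INT64"), ("Bid", "FLOAT32")], bucket_cols)
except ValueError as err:
    print(err)
```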

Ok so now Writer.Write() gets called with the RowSeries and sends a command to another channel to write the data to disk.

So everything should be fine? Nanoseconds is written to disk when isVariableLength is set but that's because it always is even if isVariableLength is false?

That seems to fit with the testing comments minus some mysterious precision problem.

But then I found this rewritebuffer.go and started getting worried: https://github.com/alpacahq/marketstore/blob/88008f2a76ec926e9efd2093cb59462d5e8274ba/executor/rewritebuffer.go#L12-L27

Oh man there's more isVariableLength stuff :crying_cat_face:

It turns out that's used when reading back data for queries...that explains that test that doesn't work.

So as far as I can tell (which is really, really questionable), it looks like Nanoseconds written by the client are always written by marketstore to disk regardless of isvariablelength (still unclear why that is), and when you read back those same records, the re-write buffer is calculating its own Nanoseconds (if it needs to?). But iff isvariablelength=True do you always read back a Nanoseconds field, regardless of whether you wrote one in the first place?

Summary

PS

Sorry about the long write up but I tend to want to get to know the projects I'm eyeing up seriously for production use :+1:

goodboy commented 4 years ago

Maybe also a small demonstration of the client behaviors:

[nav] In [32]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3, 140000)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'), ('Nanoseconds', 'i4')]),'Monkey/1Sec/TICK', isvariablelength=False),
Out[32]: ({'responses': None},)

[nav] In [33]: client.query(pymarketstore.Params('Monkey', '1Sec', 'TICK')).first().df()                                       
Out[33]: 
                           Bid  Nanoseconds
Epoch                                      
2016-01-01 10:00:00+00:00  3.0       140000

[nav] In [34]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3, 140001)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'), ('Nanoseconds', 'i4')]),'Monkey/1Sec/TICK', isvariablelength=False),
Out[34]: ({'responses': None},)

[nav] In [35]: client.query(pymarketstore.Params('Monkey', '1Sec', 'TICK')).first().df()                                       
Out[35]: 
                           Bid  Nanoseconds
Epoch                                      
2016-01-01 10:00:00+00:00  3.0       140001

So if you never wrote Nanoseconds and isvariablelength=False then you don't get it magically created:

[ins] In [45]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'),]),'Monkey_NO_NANO/1Sec/TICK', isvariablelength=False),
Out[45]: ({'responses': None},)

[ins] In [46]: client.query(pymarketstore.Params('Monkey_NO_NANO', '1Sec', 'TICK')).first().df()                               
Out[46]: 
                           Bid
Epoch                         
2016-01-01 10:00:00+00:00  3.0

[nav] In [37]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3, 140000)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'), ('Nanoseconds', 'i4')]),'APPEND/1Sec/TICK', isvariablelength=True)
Out[37]: {'responses': None}

[ins] In [39]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3, 140001)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'), ('Nanoseconds', 'i4')]),'APPEND/1Sec/TICK', isvariablelength=True),
Out[39]: ({'responses': None},)

[ins] In [40]: client.query(pymarketstore.Params('APPEND', '1Sec', 'TICK')).first().df()
Out[40]: 
                           Bid  Nanoseconds
Epoch                                      
2016-01-01 10:00:00+00:00  3.0       140000
2016-01-01 10:00:00+00:00  3.0       140001

But wait let's continue with that and find our magic Nanoseconds created for us always:

[nav] In [41]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'),]),'APPEND/1Sec/TICK', isvariablelength=True),
Out[41]: ({'responses': None},)

[nav] In [42]: client.query(pymarketstore.Params('APPEND', '1Sec', 'TICK')).first().df()                                       
Out[42]: 
                           Bid  Nanoseconds
Epoch                                      
2016-01-01 10:00:00+00:00  3.0            0
2016-01-01 10:00:00+00:00  3.0       140000
2016-01-01 10:00:00+00:00  3.0       140001

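^ That doesn't happen for the Monkey bucket, which was created with isvariablelength=False. Writing a record without Nanoseconds there fails with a column mismatch, even though this call passes isvariablelength=True: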

[ins] In [44]: client.write(np.array([(pd.Timestamp("2016-01-01 10:00:00").value/10**9, 3)], dtype=[('Epoch', 'i8'), ('Bid', 'f4'),]),'Monkey/1Sec/TICK', isvariablelength=True),
Out[44]: 
({'responses': [{'error': 'unable to match data columns ([{Epoch INT64} {Bid FLOAT32}]) to bucket columns ([{Epoch INT64} {Bid FLOAT32} {Nanoseconds INT32}])',
    'version': '34352c9738c9164d7c65264a532d99341c57fae2'}]},)
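
Pulling the sessions above together, here is a toy in-memory model that reproduces the outputs I saw. It is only a guess at the semantics as observed from the client, not marketstore's storage code. The `ToyBucket` class and its methods are hypothetical, and three things are assumptions: variable-length buckets always carry a Nanoseconds column, a missing value defaults to 0, and results come back ordered by (Epoch, Nanoseconds).

```python
from typing import Dict, List, Union

Record = Dict[str, Union[int, float]]


class ToyBucket:
    """In-memory stand-in for one bucket; NOT marketstore's implementation."""

    def __init__(self, columns: List[str], variable_length: bool) -> None:
        self.variable_length = variable_length
        # Assumption: variable-length buckets always have a Nanoseconds column.
        if variable_length and "Nanoseconds" not in columns:
            columns = [*columns, "Nanoseconds"]
        self.columns = list(columns)
        self.rows: List[Record] = []

    def write(self, record: Record) -> None:
        record = dict(record)
        if self.variable_length:
            # Observed: a record written without Nanoseconds reads back as 0.
            record.setdefault("Nanoseconds", 0)
        if set(record) != set(self.columns):
            raise ValueError(
                f"unable to match data columns ({sorted(record)}) "
                f"to bucket columns ({sorted(self.columns)})"
            )
        if not self.variable_length:
            # Observed: a fixed-length bucket keeps one row per Epoch,
            # so a later write to the same second replaces the earlier one.
            self.rows = [r for r in self.rows if r["Epoch"] != record["Epoch"]]
        self.rows.append(record)

    def query(self) -> List[Record]:
        return sorted(self.rows, key=lambda r: (r["Epoch"], r.get("Nanoseconds", 0)))


T = 1451642400  # 2016-01-01 10:00:00 UTC

fixed = ToyBucket(["Epoch", "Bid", "Nanoseconds"], variable_length=False)
fixed.write({"Epoch": T, "Bid": 3.0, "Nanoseconds": 140000})
fixed.write({"Epoch": T, "Bid": 3.0, "Nanoseconds": 140001})
print([r["Nanoseconds"] for r in fixed.query()])  # [140001]

variable = ToyBucket(["Epoch", "Bid", "Nanoseconds"], variable_length=True)
variable.write({"Epoch": T, "Bid": 3.0, "Nanoseconds": 140000})
variable.write({"Epoch": T, "Bid": 3.0, "Nanoseconds": 140001})
variable.write({"Epoch": T, "Bid": 3.0})  # no Nanoseconds supplied
print([r["Nanoseconds"] for r in variable.query()])  # [0, 140000, 140001]
```

In this model the bucket's own flag decides the behavior, not the flag passed on each write. That would explain the error above, but it is inferred from these few sessions only.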