akapur / pyiqfeed

Python library for reading DTN's IQFeed
GNU General Public License v2.0

Made all requested changes #8

Closed: erickmiller closed this 7 years ago

erickmiller commented 7 years ago

- PEP compliance (ran static checking and fixing with pylint and autopep8)
- Refactored DerivConn into BarConn
- Removed DerivativeInfo from the library
- Reverted the potentially obsolete mkt_tm bug fix

I think the PEP compliance changes might mean this needs to be merged manually. I don't think I made any changes that should trigger a conflict other than the PEP compliance (maybe the mkt_tm reversion, not sure), but I don't expect there to be any actual logical clashes.

erickmiller commented 7 years ago

This pull request is obsolete and is superseded by: https://github.com/akapur/pyiqfeed/pull/10

I will close this request now as you are planning to merge the revisions in the new branch.

akapur commented 7 years ago

Another quick question: why is BarConn returning data as a dict?

In the rest of the code, including for example HistoryConn, which returns historical bar data, data comes back as a numpy structured array. The benefit is that data is converted to a structured array once and can then be dumped to HDF5 or a TSDB, or used in numpy without any further data munging, and you probably want to use numpy if you are doing something trading-related in Python anyway. Look at how QuoteConn is implemented: much of the code is just dynamically creating the right numpy structured array type based on a request for a different list of update fields. Implementing BarConn should be dramatically simpler, more like HistoryConn, where historical requests for bar data are returned as an array of HistoryConn.bar_type. The only difference is that in HistoryConn, since the request is for a specific period in the past, there is no need for a listener; the request function simply returns the data.
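To illustrate the pattern (the field names and dtype below are placeholders, not the actual HistoryConn.bar_type definition, so treat this as a sketch rather than the library's code):

```python
import numpy as np

# Placeholder dtype in the spirit of HistoryConn.bar_type; the real field
# names and types in pyiqfeed may differ.
bar_type = np.dtype([
    ("symbol", "S64"),
    ("datetime", "M8[us]"),
    ("open_p", "f8"),
    ("high_p", "f8"),
    ("low_p", "f8"),
    ("close_p", "f8"),
    ("prd_vlm", "u8"),
])

def bars_to_array(parsed_bars):
    """Pack already-parsed bar tuples into one structured array."""
    data = np.empty(len(parsed_bars), dtype=bar_type)
    for i, bar in enumerate(parsed_bars):
        data[i] = bar
    return data

bars = bars_to_array(
    [(b"SPY", np.datetime64("2016-09-14T09:31:00"), 215.1, 215.4, 215.0, 215.3, 120000)]
)
# bars can now be dumped to HDF5 or a TSDB, or consumed by numpy directly.
```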

Thanks.

Ashwin


erickmiller commented 7 years ago

Yeah, I realized this difference in format much later, when I was putting in some of the test example code, and was going to ask about this. In hindsight, it would probably make sense to change the processing function so the data format is consistent...

Basically, I implemented it this way before I grokked the numpy structured array being built in the other functions. To get started, I gutted the logic from one request/read function combo of the Lookup class and followed your class inheritance and class template -- but the way you were building the numpy arrays caused me some errors, so I kept removing code until I had a clean class template that didn't error, and then coded what seemed efficient and sensible from there by looking at the data that was getting passed in. I suppose for consistency it would make sense for this function to return the same data format.

Agreed, the numpy arrays are nice for the reasons you mentioned, but I'm taking the data and extracting lots of features using various libraries (pandas, scikit-learn, and others) before sending it to the database and the algo, so using the arrays directly this way didn't cross my mind as much.

I kind of ruled out HDF5, but OpenTSDB looks super cool. I'm currently running some tests with the time series database called Arctic (uses MongoDB under the hood) -- I'm actually still deciding on the DB. What do you think is better, OpenTSDB or Arctic? Here's a link to the Arctic project on GitHub: https://github.com/manahl/arctic

Also, as an aside -- you can still save and load dicts in HDF5, though I'm not sure how efficient this method is and it's not something I do often: http://deepdish.io/2014/11/11/python-dictionary-to-hdf5/
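For reference, the approach in that post boils down to roughly this (a sketch using the deepdish package; probably fine for small dicts, but not something I'd use for bulk storage):

```python
import numpy as np
import deepdish as dd  # pip install deepdish

cache = {"symbol": "SPY", "closes": np.array([215.3, 215.8, 214.9])}
dd.io.save("cache.h5", cache)      # writes the dict into an HDF5 file
restored = dd.io.load("cache.h5")  # round-trips back to a dict
```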

erickmiller commented 7 years ago

I'm looking at the code right now, and it seems like a relatively isolated, quick change in one function, BarConn._process_bars, to make the return datatype of BarConn consistent with the numpy structured arrays the rest of the library returns in all the other functions. I think it makes a lot of sense to do it, so I'll make one more update to the code right now to implement this change.
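Roughly the shape of the change I have in mind (just a sketch, not the actual pyiqfeed code; the dtype, field layout, and listener method name are assumptions):

```python
import numpy as np

# Assumed minimal bar dtype; the real one would mirror HistoryConn.bar_type.
bar_type = np.dtype([("open_p", "f8"), ("high_p", "f8"),
                     ("low_p", "f8"), ("close_p", "f8"), ("prd_vlm", "u8")])

def _process_bars(fields, listeners):
    """Pack one bar update into a structured array and notify listeners."""
    bar = np.empty(1, dtype=bar_type)
    bar[0] = tuple(float(f) for f in fields[:4]) + (int(fields[4]),)
    for listener in listeners:
        listener.process_latest_bar_update(bar)  # hypothetical listener hook
```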

akapur commented 7 years ago

Why have you ruled out HDF5? It’s been used for years for data sets a couple of orders of magnitude larger than anything used in finance. Every single criticism of HDF5 I’ve heard or read over the years, including on some well-known blogs, has come from people doing things the manual specifically tells you not to do, but it’s a long manual and there is a learning curve, etc. The other thing about HDF5 is that it’s not forgiving. If you do something wrong, it crashes. This is not a bug, it’s a feature; yes, really.

Numpy arrays are what sit underneath pandas, scikit-learn, and pretty much every scientific computing library in Python.

And, no I’m not going to save dicts in HDF5.


erickmiller commented 7 years ago

Good info thanks.

Why have you ruled out HDF5? It’s been used for years for data sets a couple of orders of magnitude larger than anything used in finance.

Oh, yeah, I should clarify: I've only ruled out HDF5 as a primary data storage format. I briefly evaluated HDF5 vs. pickle, and in the (various) cases where I need to locally cache intermediate data (i.e. a big matrix of coefficients, etc.), HDF5 is clearly what I'm using and planning to use more in the future. What I meant is that I've heard of some folks using only raw HDF5 for all data storage and retrieval -- rather than do this, my plan is to use a proven time series DB that is optimized for this purpose, making it more straightforward to add different forms of structured data such as different-frequency fundamental, structured text, economic, and econometric data, etc. I'm currently evaluating Arctic: https://github.com/manahl/arctic and OpenTSDB: https://github.com/OpenTSDB/opentsdb
HDF5 is my choice for intermediate data storage (multi-dimensional data that is regenerated at various frequencies but takes too long to regenerate each time). The thing about my current use case of HDF5 is that most of the data will be pre-loaded into memory prior to execution time anyway, so load-time performance isn't the biggest thing for me to optimize, but HDF5 is still the best format for this purpose. Do you think I'm making the wrong choice to push data from pyiqfeed into a time series database?
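Concretely, my intermediate-cache use of HDF5 is nothing fancier than something like this (a sketch using h5py; the random array is a stand-in for the real coefficient matrices):

```python
import numpy as np
import h5py

coeffs = np.random.rand(5000, 200)  # stand-in for a regenerated coefficient matrix

# Write the cache; gzip is optional since the data gets read back in one shot anyway.
with h5py.File("features_cache.h5", "w") as f:
    f.create_dataset("coeffs", data=coeffs, compression="gzip")

# Reload it in full before execution time.
with h5py.File("features_cache.h5", "r") as f:
    coeffs = f["coeffs"][:]
```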

Every single criticism of HDF5 I’ve heard [...] has come from people who are doing things the manual specifically tells you not to do but it’s a long manual and there is a learning curve etc.

Yeah, I'm not really a critic; I just think it serves a great purpose for static, fixed-format intermediate cached data. If your model changes, then it isn't so great and the data needs to be re-cached, as far as I understand. I'm guessing this reference to "the manual" is a figure of speech, but on the off chance it isn't -- is there an actual body of text you're indirectly referring to here? (Honest question, not trying to be sarcastic lol)

And, no I’m not going to save dicts in HDF5.

Haha I didn't think so :)

The benefit is that data is converted to a structured array once and then can be used to dump to HDF5 or a TSDB or used in numpy without any further data munging and you probably want to use numpy

Yeah, this is a cool part of the implementation for sure! Do you have an opinion on Arctic vs. OpenTSDB vs. HDF5? Arctic: https://github.com/manahl/arctic OpenTSDB: https://github.com/OpenTSDB/opentsdb

Thanks!

akapur commented 7 years ago

HDF5 is used as primary data storage for virtually all particle physics experiments. These are experiments that cost millions of dollars to run and sometimes generate tens of gigs of data a second. So when it comes to something tested and robust, it’s about as good as it gets. Among serious finance users, the library is used as primary data storage behind a number of really serious, closed-source, internal-use-only, "this thing is a major competitive advantage so I’m not going to release it" type databases at large hedge funds and a few banks. It is finicky. If you do something you shouldn’t, it will crash and possibly eat your data without warning. It’s not for amateurs.

Relative to Arctic and OpenTSDB, there is really no comparison. Are you seriously asking for an opinion on something that uses MongoDB as a back end when you are planning to store tick data? If it’s just you using it, raw HDF5 using something like h5py (as opposed to a wrapper like the pandas HDF5 code) will work well. If you use chunking, compression, and the various file drivers intelligently, it will outperform pretty much anything else, possibly by an order of magnitude. If you are worried about doing something silly, write a simple server that you send commands to and which sends data back in a binary format. And of course don’t do something silly like using compression in HDF5 and then storing the data on a compressed ZFS or Btrfs volume.
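To give a rough sketch of the raw h5py route (the tick layout, chunk size, and compression choice here are placeholders to be tuned against real access patterns, not specific recommendations):

```python
import numpy as np
import h5py

tick_dtype = np.dtype([("time", "u8"), ("price", "f8"), ("size", "u4")])

def append_ticks(dset, ticks):
    """Append a structured array of ticks to a resizable dataset."""
    n = dset.shape[0]
    dset.resize((n + len(ticks),))
    dset[n:] = ticks

with h5py.File("ticks.h5", "w") as f:
    # Chunked, compressed, resizable dataset keyed by symbol.
    spy = f.create_dataset("SPY", shape=(0,), maxshape=(None,),
                           dtype=tick_dtype, chunks=(65536,), compression="lzf")
    batch = np.array([(1473861600000000, 215.31, 100)], dtype=tick_dtype)
    append_ticks(spy, batch)
```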

If you want something simpler, read the HDF5 docs carefully until you actually understand them, and then write something yourself. It may take you a month to really grok it, but you will learn more about handling massive amounts of data (particularly the issues that come up in the trenches, and how to save data in a way that enables easy analysis) than most serious database people. It’s not that hard.

The only real competition to HDF5 in my opinion is the data storage systems built into CERN’s ROOT system and kdb. If you can afford kdb and the hardware for it, and can stand the K language and a C API which at one point had exactly two functions with the signatures “void k(…);” and “void g(…);”, buy it. Remember HDF5 is NOT a database. It’s a library that enables you to write a data store. Do you want a pretty web page or something that actually works?


erickmiller commented 7 years ago

Thanks for the candor. Your advice is great and I definitely appreciate the ideas, facts, opinions, etc.

HDF5 is used as primary data storage for virtually all particle physics experiments.

Cool, yeah, as stated I'm 100% on board with using HDF5 for big n-dimensional matrices. That's what I'm using it for too.

Are you seriously asking for an opinion on something that uses MongoDB as a back end when you are planning to store tick data?

Yeah, well, as I understand it, it's in use by several live firms -- I haven't run metrics yet, but according to the Arctic project, "Arctic can query millions of rows per second per client," which seems appropriately optimized for tick data and sounds pretty promising to me... maybe I'm totally wrong...?
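For what it's worth, the Arctic usage I'm testing is roughly this simple (a sketch assuming a local MongoDB instance and the default VersionStore library):

```python
import pandas as pd
from arctic import Arctic  # pip install arctic; needs a running MongoDB

store = Arctic("localhost")
store.initialize_library("bars")  # defaults to a VersionStore library
lib = store["bars"]

bars = pd.DataFrame(
    {"open": [215.1], "high": [215.4], "low": [215.0], "close": [215.3]},
    index=pd.to_datetime(["2016-09-14 09:31:00"]),
)
lib.write("SPY", bars)       # versioned write
item = lib.read("SPY")       # item.data gives the DataFrame back
```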

The HBase-based OpenTSDB project reports that "OpenTSDB allows you to collect thousands of metrics from tens of thousands of hosts and applications, at a high rate (every few seconds)," which also sounds great and nicely scalable, and everyone knows Hadoop is legit, but based on this statement vs. Arctic it doesn't seem massively compelling -- I am going to evaluate it though, so at some point I'll report back metrics if you're interested.

Plus... Arctic was at PyData 2014 and is on YouTube, so it must be legit (joking), but it is an interesting video and they show how Arctic outperformed their legacy HDF5 system: "...a market data system that stores a variety of timeseries-based financial data for research and live trading at a large systematic hedge fund"

Full Video Link with comparison metrics presented: https://www.youtube.com/watch?v=FVyIxdxsyok

It’s not for amateurs.

Ok :) Thanks. Anyway, maybe MongoDB is amateur (maybe), but it's undeniably in use by many time-sensitive, critical enterprise applications, and my unproven assumption, based on the research I've done so far, is that with local file system and local network optimization Arctic looks at least like a feasible option to evaluate. Also, when estimating implementation time, technical debt, code compatibility across the system, etc., it looks like this option could dramatically simplify implementation while increasing the rate of rapid prototyping, which is a positive I'm factoring in for creating new strategies -- so long as runtime performance is real-time optimal. I'm not 100% sure of my total latency threshold right now, but "millions of rows per second per client" sounds like something that is serious and not amateur. Am I totally off base here?

If it’s just you using it, raw HDF5 using something like h5py (as opposed to a wrapper like the pandas HDF5 code) will work well. If you use chunking, compression, and the various file drivers intelligently, it will outperform pretty much anything else, possibly by an order of magnitude.

Cool. Good to know; I will look closer at this option, although despite it not being "hard," I was hoping not to roll my own home-brewed NoSQL-like database, for obvious reasons (time, bugs, reinventing the wheel, etc.).

And of course don’t do something silly like using compression in hdf5 and then storing the data on a compressed zfs or btrfs volume.

For sure. Thanks for the pointer.

kdb and the hardware for it, can stand the K language and a C api

Cool. Yeah, the C API is totally cool, but the rest, maybe not as much. I looked at this for some time, and after factoring in the technical debt, implementation time, ramp-up time, and seemingly asymmetrical cost -- and also hearing a few people say it used to be best-in-class and the only game in town, but now there are several more pythonic equivalents -- I decided to put kdb on the back-burner list of options.

Remember HDF5 is NOT a database

Yeah, of course.

Do you want a pretty web page or something that actually works?

Haha, ok -- well, whether that is rhetorical or not: I need something that works optimally first; data visualization will be a component, albeit a second or third priority.

Thanks again for the tips, advice, etc. -- good stuff. This is a new system I'm building, so it's cool to hear your opinions and wisdom on this stuff.