akapur / pyiqfeed

Python library for reading DTN's IQFeed
GNU General Public License v2.0

read_market_open in conn.py #9

Closed. erickmiller closed this issue 8 years ago.

erickmiller commented 8 years ago

Is this function supposed to have a return statement in it? I noticed a pylint warning that the expression it executes isn't assigned to anything. I think you're passing this functor around in the QuoteConn class, but I wasn't sure if there is some special sauce going on here that wasn't obvious to me. I was going to test it once the market is open, but figured I'd log this now since I'm just back from a workout and am reviewing the non-error pylint messages more carefully.
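For reference, the pattern pylint is flagging looks roughly like this; a hypothetical sketch to illustrate the warning, not the actual conn.py code, and the field layout is made up:

# Hypothetical sketch of the pylint warning, not the actual conn.py code.
def read_market_open(fields):
    """Meant to report whether the market-open flag in a message is set."""
    bool(int(fields[1]))    # value computed but discarded: pylint expression-not-assigned (W0106)

def read_market_open_with_return(fields):
    """Same logic, but the result is actually returned to the caller."""
    return bool(int(fields[1]))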

akapur commented 8 years ago

Let me check. A few things went weird when I reverted. I haven't been following good git discipline with branches etc. since it used to be just me adding stuff to the project. I'll merge your pull into a branch called erickmiller-changes, add some changes I've been working on, and then merge to master in a few days once I'm happy with the result.

erickmiller commented 8 years ago

Cool, sounds like a plan. I'm also cleaning up some more of the pylint warnings (nothing serious, just little things it's complaining about). pylint spits out way too many messages on almost all code, so I'm trying to be selective and fix the things that are most obvious. Earlier, running pylint conn.py with no args rated the code 6.91/10, and with a few updates it has now improved to 7.29/10. Maybe I should wait to do any more of these PEP 8 related fixes until you've merged, though, so there isn't another merge conflict?

And if you run the current code with: pylint --confidence=HIGH -E -s y conn.py

Then it results in:

Your code has been rated at 10.00/10

Yeah, everything seems to be working well for now so I think I'll wait on any more of these changes until you're done with the merge.

erickmiller commented 8 years ago

Cool, I just sent you a fresh pull request with the few additional PEP 8 related updates that were made this evening. I sent the pull request to your new branch, so this should make it easier for you to do the merge there; here is a link to the new pull request: https://github.com/akapur/pyiqfeed/pull/10 I also closed the old pull request to master since this new one is more recent and was sent to the preferred branch. I'll wait for you to finish the work in the branch and the merge to master before any additional code modifications on my side, to keep things simple.

akapur commented 8 years ago

Quick Q. Is there a reason you implemented the NewsConn class by requesting the news back as XML instead of requesting it back as comma delimited text?

erickmiller commented 8 years ago

Mostly because it just seemed like the cleanest way to code it, since IQFeed supplies XML for news (and doesn't for other data types, as I discovered later), and the XML actually contains keys for the associative arrays (dicts), rather than an order-based parse, which isn't as clear. Not sure it makes a huge difference: the xml lib is part of the standard library, and since Python 3.3 it uses a compiled C version internally, so execution speed shouldn't be too different. Why do you ask? Are you concerned about added processing latency?
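To illustrate what I mean about keys vs. order (a made-up headline record for illustration, not DTN's exact wire format):

import xml.etree.ElementTree as ET

# Made-up headline record, not DTN's exact wire format.
xml_headline = "<news_headline><id>1234</id><symbols>AAPL</symbols><text>Example headline</text></news_headline>"
headline_xml = {child.tag: child.text for child in ET.fromstring(xml_headline)}
# -> {'id': '1234', 'symbols': 'AAPL', 'text': 'Example headline'}: keys come from the data itself

csv_headline = "1234,AAPL,Example headline"
fields = csv_headline.split(",")
headline_csv = {"id": fields[0], "symbols": fields[1], "text": fields[2]}  # keys depend on knowing the field order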

erickmiller commented 8 years ago

I updated BarConn based on your request: https://github.com/akapur/pyiqfeed/pull/8

Regarding the csv vs. xml question -- I'm of the opinion that NewsConn is probably good as-is, at least for me for now, since refactoring out the xml library is a bit more work -- unless you have clear metrics, reasons, or a material benefit from parsing this particular data as a comma-delimited string (csv) instead of using xml.etree.ElementTree, which uses underlying C functions to parse these small snippets of structured xml?

If there is a real, clear benefit to csv then I'm open to it, but metrics would be cool to know on this one. I ran some tests, and there is a small execution speed difference in a very simple string example, so this is probably only partially representative of the real world. Regardless, here are some execution speed differences:

The library took this long to import (this only happens once):

import time
sttm=time.time();import xml.etree.ElementTree; print(time.time()-sttm);

4.05311584473e-06

Then, parsing a small xml document vs. a csv string (into an iterable object), the timing differences were:

XML:

xml_text="<xml><tag>foo</tag><id>bar</id></xml>"
sttm=time.time();xml.etree.ElementTree.fromstring(xml_text); print(time.time()-sttm);

9.20295715332e-05

CSV:

csv_text="csv,foo,bar,test"
sttm=time.time();csv_text.split(",");print(time.time()-sttm);

1.00135803223e-05

This isn't an exact apples-to-apples test and, granted, split() is 9x faster, but an execution speed difference at this level of temporal precision (less than one tenth of one millisecond) would only matter if you were using news for high frequency trading (not sure how that would work), or if NewsConn were locking the process out of getting real-time streaming quotes, which by design it doesn't have to. Ok, gotta run to a meeting; let me know your opinion or thoughts. Maybe I'm wrong and csv vs. xml could make a bigger difference in a real production scenario, also taking bigger packets and network latency into consideration, but I'd still be surprised if there is more than a few milliseconds' difference to query all news for all tickers. And since news timestamps are measured at most at second granularity, this difference seems negligible in this case; but I could be wrong, I haven't tested it that far.
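For slightly steadier numbers than one-off time.time() deltas, here is the same toy comparison using timeit (still not apples to apples, same caveats as above):

import timeit

xml_setup = "import xml.etree.ElementTree as ET; xml_text = '<xml><tag>foo</tag><id>bar</id></xml>'"
csv_setup = "csv_text = 'csv,foo,bar,test'"

# Average per-call time in seconds over 100,000 iterations.
print(timeit.timeit("ET.fromstring(xml_text)", setup=xml_setup, number=100000) / 100000)
print(timeit.timeit("csv_text.split(',')", setup=csv_setup, number=100000) / 100000)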

akapur commented 8 years ago

I don't have news classes in the library, but I do use some news stuff. DTN's implementation of XML, at least in the past, has been flaky/buggy and an afterthought more than a properly engineered feature. It was put in place for "programmers" who basically wanted to pass XML to a control and end up with a news display app. In the one area where XML would actually have made sense, their dynamic fieldsets for quote data, they don't use it, which should tell you something. The penalty in parsing, which for a single item may seem tiny in magnitude, is really over an order of magnitude difference, so if you want to do something real with news items -- say, downloading a large corpus of text and using a large number of news items and some keywords to figure out whether a news-driven volatility jump is in progress or some such -- it's going to matter. And it's not just the CPU cost of your code; IQConnect.exe also starts using a lot more CPU when you ask it for a lot of XML-formatted text.

In general, using something like XML in auto-trading code is something someone who's been around the block in the space a few times wouldn't normally do. What you want for auto-trading is something a) simple and obvious (if your code needs a lot of comments it's not either), b) easy to use, and c) brittle. If the format of something changes you want to crash immediately with an error message that tells you what happened and why, NOT somehow adapt to the new format, even if 99.9% of the time doing so leads to your code doing the right thing. That one exception where you have a subtle problem can bankrupt you in minutes and there is no do-over. Ceremony and tests and procedures don't help you with this. You have to remain paranoid. Even when data actually comes across in a similar dynamic format (like XML), most people who've done this for a while will write code that is deliberately brittle, so the program crashes, and loudly, so someone can check what happened.
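A minimal sketch of what I mean by deliberately brittle, assuming a fixed comma-delimited layout (the field count and names here are made up for illustration):

def parse_quote(line):
    """Parse one comma-delimited quote message; fail loudly if the format changes."""
    fields = line.split(",")
    # Deliberately brittle: if the feed ever sends a different number of fields,
    # crash immediately with a message that says what happened instead of guessing.
    if len(fields) != 4:
        raise RuntimeError("Unexpected quote format (%d fields): %r" % (len(fields), line))
    symbol, bid, ask, tick_time = fields
    return symbol, float(bid), float(ask), tick_time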

erickmiller commented 8 years ago

DTN’s implementation of XML at least in the past has been flaky/buggy and an afterthought more than a properly engineered feature. [...] It’s not just the cpu cost of your code, it’s also that IQConnect.exe starts using a lot more cpu when you ask it for a lot of xml formatted text.

Hmm. Ok, points taken, and that makes sense. That's unfortunate. Another thing to note is that DTN has XML as the default return type for the news API call too, fyi, so I started getting XML from the call during testing before I even began passing in all the optional arguments. Being new to DTN, this would have been great to know before I coded all those functions using xml -- since it was the default, and from a data structure perspective (whether as an engineer, developer, programmer, quant, hacker, or whatever), having access to structured data should always lead to more stable and readable code.

It’s not just the penalty in parsing which for a single item may seem tiny in magnitude but really over an order of magnitude difference so if you want to do something real with news items like say downloading a large corpus of text and using a large number of news items and some keywords to figure out if a news driven volatility jump is in progress or some such, it’s going to matter.

Again, I'm on the same page here, but it's important to note that your example above isn't totally real-world, because large corpuses aren't really available and DTN separates the headline query from the story-body query, so you never make a single query that returns really large blocks of xml, just fyi. But again, points taken, and everything you said makes sense, although I'm still speculating that we're talking about a difference of milliseconds on timestamps that are measured in seconds; but I digress. I'll take a look at this further -- I mean, why use something that is theoretically 9x slower, right? Even if it is at the milli/microsecond scale...

What you want for auto-trading is something a) simple and obvious (if your code needs a lot of comments it’s not either)

I would generally agree, except sometimes there are small exceptions that are far better for multi-developer maintenance. In the case of doc strings for public API functions, though, comments serve a very useful purpose, and it is a minor violation of PEP standards to leave them out; leaving them out makes learning a new API time-consuming at best, regardless of how clear the code seems to the author, and leads to error-prone implementations at worst. Like a comment about not using xml, haha. Or the fact that iqfeed does not reliably return news based on the date, for example -- that's good info for people to know. Doc strings, IMHO, should be on all exposed public API classes; the Python standard library is a good reference point for this, and virtually all auto-doc tools (Sphinx, pydoc, and others) use the doc string to seamlessly generate API docs.
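For example, even a short doc string on a public method gives readers and the auto-doc tools something to work with (hypothetical method and signature, not the actual NewsConn API):

def request_news_headlines(self, symbols):
    """Request news headlines for the given symbols.

    Note: IQFeed does not reliably filter headlines by date, and DTN
    appears to cap the number of headlines returned per request.
    """
    ...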

If the format of something changes you want to crash immediately with an error message that tells you what happened and why, NOT somehow adapt to the new format even if 99.9% of the time doing so leads to your code doing the right thing. That one exception where you have a subtle problem can bankrupt you in minutes and there is no do-over.

Yeah, totally agreed on this. But that's probably more valid for entering new positions or closing losing positions, right? I mean, hopefully that 0.1% crash doesn't happen when you need to submit a sell order on a winning long position or close out a winning short position. That would be my only concern with always crashing the entire algo when there is a data discrepancy or the data provider has some small hiccup (but that may not be what you mean). Do you have any suggestions for this case?

Anyway, I'll take a closer look at the news xml code today and figure out how much time it will take to move it to csv, but honestly I'm not sure how much time I'll have to do this; getting this merge pushed is taking more effort than I expected.

Also, what are your thoughts on Python 2.7 compatibility? Unfortunately I have a handful of small but critical library dependencies that are still not 3.5 compatible, so I may need it, though I'm trying to work around it currently. What do you think would be the best way to address this, or do you have any thoughts on it?

erickmiller commented 8 years ago

Based on your comments, tonight I implemented a csv version of all of the news processing functions and added an argument to all of the API calls to send in the request type (req_type) as 't' which returns data in csv.

It was pretty easy code to write, and the csv comes nicely pre-processed from the library, so it actually required less logic than the code that used the xml libraries -- which was cool, and encouraging that the csv version should be faster. It still took a couple of hours, but whatever. It's done now. Yay :+1:

I then left the xml versions of the processing functions in the library, so a request type (req_type) can be sent to all the public news api calls as either 'x' for xml or 't' for csv. I then tested the overall execution speed of the exact same code 4 times on the two request types, xml and csv: 100 S&P 500 tickers, requesting all available news headline items (which is a lot; I think DTN caps the returned headlines at 1000 per headlines request), then also requesting the full stories for those news items and the story counts.

Exact same code. Large corpus of text (kind of). I was actually surprised by the outcome. It turned out that the xml version was faster in 3 out of 4 tests and on average over 4 seconds faster. Maybe that's why DTN made XML the default option for the news API? Here are the results:

CSV_TIME_1:  216.09632802009583
XML_TIME_1:  201.02698922157288 (winner)

CSV_TIME_2:  185.613765001297
XML_TIME_2:  185.32312273979187 (winner)

CSV_TIME_3:  186.31582188606262 (winner)
XML_TIME_3:  187.7472047805786

CSV_TIME_4:  186.87847185134888
XML_TIME_4:  184.62888646125793 (winner)

AVERAGE_CSV:  193.7260966897010825
AVERAGE_XML:  189.68155080080032 (overall winner)

On average, due to the very first execution being a bit of an outlier, the XML version of the news api was about 4.0445 seconds faster for a big news request with lots of data. If you factor that first test out though, the results are very similar, but still the xml version appears to be slightly faster for some reason.

I have no idea why this is the case, but those are the numbers. There are all kinds of factors that could be in play here, most likely network latency variability or data-type-dependent delivery speed variability on DTN's side -- but regardless, at this point I think that, at the least, it's inconclusive to assume that the XML version of the news request api is slower than the csv version -- and at most, it could be concluded that the xml version of the news api is somehow faster.

But maybe there is something else going on -- it was the exact same code running on the exact same symbols (with no other local processor load and no other local network load), and the csv parsing required very little logic (mostly just index-ordered lists)... so I really thought csv would be faster. Not sure what else it could be, other than xml just being delivered faster from DTN, because as we already know it is slower to parse XML by some number of milliseconds.

I will leave both versions in the API, leave the req_type='x' as the default since iqfeed also has 'x' as the default for their tcp/ip news api interface -- and leave the request type as an argument to each api function so the user of the API can decide.
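Roughly, this is the shape each public news call takes now (a simplified sketch, not the exact code in the pull request; the helper names are made up):

def request_news_headlines(self, symbols, req_type='x'):
    """Request news headlines, as xml ('x', the default) or comma-delimited text ('t')."""
    raw = self._send_news_request(symbols, req_type)   # made-up transport helper
    if req_type == 'x':
        return self._headlines_from_xml(raw)            # existing xml parsing path
    elif req_type == 't':
        return self._headlines_from_csv(raw)            # new comma-delimited parsing path
    raise ValueError("req_type must be 'x' or 't', got %r" % (req_type,))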

erickmiller commented 8 years ago

Just sent an updated pull request with the new refactored NewsConn supporting comma delimited text: https://github.com/akapur/pyiqfeed/pull/13

Am hopeful that this version will get merged to master soon. :+1:

I'm closing this issue since I fixed the missing return statement in that expression and also refactored NewsConn which were the two major things we discussed in this issue thread.

Thanks Erick