atreadw1492 / yahoo_fin

Scrape stock price history from new (Spring 2017) Yahoo Finance layout
MIT License

Can't Async/Await yahoo_fin api calls #83

Open mpaccione opened 2 years ago

mpaccione commented 2 years ago

I am using FastAPI, and it seems I can't directly await the API calls because I receive this error:

object numpy.float64 can't be used in 'await' expression

It would be very helpful to be able to await these:

    price = await si.get_live_price(symbol)
    chain = (await options.get_options_chain(symbol, expiration))[optionType]
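For what it's worth, one way to await the existing synchronous functions without modifying the library is to push them onto a worker thread; a minimal sketch, assuming Python 3.9+ for asyncio.to_thread:

    import asyncio

    from yahoo_fin import options
    from yahoo_fin import stock_info as si

    async def fetch_quote(symbol: str, expiration: str, option_type: str):
        # get_live_price and get_options_chain block on network I/O, so run
        # them in the default thread pool instead of awaiting them directly.
        price = await asyncio.to_thread(si.get_live_price, symbol)
        chain = await asyncio.to_thread(options.get_options_chain, symbol, expiration)
        return price, chain[option_type]

This keeps the event loop free without touching yahoo_fin's internals, at the cost of one thread per in-flight call.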
mpaccione commented 2 years ago

Okay, this wasn't too difficult...

I changed the requests calls to aiohttp calls. If you want, I can probably make some time to open a PR and add the calls so that people can import either the standard versions or the async versions for FastAPI.
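A rough sketch of that kind of swap; the URL and helper names here are illustrative, not yahoo_fin's actual internals:

    import asyncio
    import aiohttp

    # Illustrative stand-in for whatever page the sync code fetches with requests.get.
    QUOTE_URL = "https://finance.yahoo.com/quote/{ticker}"

    async def fetch_page(session: aiohttp.ClientSession, ticker: str) -> str:
        async with session.get(QUOTE_URL.format(ticker=ticker)) as resp:
            resp.raise_for_status()
            return await resp.text()  # parse this the same way the sync version does

    async def fetch_many(tickers):
        async with aiohttp.ClientSession() as session:
            pages = await asyncio.gather(*(fetch_page(session, t) for t in tickers))
        return dict(zip(tickers, pages))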

Let me know @atreadw1492, thanks!

dss010101 commented 2 years ago

How much faster is it? I was thinking of making some modifications to use modin and ray to see if that can speed things up.

mpaccione commented 2 years ago

Hey, it was incredibly fast. In fact, too fast, lol. I was using asyncio to make these calls, but I believe I was hitting some rate limiting on Yahoo; basically, the pages wouldn't all finish loading.

I decided to keep it sync but use RabbitMQ and batch the requests out to a queue. This way I got better performance while still having the response content complete enough to be parsed.
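A minimal sketch of that producer/consumer batching pattern with pika; the queue name and callback are illustrative:

    import pika
    from yahoo_fin import stock_info as si

    # Producer: enqueue ticker requests instead of scraping inline.
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue="ticker_requests", durable=True)
    for symbol in ("AAPL", "MSFT", "NFLX"):
        channel.basic_publish(exchange="", routing_key="ticker_requests", body=symbol)

    # Consumer: drain the queue at a controlled pace with plain sync calls.
    def handle(ch, method, properties, body):
        symbol = body.decode()
        print(symbol, si.get_live_price(symbol))
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_qos(prefetch_count=1)  # one message at a time caps the scrape rate
    channel.basic_consume(queue="ticker_requests", on_message_callback=handle)
    channel.start_consuming()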


dss010101 commented 2 years ago

Nice, so I guess you are fetching multiple tickers at once. How many at a time? I'm curious about your implementation, which sounds like a producer/consumer pattern. I was able to pull historical data for all NASDAQ-listed companies (~6000+) in just over 7 minutes using modin[ray]. I wonder what it would be like if yahoo_fin also had an async implementation. I don't think I ran into the Yahoo rate limit (or maybe I was throttled and don't know it).
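For reference, the shape of that kind of bulk pull with ray would look something like this; a sketch under assumed defaults, not the exact setup described:

    import ray
    from yahoo_fin import stock_info as si

    ray.init()

    @ray.remote
    def pull_history(ticker):
        # ~20 years of daily data; get_data returns a pandas DataFrame
        try:
            return ticker, si.get_data(ticker, start_date="01/01/2003")
        except Exception:
            return ticker, None  # delisted/missing tickers shouldn't kill the whole run

    tickers = si.tickers_nasdaq()  # ~6000+ symbols
    results = ray.get([pull_history.remote(t) for t in tickers])
    frames = {t: df for t, df in results if df is not None}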

mpaccione commented 2 years ago

The number of concurrent connections was just too limited in what I tried.

First I started from 150 x 2 (the two calls) and worked backwards, but I couldn't get consistent responses even below 10 concurrent requests. After switching to asyncio, it seemed the data-scraping step would execute before the page finished loading. I fiddled around with other scrapers, but it seemed like too much work for now. There were also rate-limiting issues.
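One common way to bound in-flight requests rather than firing them all at once is an asyncio.Semaphore; a sketch, with the limit of 10 taken from the ballpark above:

    import asyncio
    import aiohttp

    async def fetch(session, url, sem):
        async with sem:  # at most `limit` requests in flight at once
            async with session.get(url) as resp:
                return await resp.text()

    async def fetch_all(urls, limit=10):
        sem = asyncio.Semaphore(limit)
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, u, sem) for u in urls))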

Currently I am building an MVP, so I just went with a queue infrastructure and am relying heavily on Redis caching and a database for short- and long-term data. However, I have computations that can tolerate at most 15-minute-old data, so I am a bit hamstrung by this queue for now. I suppose I could use serverless functions for those in the future, but the queue is cheap for ideation and I don't have to fuss around with AWS or DO.
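That 15-minute freshness constraint maps naturally onto a TTL in redis-py; a sketch with an illustrative key scheme:

    import redis
    from yahoo_fin import stock_info as si

    r = redis.Redis()

    def get_price(symbol):
        key = f"price:{symbol}"
        cached = r.get(key)
        if cached is not None:
            return float(cached)
        price = float(si.get_live_price(symbol))
        r.setex(key, 900, price)  # expire after 15 minutes
        return price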

I'm originally a JS dev, so I'm curious about your 6000+ ticker pull. It seems very performant -- granted, I don't know how long the historical period was...

dss010101 commented 2 years ago

> I am building an MVP, so I just went with a queue infrastructure and am relying heavily on Redis caching and a database for short- and long-term data. However, I have computations that can tolerate at most 15-minute-old data, so I am a bit hamstrung by this queue for now. I suppose I could use serverless functions for those in the future, but the queue is cheap for ideation and I don't have to fuss around with AWS or DO.

The data is usually about 20 years of daily data.

I compared performance of:

  1. Simple Python + pandas + parallel processing (concurrent.futures)
  2. Modin (with ray built-in)
  3. Ray + Python

I have settled on the last, as it's more stable and yields better performance than plain Python multiprocessing. I do development on Windows (though my servers are hosted on Linux), and while Modin is great for dealing with large data sets, it has some stability issues on Windows right now, so that became a deal breaker.

Ray allowed me to get better scaling and performance than standard Python multiprocessing. I believe it achieves this by doing something similar to what you're doing: using a centralized cache to avoid performance issues related to cross-process serialization. At one point they were using Redis, but I believe they have now replaced it with a custom GCS with optional backing storage (which could still be Redis if you like).
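That centralized cache is Ray's shared object store; roughly, a large object is put there once and tasks read it without a fresh copy being pickled for every task. A sketch:

    import numpy as np
    import ray

    ray.init()

    big_table = np.random.rand(1_000_000)  # any large, read-only dataset
    table_ref = ray.put(big_table)         # stored once in the shared object store

    @ray.remote
    def summarize(table):
        # The ObjectRef passed in is resolved to the stored object for each
        # task, avoiding per-task serialization across process boundaries.
        return float(table.mean())

    results = ray.get([summarize.remote(table_ref) for _ in range(8)])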

It's a very interesting project that I hope gains traction and support.