atreadw1492 / yahoo_fin

Scrape stock price history from new (Spring 2017) Yahoo Finance layout
MIT License

Historical data shifts by a day somewhat erratically #71

Open Zepolak opened 2 years ago

Zepolak commented 2 years ago

Intro

"Somewhat erratically" means there is no visible logical trigger, but the behaviour is consistent over time (I have been running the same trials over the last few months). It took me several hours to understand what was really going on here and to describe it properly. The immediate error one will witness is that data is missing for some Fridays (and an exception is raised when trying to retrieve it). The problem runs deeper, though: it makes the retrieval of historical data unreliable, with a possible shift of one day. One day is not much on average, but if there is a spike or a crash, it can matter a lot.

Here is code to reproduce the issue, along with my explanations

import yahoo_fin.stock_info
import datetime

def reproduce_issue(ticker, date_iso_format):
    date_iso = datetime.date.fromisoformat(date_iso_format)
    date_exception = date_iso.strftime("%m/%d/%Y")
    date_iso -= datetime.timedelta(days=3)
    date_begin_us_format = date_iso.strftime("%m/%d/%Y")
    date_iso += datetime.timedelta(days=6)
    date_end_us_format = date_iso.strftime("%m/%d/%Y")
    panda_data=yahoo_fin.stock_info.get_data(ticker, start_date= date_begin_us_format, end_date= date_end_us_format)
    print(panda_data)
    print("Trying to get value for " + ticker + " on " + date_exception + " - Exception follows")
    panda_data.at[date_exception, 'open']

# I have several dozen examples (see the end), but let's focus on one to reproduce:
ticker = "USDEUR=X"
date_iso_format = datetime.date.fromisoformat("2021-03-11")
date_begin_us_format = date_iso_format.strftime("%m/%d/%Y")
date_iso_format += datetime.timedelta(days=30)
date_end_us_format = date_iso_format.strftime("%m/%d/%Y")
panda_data=yahoo_fin.stock_info.get_data(ticker, start_date= date_begin_us_format, end_date= date_end_us_format)
print(panda_data)

What are we seeing here?

Data before 26 March is in line with https://finance.yahoo.com/quote/EUR%3DX/history?p=EUR%3DX. Data after that is shifted by one day. Indeed, the website has no data for 28 March, which is a Sunday (rightfully so). But the pandas DataFrame does have a row for that day: it holds the value that the website shows for the following day, and the one-day shift starts from there. It becomes obvious if you try to retrieve the values for 2 April, which is a Friday (so the data is on the website and should be returned by the lib):

# Add the following lines after the previous code extract
date_iso_format = datetime.date.fromisoformat("2021-04-02")
exception_date = date_iso_format.strftime("%m/%d/%Y")
print("Trying to get value for 2nd of April - Exception follows")
print("#######################")
panda_data.at[exception_date, 'open'] 
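The weekend-dated row described above can also be spotted without comparing against the website by hand. Here is a minimal sketch, assuming the index of the returned DataFrame yields date-like objects with a `weekday()` method (pandas timestamps do). The helper name `weekend_dates` and the sample dates are hypothetical and only serve as illustration; they are not real Yahoo data.

```python
import datetime

def weekend_dates(dates):
    """Return the entries of `dates` that fall on a Saturday or Sunday."""
    # weekday(): Monday is 0, ..., Saturday is 5, Sunday is 6
    return [d for d in dates if d.weekday() >= 5]

# Synthetic index for illustration only (not real Yahoo data)
sample_index = [
    datetime.date(2021, 3, 25),  # Thursday
    datetime.date(2021, 3, 26),  # Friday
    datetime.date(2021, 3, 28),  # Sunday: should not be in a daily FX history
    datetime.date(2021, 3, 29),  # Monday
]

print(weekend_dates(sample_index))
```

With the reproduction code above, something like `weekend_dates(panda_data.index)` should list the suspicious rows, if any.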

Other examples follow.

All those days are Fridays. I don't know if we can infer anything from that, but the shift likely goes only one way: I didn't find any example for a Monday. All my examples are currencies because of what I needed to code for my taxes, but there's no reason to believe other tickers aren't affected. (In particular, EURAUD=X is over-represented due to a bias in what I do; don't deduce anything from it.)

Examples of other dates failing for ticker EURAUD=X (just uncomment):

#reproduce_issue("EURAUD=X", "2010-06-04")
#reproduce_issue("EURAUD=X", "2011-07-29")
#reproduce_issue("EURAUD=X", "2011-08-05")
#reproduce_issue("EURAUD=X", "2011-09-02")
#reproduce_issue("EURAUD=X", "2011-10-14")
#reproduce_issue("EURAUD=X", "2011-10-28")
#reproduce_issue("EURAUD=X", "2014-08-01")
#reproduce_issue("EURAUD=X", "2015-07-24")
#reproduce_issue("EURAUD=X", "2017-08-25")
#reproduce_issue("EURAUD=X", "2018-05-18")
#reproduce_issue("EURAUD=X", "2018-06-22")
#reproduce_issue("EURAUD=X", "2018-06-29")
#reproduce_issue("EURAUD=X", "2018-08-24")
#reproduce_issue("EURAUD=X", "2019-08-23")
#reproduce_issue("EURAUD=X", "2021-07-30")
# Other example: USDAUD=X
#reproduce_issue("USDAUD=X", "2015-07-17")
#reproduce_issue("USDAUD=X", "2015-08-07")
#reproduce_issue("USDAUD=X", "2015-08-14")
#reproduce_issue("USDAUD=X", "2015-10-16")
#reproduce_issue("USDAUD=X", "2017-04-07")
# Other examples
#reproduce_issue("CADAUD=X", "2015-07-17")
#reproduce_issue("CADAUD=X", "2017-04-07")
#reproduce_issue("CADEUR=X", "2017-04-07")
#reproduce_issue("USDEUR=X", "2017-04-07")
#reproduce_issue("GBPEUR=X", "2018-06-22")
#reproduce_issue("GBPAUD=X", "2018-06-22")

Notice how 2017-04-07 fails for several tickers, as does 2018-06-22.
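The two observations (every failing date is a Friday, and some dates fail for several tickers) can be checked mechanically. This is only a sketch over the dates listed in this issue. The names `failing`, `not_fridays` and `shared` are made up for the illustration, and nothing here calls yahoo_fin.

```python
import datetime
from collections import defaultdict

# (ticker, ISO date) pairs copied from the list in this issue
failing = [
    ("EURAUD=X", "2010-06-04"), ("EURAUD=X", "2011-07-29"),
    ("EURAUD=X", "2011-08-05"), ("EURAUD=X", "2011-09-02"),
    ("EURAUD=X", "2011-10-14"), ("EURAUD=X", "2011-10-28"),
    ("EURAUD=X", "2014-08-01"), ("EURAUD=X", "2015-07-24"),
    ("EURAUD=X", "2017-08-25"), ("EURAUD=X", "2018-05-18"),
    ("EURAUD=X", "2018-06-22"), ("EURAUD=X", "2018-06-29"),
    ("EURAUD=X", "2018-08-24"), ("EURAUD=X", "2019-08-23"),
    ("EURAUD=X", "2021-07-30"),
    ("USDAUD=X", "2015-07-17"), ("USDAUD=X", "2015-08-07"),
    ("USDAUD=X", "2015-08-14"), ("USDAUD=X", "2015-10-16"),
    ("USDAUD=X", "2017-04-07"),
    ("CADAUD=X", "2015-07-17"), ("CADAUD=X", "2017-04-07"),
    ("CADEUR=X", "2017-04-07"), ("USDEUR=X", "2017-04-07"),
    ("GBPEUR=X", "2018-06-22"), ("GBPAUD=X", "2018-06-22"),
]

# Entries whose date is NOT a Friday (weekday() == 4 means Friday)
not_fridays = [
    (ticker, day) for ticker, day in failing
    if datetime.date.fromisoformat(day).weekday() != 4
]
print("Non-Friday entries:", not_fridays)

# Group tickers by date to spot dates that fail for more than one ticker
tickers_by_date = defaultdict(list)
for ticker, day in failing:
    tickers_by_date[day].append(ticker)

shared = {day: t for day, t in tickers_by_date.items() if len(t) > 1}
for day in sorted(shared):
    print(day, shared[day])
```

Dates shared across tickers hint that the trigger is tied to the date itself rather than to a particular currency pair, although this small sample cannot prove it.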

Also, I want to take this opportunity to thank atreadw1492 a lot for this lib, as well as all the other contributors. This is a very commendable project and I am grateful. Big thumbs up!

dss010101 commented 2 years ago

I think this may be similar to what I'm seeing for weekly data: https://github.com/atreadw1492/yahoo_fin/issues/86

Zepolak commented 2 years ago

Yes, that may be, although it would require a bit more digging to confirm. What I am sure of for the bug here is that it is erratic. I haven't found a pattern: not all weeks are shifted.
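For anyone who wants to look for a pattern, one way to scan a longer history is to list the Monday-to-Friday dates that are absent from the returned index. A rough sketch, assuming the index yields `datetime.date` or `datetime.datetime`-like values (pandas timestamps qualify). The helper name `missing_weekdays` is hypothetical, the sample dates are synthetic, and genuine market holidays would show up as false positives.

```python
import datetime

def missing_weekdays(dates):
    """Return Monday-to-Friday dates between min and max of `dates` that are absent."""
    # Normalise datetimes (including pandas timestamps) to plain dates
    present = {
        d.date() if isinstance(d, datetime.datetime) else d
        for d in dates
    }
    day, last = min(present), max(present)
    missing = []
    while day <= last:
        # weekday() < 5 keeps Monday (0) through Friday (4)
        if day.weekday() < 5 and day not in present:
            missing.append(day)
        day += datetime.timedelta(days=1)
    return missing

# Synthetic index for illustration only: a Sunday row is present, a Friday is not
sample_index = [
    datetime.date(2021, 3, 28),  # Sunday
    datetime.date(2021, 3, 29),
    datetime.date(2021, 3, 30),
    datetime.date(2021, 3, 31),
    datetime.date(2021, 4, 1),   # Thursday
    datetime.date(2021, 4, 5),   # Monday
]

print(missing_weekdays(sample_index))
```

Running it as `missing_weekdays(panda_data.index)` over several years and tickers would give the list of affected weeks to compare.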