Closed mickdewald closed 1 year ago
Hi @micktg, when did you first notice this?
Are the close prices similarly misaligned or is it just the split data?
I first noticed this not so long ago. Maybe a few weeks tops. I read in the changelogs for v2.3.0, that the history method has been refactored, so my guess is, that this might have something to do with that.
The weird thing is, that I could never provoke that behavior (wrong history data respectively wrong split dates) if test it and my code seems to collect the correct data for various other stock symbols as well.
Could certainly be related to v2.3. I've had a look through the changes although nothing's immediately jumping out at me.
Few more questions:
The ideal would be if you could offer a concrete minimal example of how price requests for the same symbol at different times has returned differently indexed data.
Thanks for raising this.
If I run the code (see above) right now, I get the following result
"2022-09-23": 67.95999908447266,
"2022-09-26": 66.30000305175781, <-- Monday (Sunday missing)
"2022-09-27": 67.16999816894531,
"2022-09-28": 68.36000061035156,
"2022-09-29": 64.13999938964844,
"2022-09-30": 63.36000061035156,
"2022-10-03": 66.11000061035156,
This is my databse entry from prd Example AMD in prd
"2022-09-23": 67.95999908447266,
"2022-09-25": 66.30000305175781, <-- Sunday
"2022-09-26": 66.30000305175781, <-- Monday
"2022-09-27": 67.16999816894531,
"2022-09-28": 68.36000061035156,
"2022-09-29": 64.13999938964844,
"2022-09-30": 63.36000061035156,
"2022-10-02": 66.11000061035156,
"2022-10-03": 66.11000061035156,
Somehow in prd there are entries for the weekend. This is in and of itself nothing bad, but this indicated to me, that there might be so different behavior either in the yahoo history response or maybe in your history function than it was before.
So, its NASDAQ, NYSE, and even a German stock exchange. So, this does not seem to be a behavior specific to an exchange.
Here is an excerpt
if (not model.delisted and (not last_update or is_new_day(last_update))):
close_history, splits_temp = get_price_close_history(ticker, model.symbol)
I think I've worked out what's happening.
The data being returned by Ticker.history
is correct and is being indexed as intended.
How daily data is indexed was changed in v2.3 to fix some bugs and provide for more consistent and meaningful indexing. The 'date' level of the index always has dtype 'object'. Closed sessions are represented by a datatime.date
object. If the current session is open, or hasn't long closed, then Yahoo timestamps this with the time of the last trade. I refer to this as a 'live indice'. It's important to be able to distinguish a live indice from a closed session in order to not unwittingly assume that the 'close' is the session close rather than the live price that it actually represents. For this reason a live indice is represented with a datetime.datetime
object that gives the time of the last trade and makes clear the timezone of that time. (For more info on the live indices see this comment on the issue fixed in v2.3).
Why am I saying all this? Your code is using DataFrame.index_reset
. If there's no live indice then it does the following (which is how it will always have worked prior to v2.3):
import yahooquery as yq
amd = yq.Ticker("AMD")
df = amd.history(interval="1d", start="2023-02-01", end="2023-02-04")
df
open | high | low | close | volume | adjclose | ||
---|---|---|---|---|---|---|---|
symbol | date | ||||||
AMD | 2023-02-01 | 78.470001 | 85.480003 | 77.879997 | 84.639999 | 152548200 | 84.639999 |
2023-02-02 | 84.290001 | 88.940002 | 83.199997 | 88.309998 | 97762100 | 88.309998 | |
2023-02-03 | 86.660004 | 88.800003 | 85.830002 | 86.089996 | 66525400 | 86.089996 |
df.reset_index()
symbol | date | open | high | low | close | volume | adjclose | |
---|---|---|---|---|---|---|---|---|
0 | AMD | 2023-02-01 | 78.470001 | 85.480003 | 77.879997 | 84.639999 | 152548200 | 84.639999 |
1 | AMD | 2023-02-02 | 84.290001 | 88.940002 | 83.199997 | 88.309998 | 97762100 | 88.309998 |
2 | AMD | 2023-02-03 | 86.660004 | 88.800003 | 85.830002 | 86.089996 | 66525400 | 86.089996 |
However, if there is a live indice then this is happening...
df = amd.history(interval="1d", start="2023-02-01", end="2023-02-07")
df
open | high | low | close | volume | adjclose | ||
---|---|---|---|---|---|---|---|
symbol | date | ||||||
AMD | 2023-02-01 | 78.470001 | 85.480003 | 77.879997 | 84.639999 | 152548200 | 84.639999 |
2023-02-02 | 84.290001 | 88.940002 | 83.199997 | 88.309998 | 97762100 | 88.309998 | |
2023-02-03 | 86.660004 | 88.800003 | 85.830002 | 86.089996 | 66525400 | 86.089996 | |
2023-02-06 12:15:01-05:00 | 84.629997 | 86.278999 | 83.820000 | 84.174004 | 29580744 | 84.174004 |
df.reset_index()
symbol | date | open | high | low | close | volume | adjclose | |
---|---|---|---|---|---|---|---|---|
0 | AMD | 2023-01-31 19:00:00-05:00 | 78.470001 | 85.480003 | 77.879997 | 84.639999 | 152548200 | 84.639999 |
1 | AMD | 2023-02-01 19:00:00-05:00 | 84.290001 | 88.940002 | 83.199997 | 88.309998 | 97762100 | 88.309998 |
2 | AMD | 2023-02-02 19:00:00-05:00 | 86.660004 | 88.800003 | 85.830002 | 86.089996 | 66525400 | 86.089996 |
3 | AMD | 2023-02-06 12:15:01-05:00 | 84.629997 | 86.278999 | 83.820000 | 84.174004 | 29580744 | 84.174004 |
It appears that on creating a column from the 'date' level pandas is coercing the dtype from 'object' to a tz-aware dtype based on the live indice, in this case datetime64[ns, America/New_York]
. The consequence is that all the date objects are converted to tz-aware values. Your key.strftime('%Y-%m-%d')
code is then wrongly assuming the day as the day prior to the day that the indice refers to. (Although this wouldn't explain misalignment of stocks listed on a German exchange, in which case the times would move forward rather than back).
The issue is not with yahooquery
but with letting pandas coerce the dtype. Instead, if the dataframe has data for only a single symbol then just drop the 'symbol' level with df.droplevel(0)
and then access the remaining index directly. Alternatively the date level of the index can be directly accessed with df.index.levels[1]
.
Hope that helps.
Foremost, thank you for the effort and the detailed explanation.
What I have not quite understood yet is: How can I provoke that a live index is displayed to debug my code? Do I just have to wait for the exchanges to open?
Furthermore, I will take your advice into account and do without df.reset_index()
and test the other method. Calling df.reset_index()
has always felt like a hack to me, only at the time it was the only way I found to achieve what I want.
No worries.
What I have not quite understood yet is: How can I provoke that a live index is displayed to debug my code? Do I just have to wait for the exchanges to open?
Yup. Although if it's to play around then just use a symbol that trades 24 hours, for example a future or a currency.
Can you give me some more hints how to get a live index during test? It just does not work for me:
Here some example: ^GDAXI
df: open high low close volume adjclose
symbol date
^GDAXI 1987-12-30 1005.190002 1005.190002 1005.190002 1005.190002 0.0 1005.190002
1988-01-04 956.489990 956.489990 956.489990 956.489990 0.0 956.489990
1988-01-05 996.099976 996.099976 996.099976 996.099976 0.0 996.099976
1988-01-06 1006.010010 1006.010010 1006.010010 1006.010010 0.0 1006.010010
1988-01-07 1014.469971 1014.469971 1014.469971 1014.469971 0.0 1014.469971
... ... ... ... ... ... ...
2023-02-06 15367.040039 15406.929688 15275.570312 15345.910156 54430400.0 15345.910156
2023-02-07 15359.589844 15363.410156 15273.599609 15320.879883 56184100.0 15320.879883
2023-02-08 15450.669922 15486.129883 15371.929688 15412.049805 61344800.0 15412.049805
2023-02-09 15560.150391 15658.559570 15519.629883 15523.419922 66247500.0 15523.419922
2023-02-11 15438.940430 15487.980469 15246.389648 15350.530273 0.0 15350.530273
QM=F
symbol date
QM=F 2002-06-17 26.299999 26.750000 26.250000 26.340000 1917.0 26.340000
2002-06-18 26.450001 26.450001 25.625000 25.650000 0.0 25.650000
2002-06-19 25.500000 26.174999 25.450001 25.525000 1939.0 25.525000
2002-06-20 25.750000 25.975000 25.375000 25.924999 2404.0 25.924999
2002-06-21 25.750000 26.424999 25.225000 25.875000 1804.0 25.875000
... ... ... ... ... ... ...
2023-02-06 73.324997 74.525002 72.250000 74.099998 13629.0 74.099998
2023-02-07 74.500000 77.599998 74.349998 77.150002 15376.0 77.150002
2023-02-08 77.525002 78.574997 77.074997 78.474998 12726.0 78.474998
2023-02-09 78.474998 78.849998 76.525002 78.050003 12726.0 78.050003
2023-02-10 77.599998 80.324997 77.449997 78.599998 8966.0 78.599998
EURUSD=X
symbol date
EURUSD=X 2003-12-01 1.203398 1.204007 1.194401 1.196501 0.0 1.196501
2003-12-02 1.196101 1.210903 1.194600 1.208897 0.0 1.208897
2003-12-03 1.209000 1.213003 1.207700 1.212298 0.0 1.212298
2003-12-04 1.212004 1.214403 1.204398 1.208094 0.0 1.208094
2003-12-05 1.207802 1.219096 1.206593 1.218695 0.0 1.218695
... ... ... ... ... ... ...
2023-02-06 1.079086 1.080030 1.071834 1.079086 0.0 1.079086
2023-02-07 1.073076 1.074600 1.066951 1.073076 0.0 1.073076
2023-02-08 1.072973 1.076021 1.071685 1.072973 0.0 1.072973
2023-02-09 1.071582 1.079040 1.071157 1.071582 0.0 1.071582
2023-02-11 1.074345 1.075384 1.068490 1.070320 0.0 1.070320
I don't get the live indices and cannot recreate the behavior with the shifted dates
I am definitely running on yahooquery==2.3.0
? I get a live interval for all these (at the moment at least).
>>> import yahooquery as yq
>>> yq.__version__
'2.3.0'
ticker = yq.Ticker("QM=F")
ticker.history(start="2023-02-08")
open | high | low | close | volume | adjclose | ||
---|---|---|---|---|---|---|---|
symbol | date | ||||||
QM=F | 2023-02-08 | 77.525002 | 78.574997 | 77.074997 | 78.474998 | 12726 | 78.474998 |
2023-02-09 | 78.474998 | 78.849998 | 76.525002 | 78.050003 | 12726 | 78.050003 | |
2023-02-10 13:18:31-05:00 | 77.599998 | 80.324997 | 77.449997 | 79.349998 | 12263 | 79.349998 |
Are you processing the returned data? Can you include the full code that you ran to get those returns.
I just copied and pasted your snippet:
import yahooquery as yq
print("yq.__version__", yq.__version__)
ticker = yq.Ticker("QM=F")
print(ticker.history(start="2023-02-08"))
yq.__version__ 2.3.0
open high low close volume adjclose
symbol date
QM=F 2023-02-07 74.500000 77.599998 74.349998 77.150002 15376 77.150002
2023-02-08 77.525002 78.574997 77.074997 78.474998 12726 78.474998
2023-02-09 78.474998 78.849998 76.525002 78.050003 12726 78.050003
2023-02-10 77.599998 80.324997 77.449997 79.625000 12381 79.625000
Somehow, it does not display the live indices
That is odd. Also, that call should not return data for 2023-02-07.
What versions of pandas and numpy are you using? If they aren't the latest versions then could you upgrade them and try again.
I'm assuming the yahooquery
library installation is clean, i.e. no local hacks?
Also, could you let me have your return from pd.Timestamp.now()
.
I created a venv
just like in my dev environment like so
# Create virtualenv with desired python version (here v3.11)
virtualenv -p python3.11 venv; source ./venv/bin/activate
# Install all requirements
./venv/bin/python3.11 -m pip install -r requirements.txt
# requirements.txt
# -----------------------------------------------------------------------------
# Financial provider
# -----------------------------------------------------------------------------
# Yahoo Finance API
# https://github.com/dpguthrie/yahooquery
yahooquery==2.3.0
# https://github.com/ranaroussi/yfinance
# yfinance==0.2.3
# -----------------------------------------------------------------------------
# Dev
# -----------------------------------------------------------------------------
# logs
loguru==0.6.0
# formatter
autopep8==2.0.1
# linting
pylint==2.15.9
Here is the result of venv/bin/python -m pip list
Package Version
------------------ ---------
astroid 2.13.2
autopep8 2.0.1
certifi 2022.12.7
charset-normalizer 2.1.1
dill 0.3.6
idna 3.4
isort 5.11.4
lazy-object-proxy 1.9.0
loguru 0.6.0
lxml 4.9.2
mccabe 0.7.0
numpy 1.24.1
pandas 1.5.2
pip 23.0
platformdirs 2.6.2
pycodestyle 2.10.0
pylint 2.15.9
python-dateutil 2.8.2
pytz 2022.7
requests 2.28.1
requests-futures 1.0.0
setuptools 65.6.3
six 1.16.0
tomlkit 0.11.6
tqdm 4.64.1
typing_extensions 4.4.0
urllib3 1.26.13
wheel 0.38.4
wrapt 1.14.1
yahooquery 2.3.0
These are the versions of the asked libs. I did not install them directly. They are installed as dependencies from some other lib
numpy 1.24.1
pandas 1.5.2
No hacks from my side.
yq.__version__: 2.3.0
pd.__version__: 1.5.2
np.__version__: 1.24.1
pd.Timestamp.now(): 2023-02-10 21:37:39.073704
open high low close volume adjclose
symbol date
QM=F 2023-02-07 74.500000 77.599998 74.349998 77.150002 15376 77.150002
2023-02-08 77.525002 78.574997 77.074997 78.474998 12726 78.474998
2023-02-09 78.474998 78.849998 76.525002 78.050003 12726 78.050003
2023-02-11 77.599998 80.324997 77.449997 79.849998 13089 79.849998
I guess the last value should not be the 11th of February. It was 10th of February when I last posted here. I am in Germany, by the way, Central European Time (CET), if that might help find what's going on.
I'm in the same timezone (at least according to the Spanish state), so I don't think it's to do with that.
I've also run it with Python 3.11 and I'm getting back the live indice as before.
Could you run the following and post the print...
import yahooquery as yq
import pandas as pd
symbol = "QM=F"
ticker = yq.Ticker(symbol)
start = yq.utils._convert_to_timestamp("2023-02-08")
end = yq.utils._convert_to_timestamp(None, start=False)
print(f"{start=}, {end=}")
params = {"period1": start, "period2": end, "interval": "1d"}
data = ticker._get_data("chart", params)
index = data[symbol]["timestamp"]
print(f"{index=}")
dti = pd.to_datetime(index, unit="s")
print(f"{dti=}")
import yahooquery as yq import pandas as pd symbol = "QM=F" ticker = yq.Ticker(symbol) start = yq.utils._convert_to_timestamp("2023-02-08") end = yq.utils._convert_to_timestamp(None, start=False) print(f"{start=}, {end=}") params = {"period1": start, "period2": end, "interval": "1d"} data = ticker._get_data("chart", params) index = data[symbol]["timestamp"] print(f"{index=}") dti = pd.to_datetime(index, unit="s") print(f"{dti=}")
Here is the result:
start=1675810800, end=1676065639
index=[1675746000, 1675832400, 1675918800, 1676064967]
dti=DatetimeIndex(['2023-02-07 05:00:00', '2023-02-08 05:00:00',
'2023-02-09 05:00:00', '2023-02-10 21:36:07'],
dtype='datetime64[ns]', freq=None)
These date time values make sense to me
Here again the output from
import yahooquery as yq
import pandas as pd
import numpy as np
print("yq.__version__:", yq.__version__)
print("pd.__version__:", pd.__version__)
print("np.__version__:", np.__version__)
print("pd.Timestamp.now():", pd.Timestamp.now())
ticker = yq.Ticker("QM=F")
print(ticker.history(start="2023-02-08"))
Output
yq.__version__: 2.3.0
pd.__version__: 1.5.2
np.__version__: 1.24.1
pd.Timestamp.now(): 2023-02-10 22:49:02.618941
open high low close volume adjclose
symbol date
QM=F 2023-02-07 74.500000 77.599998 74.349998 77.150002 15376 77.150002
2023-02-08 77.525002 78.574997 77.074997 78.474998 12726 78.474998
2023-02-09 78.474998 78.849998 76.525002 78.050003 12726 78.050003
2023-02-11 77.599998 80.324997 77.449997 79.824997 13296 79.824997
Just to confirm. that there is some difference in the same environment
At least part of the bug (possibly all of it) is in yq.utils._convert_to_timestamp
. It's been raised before (#49). The reason it's working for me does actually come back to the timezone (the system I'm working on is an hour behind you, i.e. GMT, which is making all the difference given the implementation).
I'll add a fix as another commit to the current PR #141.
In the meantime, try changing the yahooquery.utils._convert_to_timestamp
function in your local install to:
def _convert_to_timestamp(date=None, start=True):
if date is not None:
return int(pd.Timestamp(date).timestamp())
if start:
return int(pd.Timestamp("1942-01-01").timestamp())
return int(pd.Timestamp.now().timestamp())
Let me know how you get on!
I did change the function
# def _convert_to_timestamp(date=None, start=True):
# if date is None:
# date = int((-858880800 * start) + (time.time() * (not start)))
# elif isinstance(date, datetime.datetime):
# date = int(time.mktime(date.timetuple()))
# else:
# date = int(time.mktime(time.strptime(str(date), "%Y-%m-%d")))
# return date
def _convert_to_timestamp(date=None, start=True):
print(f"TODO test new '_convert_to_timestamp' function")
if date is not None:
return int(pd.Timestamp(date).timestamp())
if start:
return int(pd.Timestamp("1942-01-01").timestamp())
return int(pd.Timestamp.now().timestamp())
and ran the following code:
import yahooquery as yq
import pandas as pd
import numpy as np
symbol = "QM=F"
ticker = yq.Ticker(symbol)
start = yq.utils._convert_to_timestamp("2023-02-08")
end = yq.utils._convert_to_timestamp(None, start=False)
print(f"{start=}, {end=}")
params = {"period1": start, "period2": end, "interval": "1d"}
data = ticker._get_data("chart", params)
index = data[symbol]["timestamp"]
print(f"{index=}")
dti = pd.to_datetime(index, unit="s")
print(f"{dti=}")
print("yq.__version__:", yq.__version__)
print("pd.__version__:", pd.__version__)
print("np.__version__:", np.__version__)
print("pd.Timestamp.now():", pd.Timestamp.now())
ticker = yq.Ticker("QM=F")
print(ticker.history(start="2023-02-08"))
Result:
TODO test new '_convert_to_timestamp' function
TODO test new '_convert_to_timestamp' function
start=1675814400, end=1676074563
index=[1675832400, 1675918800, 1676066385]
dti=DatetimeIndex(['2023-02-08 05:00:00', '2023-02-09 05:00:00',
'2023-02-10 21:59:45'],
dtype='datetime64[ns]', freq=None)
yq.__version__: 2.3.0
pd.__version__: 1.5.2
np.__version__: 1.24.1
pd.Timestamp.now(): 2023-02-11 00:16:03.289537
TODO test new '_convert_to_timestamp' function
TODO test new '_convert_to_timestamp' function
open high low close volume adjclose
symbol date
QM=F 2023-02-08 77.525002 78.574997 77.074997 78.474998 12726 78.474998
2023-02-09 78.474998 78.849998 76.525002 78.050003 12726 78.050003
2023-02-11 77.599998 80.324997 77.449997 79.824997 13408 79.824997
I found a python snippet to simulate another timezone:
import os, time
print(time.strftime('%X %x %Z'))
os.environ['TZ'] = 'Europe/London'
time.tzset()
print(time.strftime('%X %x %Z'))
When I run the code again, I also got the live indices now:
00:26:44 02/11/23 CET
23:26:44 02/10/23 GMT
symbol date
QM=F 2023-02-08 77.525002 78.574997 77.074997 78.474998 12726 78.474998
2023-02-09 78.474998 78.849998 76.525002 78.050003 12726 78.050003
2023-02-10 16:59:45-05:00 77.599998 80.324997 77.449997 79.824997 13408 79.824997
👍 The data coming back from yahoo is now as required, so the bug seems to be between line 1279 of ticker.py (the call to _historical_data_to_dataframe in ticker.history) and the end of ticker.history. As you've shown, it's appears to be to do with the timezone.
I'll try and have another look over the weekend.
Great! Let me know if I can provide any help.
My guess is, that you should be able to simulate the behavior that I get via setting my timezone like
import os, time
os.environ['TZ'] = 'Europe/Berlin'
time.tzset()
I think I've found it...
It's a bug in the v2.3 implementation. I wrongly assumed that pd.Timestamp.fromtimestamp
converted a timestamp to a pd.Timestamp
based on UTC. By default it actually does the conversion based on the system timezone. As my system aligns with UTC I didn't realize the assumption was wrong. My bad (sorry!), I should have looked at the doc!
I'll add another commit to #141 to fix. With that, I'm hopeful that all queries raised in this issue, and bugs that have been brought to light by it are now resolved. Let me know otherwise.
To get it working in the meantime replace the two lines above in your local install with the following single line:
last_trade = pd.Timestamp.fromtimestamp(timestamp, tz="UTC")
Cheers for raising it.
Awesome job! I am eager to see if that solves my problem.
Thanks for the fix. Can you foresee when there will be a new version of Yahoo Query?
@maread99 @dpguthrie
I'm not sure this fixes the problem (or indeed manifests a new one).
Implementing the workaround has caused history to look 1 day (I assume whatever period you use) further back.
e.g.
t.history(start='2023-02-22', end=None, interval='1d')
symbol date ... BTC-USD 2023-02-21 ...
EDIT: It seems this error only occurs with tickers that daily close/open at midnight. For me, I've confirmed this with both crypto (BTC-USD) and FX (AUDUSD=X).
All other tickers I'm tracking are on the ASX which opens at 10:00 and closes at 16:00 AEDT, and for these, the history function returns the correct data. From a purely speculative observation, it looks as if close times on 24hr tickers are marked as 00:00:00 for time of closing (which is actually 24hrs prior).
Describe the bug I am using v2.3.0 I have a script that loads the historical data like prices and also stock splits once a day.
It has already happened several times that these split data are shifted by one day and then added to my database accordingly wrong. If I load e.g.: the split data from AMZN, the (correct) result is like this
However, it happens (irregularly) that the data are loaded incorrectly, and then the result is like this
In my database, the old (correct) values are merged with the new ones, which then looks accordingly like this
The code to determine the splits is as follows:
Typically, this works very well. I could never provoke it during testing that the data was loaded incorrectly. It happens strangely only in production that at some point wrong split data is suddenly added to the database.
I am not sure now if I have a bug in the code, the historical data from Yahoo is sometimes not correct, or maybe there is a bug in Yahooquery. I am grateful for any advice.