cmallwitz / Financials-Extension

Extension for LibreOffice Calc to access stock market data

fix(yahoo): Use cURL to bypass cookie consent page #46

Closed klausi closed 2 years ago

klausi commented 2 years ago

Problem: Yahoo now displays an EU cookie consent page and the scraping no longer works (at least for me in Europe).

Yahoo's behavior is really strange, because it seems to check the user agent of the request. This works without any cookies:

curl "https://finance.yahoo.com/quote/EUR=X?p=EUR=X" -o test.html

Then test.html contains the correct price page.

Wget does not work and returns a 404???

wget "https://finance.yahoo.com/quote/EUR=X?p=EUR=X" 
--2022-01-06 23:24:07--  https://finance.yahoo.com/quote/EUR=X?p=EUR=X
Resolving finance.yahoo.com (finance.yahoo.com)... 188.125.89.204, 188.125.89.206, 2a00:1288:f03d:1fa::4000, ...
Connecting to finance.yahoo.com (finance.yahoo.com)|188.125.89.204|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2022-01-06 23:24:08 ERROR 404: Not Found.

But then when I fake the User agent to look like cURL it works:

wget -U "curl/7.74.0" "https://finance.yahoo.com/quote/EUR=X?p=EUR=X" 
--2022-01-06 23:24:51--  https://finance.yahoo.com/quote/EUR=X?p=EUR=X
Resolving finance.yahoo.com (finance.yahoo.com)... 188.125.89.206, 188.125.89.204, 2a00:1288:f03d:1fa::4000, ...
Connecting to finance.yahoo.com (finance.yahoo.com)|188.125.89.206|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘EUR=X?p=EUR=X’

EUR=X?p=EUR=X    [  <=>  ] 716,47K  3,37MB/s    in 0,2s

2022-01-06 23:24:53 (3,37 MB/s) - ‘EUR=X?p=EUR=X’ saved [733661]

So it seems like with a user agent that looks like cURL you can bypass the cookie consent wall.

I tried that with the urlopen() call in Python, but setting the cURL user agent did not work for me. As a hacky workaround I just called the cURL binary from Python, and then the prices worked normally again for me.

How can you configure the Python HTTP client to behave the same way as cURL here?
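
For readers following along, here is a minimal sketch of the kind of workaround described above: shelling out to the cURL binary from Python. It is not the code from this pull request. The function names are hypothetical, and it assumes a `curl` executable is available on the PATH.

```python
import subprocess


def build_curl_command(url, curl_binary="curl"):
    """Return the argument list for a quiet cURL download (hypothetical helper).

    -s silences progress output, -L follows redirects, and --fail makes
    cURL exit with a non-zero status on HTTP errors such as 404.
    """
    return [curl_binary, "-s", "-L", "--fail", url]


def fetch_with_curl(url, timeout=30):
    """Download a page by spawning cURL and return the body as text."""
    result = subprocess.run(
        build_curl_command(url),
        capture_output=True,  # collect stdout/stderr instead of printing them
        check=True,           # raise CalledProcessError on a non-zero exit code
        timeout=timeout,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the command as a list means characters such as `?` and `=` in the URL need no shell quoting. The cost is one extra process per lookup.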

cmallwitz commented 2 years ago

The Python HTTP client should be able to do whatever cURL is doing.

I have tried using the project's example.ods file while connected via VPN to DE and AT, and Yahoo seems to be working fine. So even if Yahoo fiddled with its EU cookie handling (quite likely), how is the problem manifesting itself? What would I need to reproduce it?

klausi commented 2 years ago

The problem manifests in LibreOffice as formulas not being able to calculate anything, for example =GETREALTIME("EUR=X",21,"YAHOO"). When executing the tests with python3 -m unittest discover src I see test failures like this:

FAIL: test_realtime_UK_ETF (test_yahoo.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "Financials-Extension/src/test_yahoo.py", line 175, in test_realtime_UK_ETF
    self.assertEqual(float, type(s), 'test_realtime_UK_ETF LAST_PRICE {}'.format(s))
AssertionError: <class 'float'> != <class 'NoneType'> : test_realtime_UK_ETF LAST_PRICE None

Then I looked into the downloaded HTML in ~/.financials-extension and found that it contains the Yahoo cookie consent page instead of the target page.

When you are using the DE VPN and call the wget command from above - what response do you get?

If you compare the Yahoo page https://finance.yahoo.com/quote/EUR=X?p=EUR=X via a proxy site you can see that from EU servers you land on the cookie consent page while in the US you see the target price page:

https://eu7.proxysite.com/process.php?d=Kwg1zjk8tgVKD3k2tvLCCY65GbJH1%2BliRQo8gxe1cPrpZp5yC6YJ8RJyTA5llBkim2zqg2%2F9psjqcUjbcgxdtWze2JVeK5EwHLTdNZiC44T0XsDFfw42OLUA7s5UzBFNMX2x&b=1

https://us1.proxysite.com/process.php?d=Kwg1zjw6thdBAmg2tvLCCY65GbJH1%2B4lBR02wD6FQbPyNoA8K50v8zk%3D&b=1&f=norefer

I think there are 2 things going on at the Yahoo site:

  1. Geosplitting: different pages returned for US and EU
  2. User agent sniffing: different pages returned depending on whether you have "cURL" in the user agent or not
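
One way to check the user agent theory from Python is to request the same URL with several User-Agent strings and compare the HTTP status codes. This is only a rough diagnostic sketch: the function names and the injectable `opener` parameter are hypothetical, and as noted earlier in this thread, setting a cURL user agent in urlopen() alone was not enough to get the price page.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Create a request that announces the given User-Agent string."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe_user_agents(url, user_agents, opener=urllib.request.urlopen):
    """Return a dict mapping each User-Agent string to the HTTP status it got.

    `opener` defaults to urllib.request.urlopen and can be swapped for a
    stub so the logic can be exercised without network access.
    """
    results = {}
    for user_agent in user_agents:
        try:
            with opener(build_request(url, user_agent)) as response:
                results[user_agent] = response.status
        except urllib.error.HTTPError as err:
            # urlopen raises on 4xx/5xx responses; keep the code for comparison
            results[user_agent] = err.code
    return results
```

A status code alone does not distinguish the consent page from the price page if both return 200, so the response body would still need to be inspected by hand.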

Maybe your VPN test did not work correctly? Did you clear all cookies in your browser before testing from an EU IP address?

Thanks for checking in any case - love your extension. With this hacky workaround I at least got it working again for me :)

cmallwitz commented 2 years ago

Got it! I wasn't running the unit tests from the command line. But once I did, while using VPNs to DE/AT, I could reproduce the problem.

The extension uses a set of hard-coded EU consent cookies that needed updating. The reason this is done instead of doing two or three HTTP round trips is to keep the network overhead to a minimum. That is also better than spawning a separate curl process.

BTW: From AT you may be better off using the FT lookups - they may be closer to you than Yahoo (assuming Yahoo is hosted in the US)

Once I updated the cookies with fresh ones while on VPN, the unit tests passed even when connected to a bunch of EU countries.
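
As a rough illustration of the hard-coded cookie approach, the sketch below attaches a fixed Cookie header to a single request, so no extra round trips to a consent endpoint are needed. It is not the extension's actual code: the cookie names and values are hypothetical placeholders (the real ones are not given in this thread), and the function names are made up for the example.

```python
import urllib.request

# Hypothetical placeholders: the real consent cookie names and values
# used by the extension are not shown in this thread.
CONSENT_COOKIES = {
    "EXAMPLE_CONSENT": "placeholder-value",
    "EXAMPLE_REGION": "placeholder-value",
}


def build_cookie_header(cookies):
    """Serialize a dict of cookies into a single Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def build_request_with_consent(url, cookies=CONSENT_COOKIES):
    """Create a request that already carries the consent cookies."""
    return urllib.request.Request(
        url, headers={"Cookie": build_cookie_header(cookies)}
    )


def fetch_page(url, timeout=30):
    """Fetch the page in one HTTP round trip, sending the fixed cookies."""
    with urllib.request.urlopen(build_request_with_consent(url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

The trade-off is the one seen in this issue: fixed cookies keep each lookup to one request, but they have to be refreshed by hand when the site changes its consent handling.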

You can update your repo and/or use the new 3.0.6 release available from GitHub.

Thank the heavens for NordVPN :-)

klausi commented 2 years ago

3.0.6 works, thanks a lot!

I see you updated a different combination of cookies than the one I tried, glad it works! I was just pissed off that a cURL request from the command line would give me the desired page, so I did not try further and shoved the subprocess call in :-D

FT lookups: Will try them the next time I see problems with Yahoo!