Crumb issue: Add support for (EU) Dataprotection consent Page

jgriessler commented 11 months ago

Is your feature request related to a problem? Please describe. Crumb failures occur when running yahooquery from Europe. Testing shows that, for requests from Europe, Yahoo redirects finance.yahoo.com to a page asking for consent to data usage: https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_6b0b0161-b473-4d30-bc6f-5cdd007600aa

Without acknowledging that page, the subsequent call to get the crumb via https://query2.finance.yahoo.com/v1/test/getcrumb fails.

Describe the solution you'd like Implement a check to see whether Yahoo redirects to the consent page. If it does, send an 'Agree' to that page to get the necessary cookies etc.

Sample code that works (but likely needs some tweaking):

```python
from bs4 import BeautifulSoup  # extra import needed by the consent handling below


def setup_session(session: requests.Session):
    url = "https://finance.yahoo.com"
    try:
        response = session.get(url, allow_redirects=True)
    except SSLError:
        counter = 0
        while counter < 5:
            try:
                session.headers = random.choice(HEADERS)
                response = session.get(url, verify=False)
                break
            except SSLError:
                counter += 1

    if not isinstance(session, FuturesSession):

        # check for and handle the consent page
        # (str.find() returns -1, which is truthy, when there is no match,
        # so test membership instead)
        if 'consent' in response.url:
            logger.debug(f'Redirected to consent page: "{response.url}"')

            soup = BeautifulSoup(response.content, 'html.parser')

            # the consent form carries these values as named <input> fields
            params = {}
            for param in ['csrfToken', 'sessionId']:
                try:
                    params[param] = soup.find('input', attrs={'name': param})['value']
                except Exception as exc:
                    logger.critical(f'Failed to find or extract "{param}" from response. Exception={exc}')
                    return

            logger.debug(f'params: {params}')

            response = session.post(
                'https://consent.yahoo.com/v2/collectConsent',
                data={
                    'agree': ['agree', 'agree'],
                    'consentUUID': 'default',
                    'sessionId': params['sessionId'],
                    'csrfToken': params['csrfToken'],
                    'originalDoneUrl': url,
                    'namespace': 'yahoo'
                })
            # just assume things are fine and session is setup now

        return session

    _ = response.result()
    return session
```
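The redirect check in the sample above can also be isolated into a small helper. This is only a sketch: it assumes the consent flow always ends on the host `consent.yahoo.com` (the host seen in the URL quoted earlier), and `is_consent_redirect` and `CONSENT_HOST` are hypothetical names, not part of yahooquery. Comparing the hostname avoids false positives when the word "consent" merely appears somewhere in a path.

```python
from urllib.parse import urlparse

# Assumption: the consent flow is served from this host, as in the URL quoted above.
CONSENT_HOST = "consent.yahoo.com"


def is_consent_redirect(final_url: str) -> bool:
    """Return True if the final URL (after redirects) points at the consent host."""
    return urlparse(final_url).hostname == CONSENT_HOST
```

It would be called with `response.url` after `session.get(url, allow_redirects=True)`.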

Describe alternatives you've considered I'm not aware of any other solution to work around this for queries from Europe.

Additional context

jirisarri10 commented 11 months ago

When I connect from Spain, Yahoo forces me to accept cookies, and that is the problem. I think it is necessary to press the "OK" button with Selenium. The problem I run into afterwards is that it reports many connections when I go to https://query2.finance.yahoo.com/v1/test/getcrumb

fredrik-corneliusson commented 11 months ago

@jgriessler This is absolutely fantastic, thank you. I just tested your solution locally and I can now access the problematic APIs from Sweden, and I suspect the same holds for the rest of the EU (GDPR-regulated) countries. Are you familiar with forking and making PRs on GitHub? I think that would be a nicer way for others to review and test the solution than manually pasting the code. In any case this is great news: if this is due to regulations and not a Yahoo-specific issue, it will probably continue to work, and keeping it running won't be that much of a game of whack-a-mole. Thanks.

RudyNL commented 11 months ago

It's working fine for me in the Netherlands without a VPN. Thanks @jgriessler for the patch. The instructions are a bit hard to follow, so here they are step by step:

1) Open the file `...../lib/python3.10/site-packages/yahooquery/utils/__init__.py`
2) In the import header of the file, after `# third party`, add `from bs4 import BeautifulSoup`
3) Replace the method `def setup_session(session: requests.Session):` with

def setup_session(session: requests.Session):
    url = "https://finance.yahoo.com"
    try:
        response = session.get(url, allow_redirects=True)
    except SSLError:
        counter = 0
        while counter < 5:
            try:
                session.headers = random.choice(HEADERS)
                response = session.get(url, verify=False)
                break
            except SSLError:
                counter += 1

    if not isinstance(session, FuturesSession):

      # check for and handle the consent page
      # (str.find() returns -1, which is truthy, when there is no match; test membership instead)
      if 'consent' in response.url:
          logger.debug(f'Redirected to consent page: "{response.url}"')

          soup = BeautifulSoup(response.content, 'html.parser')

          params = {}
          for param in ['csrfToken', 'sessionId']:
              try:
                  params[param] = soup.find('input', attrs={'name': param})['value']
              except Exception as exc:
                  logger.critical(f'Failed to find or extract "{param}" from response. Exception={exc}')
                  return

          logger.debug(f'params: {params}')

          response = session.post(
              'https://consent.yahoo.com/v2/collectConsent',
              data={
                  'agree': ['agree', 'agree'],
                  'consentUUID': 'default',
                  'sessionId': params['sessionId'],
                  'csrfToken': params['csrfToken'],
                  'originalDoneUrl': url,
                  'namespace': 'yahoo'
              })
          # just assume things are fine and session is setup now

      return session

    _ = response.result()
    return session
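
For anyone who would rather not add the BeautifulSoup dependency, the field extraction step can be sketched with the standard library alone. This assumes, as the patch above does, that the consent page exposes `csrfToken` and `sessionId` as named `<input>` elements. `HiddenInputParser`, `extract_consent_params` and the sample markup are hypothetical and only for illustration; they are not part of yahooquery.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect the values of <input> elements whose name is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name in self.wanted and attributes.get("value") is not None:
            self.values[name] = attributes["value"]


def extract_consent_params(html, wanted=("csrfToken", "sessionId")):
    """Return {name: value} for the wanted fields, or raise if any is missing."""
    parser = HiddenInputParser(wanted)
    parser.feed(html)
    missing = [name for name in wanted if name not in parser.values]
    if missing:
        raise ValueError(f"consent form is missing fields: {missing}")
    return parser.values


# Hypothetical markup, for illustration only; the real page may differ.
SAMPLE_HTML = """
<form method="post">
  <input type="hidden" name="csrfToken" value="token-123">
  <input type="hidden" name="sessionId" value="session-456">
  <button name="agree" value="agree">Accept</button>
</form>
"""

print(extract_consent_params(SAMPLE_HTML))
```

Raising on a missing field, rather than returning nothing, makes it obvious to the caller that the session was not set up.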

jirisarri10 commented 11 months ago

Thank you, Griessler, Rudy!!!

dpguthrie commented 11 months ago

@jgriessler Really appreciate the solution here! I'll work on putting this in and get it in the next release.

ibart commented 11 months ago

https://consent.yahoo.com/v2/collectConsent is dead, now.


dpguthrie commented 11 months ago

@ibart This is most likely because your browser is making a GET request. The URL that you're using, which is also the one used internally, accepts the POST method with a defined body.
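To make the GET/POST difference concrete, here is a minimal offline sketch using only the standard library. Nothing is sent over the network; it only builds the two request objects and inspects them. The form fields are the ones from the patch in this thread. The `sessionId` and `csrfToken` values are placeholders, since the real ones come from the consent page.

```python
from urllib.parse import urlencode
from urllib.request import Request

CONSENT_URL = "https://consent.yahoo.com/v2/collectConsent"

# Placeholder values: the real sessionId and csrfToken are read from the consent page.
form = {
    "agree": ["agree", "agree"],
    "consentUUID": "default",
    "sessionId": "PLACEHOLDER_SESSION_ID",
    "csrfToken": "PLACEHOLDER_CSRF_TOKEN",
    "originalDoneUrl": "https://finance.yahoo.com",
    "namespace": "yahoo",
}

# doseq=True expands the list, so "agree" appears twice in the encoded body.
body = urlencode(form, doseq=True).encode("ascii")

browser_style = Request(CONSENT_URL)           # no body: this is a GET
patch_style = Request(CONSENT_URL, data=body)  # body present: this is a POST

print(browser_style.get_method())
print(patch_style.get_method())
```

Opening the URL in a browser address bar corresponds to the first request, while the patch sends something like the second.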

jgriessler commented 10 months ago

Thanks everyone for moving this forward (and of course Doug for getting the functionality in) while I was distracted with personal stuff. I've not yet played with GitHub, so I would only mess things up trying to fork and work on a PR.

One other comment: I noticed that things are a little slower now when querying data. I assume that's because finance.yahoo.com is just huge, so loading the main site takes time. Going through the consent step for every query is also quite some overhead if you run a series of history update queries. So I switched to "reusing" the yq.Ticker() instance, just modifying ticker.symbols. I still create a fresh instance at random, every 30-50 queries.
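The reuse pattern described above could be sketched roughly as follows. `RecyclingHolder` is a hypothetical helper, not part of yahooquery. It takes any zero-argument factory, so it can be tried without network access. The yahooquery usage in the trailing comments is an assumption about how it would be wired up.

```python
import random


class RecyclingHolder:
    """Reuse one object for a random number of uses, then build a fresh one."""

    def __init__(self, factory, min_uses=30, max_uses=50, rng=None):
        self._factory = factory
        self._min_uses = min_uses
        self._max_uses = max_uses
        self._rng = rng or random.Random()
        self._instance = None
        self._remaining = 0
        self.created = 0  # number of instances built so far

    def get(self):
        # Build a new instance on first use or once the current budget is spent.
        if self._instance is None or self._remaining <= 0:
            self._instance = self._factory()
            self._remaining = self._rng.randint(self._min_uses, self._max_uses)
            self.created += 1
        self._remaining -= 1
        return self._instance


# Assumed usage with yahooquery (not executed here):
#   import yahooquery as yq
#   holder = RecyclingHolder(lambda: yq.Ticker("AAPL"))
#   for symbol in symbols:
#       ticker = holder.get()
#       ticker.symbols = symbol
#       history = ticker.history()
```

The session setup, including the consent round trip, then happens once per instance rather than once per query.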

dpguthrie / yahooquery

Crumb issue: Add support for (EU) Dataprotection consent Page #247