danieldotnl / ha-multiscrape

Home Assistant custom component for scraping (html, xml or json) multiple values (from a single HTTP request) with a separate sensor/attribute for each value. Support for (login) form-submit functionality.
MIT License
269 stars, 14 forks

Set-Cookies response header support #319

Closed: bjarnekrottje closed this issue 1 week ago

bjarnekrottje commented 7 months ago

Is your feature request related to a problem? Please describe.
Hi all, I am running into a problem with CSRF validation. The config below is used to retrieve charging data from the Eneco eMobility portal. When Multiscrape first retrieves the page on which the form_submit should happen (https://portal.eneco-emobility.com), the response contains Set-Cookie headers. These cookies need to be sent back, but it looks like the form submit does not include them. Would it be possible to implement this?

Describe the solution you'd like
I would like the Multiscrape component to honour the Set-Cookie response headers and store them as cookies. It could then send those cookies with all subsequent requests, such as the login and the data retrieval.

Describe alternatives you've considered
I have looked into extracting the CSRF token from the body manually and sending it as a header, but so far without success.
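To make the alternative above concrete, here is a minimal sketch of pulling hidden form fields out of a page body and turning one into a request header. It uses only the Python standard library. The sample HTML, the field name `csrfmiddlewaretoken`, and the header name `X-CSRFToken` are assumptions (they are Django's defaults, and nothing in this issue confirms what the portal actually uses); `HiddenInputParser` and `hidden_inputs` are hypothetical helpers, not part of Multiscrape.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("type") == "hidden" and "name" in attributes:
            # A missing value attribute is stored as an empty string.
            self.hidden[attributes["name"]] = attributes.get("value") or ""


def hidden_inputs(html):
    """Return a dict of all hidden input fields found in the HTML."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.hidden


# Hypothetical login form; the real portal markup is not shown in this issue.
SAMPLE = """
<form method="post">
  <input type="hidden" name="csrfmiddlewaretoken" value="example-token">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

fields = hidden_inputs(SAMPLE)
# Assumed header name; adjust to whatever the target site expects.
headers = {"X-CSRFToken": fields["csrfmiddlewaretoken"]}
print(headers)
```

Even with the token in a header, a site that also validates the CSRF cookie would still reject the request if the cookie is not sent, which is why cookie support is the more general fix.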

Additional context
Here are some additional files that give more insight into the configuration and the resulting error.

Multiscrape Config file

- name: Eneco eMobility scraper
  resource: "https://portal.eneco-emobility.com/mijnladen/dashboard/"
  scan_interval: 1800
  log_response: True
  form_submit:
    submit_once: True
    resource: "https://portal.eneco-emobility.com"
    select: "body > section > form"
    input:
      username: !secret Eneco_eMobility_Username
      password: !secret Eneco_eMobility_Password
  headers:
    referer: "google.com"
  button:
    - unique_id: eneco_emobility_manual_refresh
      name: Eneco eMobility | Manual Refresh
  sensor:
    - unique_id: eneco_emobility_current_month_charged_at_home
      name: Eneco eMobility | Current month | Charged at Home
      state_class: total_increasing
      device_class: energy
      unit_of_measurement: kWh
      select: "body > section > div.tiles.tiles-6 > div.tile.tile-loaded-home > div.tile-title"
      value_template: "{{ value|replace(',', '.') }}"
      on_error:
        value: "last"
    - unique_id: eneco_emobility_current_month_compensation
      name: Eneco eMobility | Current month | Compensation
      state_class: total_increasing
      device_class: monetary
      unit_of_measurement: EUR
      select: "body > section > div.tiles.tiles-6 > div.tile.tile-compensation > div.tile-title"
      value_template: "{{ value|replace(',', '.')|replace('€', '') }}"
      on_error:
        value: "last"

Form Page Response Headers

Headers([('server', 'nginx'), ('date', 'Wed, 17 Jan 2024 08:18:54 GMT'), ('content-type', 'text/html; charset=utf-8'), ('transfer-encoding', 'chunked'), ('connection', 'keep-alive'), ('vary', 'Accept-Encoding'), ('expires', 'Wed, 17 Jan 2024 08:18:54 GMT'), ('cache-control', 'max-age=0, no-cache, no-store, must-revalidate, private'), ('vary', 'Cookie, Accept-Language'), ('x-robots-tag', 'noindex, nofollow'), ('content-security-policy', "default-src 'none'; connect-src 'self'; font-src 'self'; media-src 'self'; manifest-src 'self'; worker-src 'self'; frame-src 'self'; frame-ancestors 'self'; base-uri 'none'; form-action 'self'; style-src 'self'; script-src 'self'; img-src 'self' data:; block-all-mixed-content"), ('x-frame-options', 'DENY'), ('content-language', 'nl'), ('x-content-type-options', 'nosniff'), ('referrer-policy', 'same-origin'), ('cross-origin-opener-policy', 'same-origin'), ('set-cookie', 'csrftoken=PsSxUNXXXXkM9ktyUB85Ge14KjxxxxY8; expires=Wed, 15 Jan 2025 08:18:54 GMT; Max-Age=31449600; Path=/; SameSite=Lax; Secure'), ('strict-transport-security', 'max-age=31536000; includeSubdomains;'), ('content-encoding', 'gzip')])
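As an illustration of what "honouring" the set-cookie header above would involve, the following sketch parses that header value with the standard library and builds the `Cookie` header a client would send on the next request. This is not how Multiscrape implements it; `cookies_from_set_cookie` and `cookie_header` are hypothetical helpers, and a real client would also need to respect attributes such as Path, Secure and expiry.

```python
from http.cookies import SimpleCookie


def cookies_from_set_cookie(set_cookie_values):
    """Collect name -> value pairs from a list of Set-Cookie header values."""
    jar = {}
    for raw in set_cookie_values:
        parsed = SimpleCookie()
        parsed.load(raw)
        for name, morsel in parsed.items():
            jar[name] = morsel.value
    return jar


def cookie_header(jar):
    """Build the value of the Cookie request header from stored cookies."""
    return "; ".join(f"{name}={value}" for name, value in jar.items())


# Set-Cookie value copied from the response headers in this issue.
set_cookie = (
    "csrftoken=PsSxUNXXXXkM9ktyUB85Ge14KjxxxxY8; "
    "expires=Wed, 15 Jan 2025 08:18:54 GMT; Max-Age=31449600; "
    "Path=/; SameSite=Lax; Secure"
)

jar = cookies_from_set_cookie([set_cookie])
print(cookie_header(jar))
```

The attributes (expires, Max-Age, Path, SameSite, Secure) are metadata for the client; only the `csrftoken=...` pair is sent back to the server.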

Form Submit Response Body

<!DOCTYPE html>
<html lang="en">
<head>
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <meta name="robots" content="NONE,NOARCHIVE">
  <title>403 Forbidden</title>
  <style type="text/css">
    html * { padding:0; margin:0; }
    body * { padding:10px 20px; }
    body * * { padding:0; }
    body { font:small sans-serif; background:#eee; color:#000; }
    body>div { border-bottom:1px solid #ddd; }
    h1 { font-weight:normal; margin-bottom:.4em; }
    h1 span { font-size:60%; color:#666; font-weight:normal; }
    #info { background:#f6f6f6; }
    #info ul { margin: 0.5em 4em; }
    #info p, #summary p { padding-top:10px; }
    #summary { background: #ffc; }
    #explanation { background:#eee; border-bottom: 0px none; }
  </style>
</head>
<body>
<div id="summary">
  <h1>Verboden <span>(403)</span></h1>
  <p>CSRF-verificatie mislukt. Aanvraag afgebroken.</p>

</div>

<div id="explanation">
  <p><small>Meer informatie is beschikbaar met DEBUG=True.</small></p>
</div>

</body>
</html>

bjarnekrottje commented 7 months ago

@danieldotnl Could you point me in the right direction on how to fix this issue? I am happy to look into the code myself, but I am not sure what the best way to achieve this is. Please let me know, thanks in advance!

danieldotnl commented 7 months ago

Bjarne, that would be very welcome! I'm not an expert in http requests but my approach would be as follows:

Good luck!

danieldotnl commented 7 months ago

Also check this PR: https://github.com/danieldotnl/ha-multiscrape/pull/327 Is there a relation?

jeremicmilan commented 7 months ago

I believe my PR will address this problem. I've had a similar issue, where I needed to parse the form-submit output and use the parsed content as headers. Look at the example in the PR. Looking forward to feedback. :)

danieldotnl commented 1 week ago

Please try release 7.1.2 and let me know if it addresses your issue!

bjarnekrottje commented 1 week ago

Hi @danieldotnl,

I had completely forgotten about this, but I tested it with the latest version and it appears to be working. Thank you very much for your work, and you as well @jeremicmilan.

For people looking to implement this themselves to scrape this information from the Eneco eMobility portal, use the following configuration:

- name: Eneco eMobility scraper
  resource: "https://portal.eneco-emobility.com/mijnladen/dashboard/"
  scan_interval: 1800
  log_response: False
  headers:
    referer: "https://portal.eneco-emobility.com/"
  form_submit:
    submit_once: True
    headers:
      referer: "https://portal.eneco-emobility.com/"
    resource: "https://portal.eneco-emobility.com/"
    select: "body > section > form"
    input:
      username: !secret Eneco_eMobility_Username
      password: !secret Eneco_eMobility_Password
  button:
    - unique_id: eneco_emobility_manual_refresh
      name: Eneco eMobility | Manual Refresh
  sensor:
    - unique_id: eneco_emobility_current_month_charged_at_home
      name: Eneco eMobility | Current month | Charged at Home
      state_class: total_increasing
      device_class: energy
      unit_of_measurement: kWh
      select: "body > section > div.tiles.tiles-6 > div.tile.tile-loaded-home > div.tile-title"
      value_template: "{{ value|replace(',', '.') }}"
      on_error:
        value: "last"
    - unique_id: eneco_emobility_current_month_compensation
      name: Eneco eMobility | Current month | Compensation
      state_class: total_increasing
      device_class: monetary
      unit_of_measurement: EUR
      select: "body > section > div.tiles.tiles-6 > div.tile.tile-compensation > div.tile-title"
      value_template: "{{ value|replace(',', '.')|replace('€', '') }}"
      on_error:
        value: "last"

Note: The referer header is required for the validation to succeed. With this configuration (from version 7.1.2 onwards) it works like a charm.
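Some background on why the referer may matter: the 403 page earlier in this issue looks like Django's default CSRF failure page, and Django applies a strict Referer check to HTTPS requests. Assuming the portal behaves that way (nothing in this thread confirms it), the sketch below shows a simplified version of such a check and why the original `referer: "google.com"` would fail it. `referer_is_same_origin` is a hypothetical helper, not Django's actual code.

```python
from urllib.parse import urlsplit


def referer_is_same_origin(referer, request_url):
    """Simplified check: the referer must be an https URL on the request's host."""
    ref = urlsplit(referer)
    target = urlsplit(request_url)
    return ref.scheme == "https" and ref.netloc == target.netloc


login_url = "https://portal.eneco-emobility.com/"

# Value from the original config: no scheme, so no host can be parsed from it.
old_ok = referer_is_same_origin("google.com", login_url)

# Value from the working config.
new_ok = referer_is_same_origin("https://portal.eneco-emobility.com/", login_url)

print(old_ok, new_ok)
```

A real server-side check is stricter and more configurable than this, so treat it only as an explanation of the symptom.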

jeremicmilan commented 1 week ago

I'm glad that the feature helped you. :)

danieldotnl commented 1 week ago

Great to hear and thanks for testing within an hour!