danieldotnl / ha-multiscrape

Home Assistant custom component for scraping (html, xml or json) multiple values (from a single HTTP request) with a separate sensor/attribute for each value. Support for (login) form-submit functionality.
MIT License

v7 doesn't work where v6.8.1 did with form submit - get 401/406 on form_page #346

Closed sphen13 closed 5 months ago

sphen13 commented 6 months ago

Just updated to 7.0 and my multiscrape is no longer working. I am trying to debug as best as possible but I am a bit lost.

config:

multiscrape:
  - resource: 'https://app.smartoilgauge.com/ajax/main_ajax.php'
    log_response: True
    method: POST
    payload: 'action=get_tanks_list&tank_id=0'
    headers:
      X-Requested-With: XMLHttpRequest
      Content-Type: application/x-www-form-urlencoded
    scan_interval: 3600
    form_submit:
      submit_once: False
      resource: 'https://app.smartoilgauge.com/login.php'
      select: ".content-container"
      input:
        username: xxxxx
        user_pass: 'xxxxx'
    sensor:
      - name: smartoiltank
        unique_id: smartoiltank
        value_template: '{{ value_json.tanks[0].sensor_gallons }}'
        unit_of_measurement: "gal"
        attributes:
          - name: tank_name
            value_template: '{{ value_json.tanks[0].tank_name }}'
          - name: last_updated_time
            value_template: '{{ value_json.tanks[0].sensor_rt }}'
          - name: last_updated_timestamp
            value_template: '{{ value_json.tanks[0].last_read }}'
          - name: battery
            value_template: '{{ value_json.tanks[0].battery }}'

what i notice is that in 7.0 the form_page_request_body.txt actually has data:

action=get_tanks_list&tank_id=0

while in 6.8.1 it only has None. I am not sure if that is why the login page is failing.

I then get no _soup file and no cookie within the form_page_response_headers.txt

is there a way to not send the payload to the login form?

for now rolling back to 6.8.1

fyi this is the error response on the form load:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>406 Not Acceptable</title>
</head><body>
<h1>Not Acceptable</h1>
<p>An appropriate representation of the requested resource could not be found on this server.</p>
<p>Additionally, a 406 Not Acceptable
error was encountered while trying to use an ErrorDocument to handle the request.</p>
</body></html>

sphen13 commented 6 months ago

i was digging into the changed code. IF the problem is the payload being included in the form request, it looks like it comes from the new create_http_wrapper function, which includes the payload and is now used by form_submitter. form_submitter used to be based on form_submit_http, which used _create_form_submitter, which did not include the payload.

trying to link to the spot: in __init__.py, the diff (lines 120->127) shows this change: https://github.com/danieldotnl/ha-multiscrape/compare/v6.8.1...v7.0.0#diff-ef3939b67ef606911414862bd6dfd110996f0ce66f2132be54ae782310fd56a6R120-R127

also further down that same file, lines 244-260 show the old form http wrapper which excludes the payload.

hope this helps!

sphen13 commented 6 months ago

i think i have better links here :)

https://github.com/danieldotnl/ha-multiscrape/commit/d7ec3ea7c27ef7ee8c930ed1d64c75bd17ac2cc8#diff-ef3939b67ef606911414862bd6dfd110996f0ce66f2132be54ae782310fd56a6R113-R127

https://github.com/danieldotnl/ha-multiscrape/commit/d7ec3ea7c27ef7ee8c930ed1d64c75bd17ac2cc8#diff-ef3939b67ef606911414862bd6dfd110996f0ce66f2132be54ae782310fd56a6L323-L340

https://github.com/danieldotnl/ha-multiscrape/commit/d7ec3ea7c27ef7ee8c930ed1d64c75bd17ac2cc8#diff-afd1f01beeb448ae56a5d7fd4da63741467f04986d7134e6cd563bc59b335a5dR24-R52

danieldotnl commented 6 months ago

Thanks for submitting the issue and the detailed analysis! Could you provide your debug logs as well? And could you try to remove the payload from the configuration to see if it then goes past the login?

Btw: the payload looks like a parameter string. You could also try to add it behind the resource like: https://app.smartoilgauge.com/ajax/main_ajax.php?action=get_tanks_list&tank_id=0
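
To illustrate the suggestion above: the payload string has the same `key=value&key=value` shape as a URL query string, so it can be parsed and re-attached to the resource URL with Python's standard library. This is only a sketch of the idea, not code from multiscrape, and the helper name `with_query` is made up for the example. A server is free to treat query parameters and a POST body differently, so the two forms are not guaranteed to be interchangeable.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

resource = "https://app.smartoilgauge.com/ajax/main_ajax.php"
payload = "action=get_tanks_list&tank_id=0"

# The payload parses cleanly as a list of (name, value) parameters.
params = parse_qsl(payload)


def with_query(url, extra_params):
    """Return url with extra_params appended to its query string (hypothetical helper)."""
    parts = urlsplit(url)
    query = urlencode(parse_qsl(parts.query) + list(extra_params))
    return urlunsplit(parts._replace(query=query))


url_with_query = with_query(resource, params)
print(url_with_query)
```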

sphen13 commented 6 months ago

aha - so - yes if i remove the payload, the form_page works properly. I did try changing the resource to contain the query instead of a payload and that part did fail. i suppose the ajax page really does want a payload.

i do believe this confirms that the issue is with the payload being supplied to the form_page.

as far as debug logs this is what i have when failing (not sure it really helps though):

2024-04-03 16:51:52.029 DEBUG (MainThread) [custom_components.multiscrape] # Start loading multiscrape
2024-04-03 16:51:52.029 DEBUG (MainThread) [custom_components.multiscrape] # Reload service registered
2024-04-03 16:51:52.029 DEBUG (MainThread) [custom_components.multiscrape.service] Setting up multiscrape integration level services
2024-04-03 16:51:52.029 DEBUG (MainThread) [custom_components.multiscrape] # Start processing config from configuration.yaml
2024-04-03 16:51:52.029 DEBUG (MainThread) [custom_components.multiscrape] # Found no name for scraper, generated a unique name: Scraper_noname_0
2024-04-03 16:51:52.029 DEBUG (MainThread) [custom_components.multiscrape] Scraper_noname_0 # Setting up multiscrape with config:
2024-04-03 16:51:52.029 DEBUG (MainThread) [custom_components.multiscrape] Scraper_noname_0 # Log responses enabled, creating logging folder: /config/multiscrape/scraper_noname_0/
2024-04-03 16:51:52.379 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Initializing http wrapper
2024-04-03 16:51:52.379 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Initializing form submitter
2024-04-03 16:51:52.379 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Creating scraper
2024-04-03 16:51:52.379 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Initializing scraper
2024-04-03 16:51:52.379 DEBUG (MainThread) [custom_components.multiscrape.coordinator] Scraper_noname_0 # Creating ContentRequestManager
2024-04-03 16:51:52.379 DEBUG (MainThread) [custom_components.multiscrape.coordinator] Scraper_noname_0 # Creating coordinator
2024-04-03 16:51:52.379 DEBUG (MainThread) [custom_components.multiscrape.coordinator] Scraper_noname_0 # Scan interval is 1:00:00
2024-04-03 16:51:52.497 DEBUG (MainThread) [custom_components.multiscrape.coordinator] Scraper_noname_0 # New run: start (re)loading data from resource
2024-04-03 16:51:52.497 DEBUG (MainThread) [custom_components.multiscrape.coordinator] Scraper_noname_0 # Deleting logging files from previous run
2024-04-03 16:51:52.866 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Starting with form-submit
2024-04-03 16:51:52.866 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Requesting page with form from: https://app.smartoilgauge.com/login.php
2024-04-03 16:51:52.866 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Executing form_page-request with a GET to url: https://app.smartoilgauge.com/login.php with headers: {'X-Requested-With': 'XMLHttpRequest', 'Content-Type': 'application/x-www-form-urlencoded'}.
2024-04-03 16:51:52.900 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_headers written to file: form_page_request_headers.txt
2024-04-03 16:51:53.007 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_body written to file: form_page_request_body.txt
2024-04-03 16:51:53.254 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Response status code received: 406
2024-04-03 16:51:53.266 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_headers written to file: form_page_response_headers.txt
2024-04-03 16:51:53.275 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_body written to file: form_page_response_body.txt
2024-04-03 16:51:53.275 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Error executing GET request to url: https://app.smartoilgauge.com/login.php.
2024-04-03 16:51:53.288 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_headers_error written to file: form_page_response_headers_error.txt
2024-04-03 16:51:53.301 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_body_error written to file: form_page_response_body_error.txt
2024-04-03 16:51:53.301 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Executing page-request with a POST to url: https://app.smartoilgauge.com/ajax/main_ajax.php with headers: {'X-Requested-With': 'XMLHttpRequest', 'Content-Type': 'application/x-www-form-urlencoded'}.
2024-04-03 16:51:53.304 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_headers written to file: page_request_headers.txt
2024-04-03 16:51:53.327 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_body written to file: page_request_body.txt
2024-04-03 16:51:53.535 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Response status code received: 200
2024-04-03 16:51:53.537 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_headers written to file: page_response_headers.txt
2024-04-03 16:51:53.537 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_body written to file: page_response_body.txt
2024-04-03 16:51:53.537 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Response seems to be json. Skip parsing with BeautifulSoup.
2024-04-03 16:51:53.537 DEBUG (MainThread) [custom_components.multiscrape.coordinator] Scraper_noname_0 # Data successfully refreshed. Sensors will now start scraping to update.
2024-04-03 16:51:53.537 DEBUG (MainThread) [custom_components.multiscrape.coordinator] Finished fetching multiscrape data in 1.041 seconds (success: True)
2024-04-03 16:51:53.548 DEBUG (MainThread) [custom_components.multiscrape.service] Scraper_noname_0 # Setting up multiscrape configuration level services
2024-04-03 16:51:53.949 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # smartoiltank # Setting up sensor
2024-04-03 16:51:53.949 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # smartoiltank # Start scraping to update sensor
2024-04-03 16:51:53.949 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # smartoiltank # Applying value_template only.
2024-04-03 16:51:53.950 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # smartoiltank # Selected: None
2024-04-03 16:51:53.950 DEBUG (MainThread) [custom_components.multiscrape.entity] Scraper_noname_0 # smartoiltank # Start scraping attributes
2024-04-03 16:51:53.950 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # smartoiltank# tank_name # Applying value_template only.
2024-04-03 16:51:53.950 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # smartoiltank# last_updated_time # Applying value_template only.
2024-04-03 16:51:53.950 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # smartoiltank# last_updated_timestamp # Applying value_template only.
2024-04-03 16:51:53.950 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # smartoiltank# battery # Applying value_template only.
2024-04-03 16:51:53.950 DEBUG (MainThread) [custom_components.multiscrape.entity] Scraper_noname_0 # smartoiltank # Updated sensor and attributes, now adding to HA

after removing the payload from the config i do get better results - but as mentioned, the actual ajax page is not being called properly because it does need the payload:

2024-04-03 16:53:48.972 DEBUG (MainThread) [custom_components.multiscrape] # Start loading multiscrape
2024-04-03 16:53:48.972 DEBUG (MainThread) [custom_components.multiscrape] # Reload service registered
2024-04-03 16:53:48.972 DEBUG (MainThread) [custom_components.multiscrape.service] Setting up multiscrape integration level services
2024-04-03 16:53:48.972 DEBUG (MainThread) [custom_components.multiscrape] # Start processing config from configuration.yaml
2024-04-03 16:53:48.972 DEBUG (MainThread) [custom_components.multiscrape] # Found no name for scraper, generated a unique name: Scraper_noname_0
2024-04-03 16:53:48.972 DEBUG (MainThread) [custom_components.multiscrape] Scraper_noname_0 # Setting up multiscrape with config:
2024-04-03 16:53:48.972 DEBUG (MainThread) [custom_components.multiscrape] Scraper_noname_0 # Log responses enabled, creating logging folder: /config/multiscrape/scraper_noname_0/
2024-04-03 16:53:49.436 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Initializing http wrapper
2024-04-03 16:53:49.436 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Initializing form submitter
2024-04-03 16:53:49.436 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Creating scraper
2024-04-03 16:53:49.436 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # Initializing scraper
2024-04-03 16:53:49.436 DEBUG (MainThread) [custom_components.multiscrape.coordinator] Scraper_noname_0 # Creating ContentRequestManager
2024-04-03 16:53:49.436 DEBUG (MainThread) [custom_components.multiscrape.coordinator] Scraper_noname_0 # Creating coordinator
2024-04-03 16:53:49.436 DEBUG (MainThread) [custom_components.multiscrape.coordinator] Scraper_noname_0 # Scan interval is 1:00:00
2024-04-03 16:53:49.717 DEBUG (MainThread) [custom_components.multiscrape.coordinator] Scraper_noname_0 # New run: start (re)loading data from resource
2024-04-03 16:53:49.717 DEBUG (MainThread) [custom_components.multiscrape.coordinator] Scraper_noname_0 # Deleting logging files from previous run
2024-04-03 16:53:49.947 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Starting with form-submit
2024-04-03 16:53:49.947 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Requesting page with form from: https://app.smartoilgauge.com/login.php
2024-04-03 16:53:49.947 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Executing form_page-request with a GET to url: https://app.smartoilgauge.com/login.php with headers: {'X-Requested-With': 'XMLHttpRequest', 'Content-Type': 'application/x-www-form-urlencoded'}.
2024-04-03 16:53:49.971 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_headers written to file: form_page_request_headers.txt
2024-04-03 16:53:50.047 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_body written to file: form_page_request_body.txt
2024-04-03 16:53:50.470 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Response status code received: 200
2024-04-03 16:53:50.475 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_headers written to file: form_page_response_headers.txt
2024-04-03 16:53:50.481 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_body written to file: form_page_response_body.txt
2024-04-03 16:53:50.481 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Parse page with form with BeautifulSoup parser lxml
2024-04-03 16:53:50.640 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # The page with the form parsed by BeautifulSoup has been written to file: form_page_soup.txt
2024-04-03 16:53:50.640 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Try to find form with selector .content-container
2024-04-03 16:53:50.640 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Form looks like this: 
2024-04-03 16:53:50.641 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Finding all input fields in form
2024-04-03 16:53:50.641 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Found the following input fields: {'username': None, 'user_pass': None, 'remember': None, 'ccf_nonce': 'pXth8QUKwJ2TnXzzSNMG'}
2024-04-03 16:53:50.641 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Found form action None and method post
2024-04-03 16:53:50.641 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Merged input fields with input data in config. Result: {'username': 'xxxx', 'user_pass': 'xxxx', 'remember': None, 'ccf_nonce': 'pXth8QUKwJ2TnXzzSNMG'}
2024-04-03 16:53:50.641 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Determined the url to submit the form to: https://app.smartoilgauge.com/login.php
2024-04-03 16:53:50.641 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Submitting the form
2024-04-03 16:53:50.641 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Executing form_submit-request with a post to url: https://app.smartoilgauge.com/login.php with headers: {'X-Requested-With': 'XMLHttpRequest', 'Content-Type': 'application/x-www-form-urlencoded'}.
2024-04-03 16:53:50.664 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_headers written to file: form_submit_request_headers.txt
2024-04-03 16:53:50.671 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_body written to file: form_submit_request_body.txt
2024-04-03 16:53:51.244 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Response status code received: 200
2024-04-03 16:53:51.245 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_headers written to file: form_submit_response_headers.txt
2024-04-03 16:53:51.245 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_body written to file: form_submit_response_body.txt
2024-04-03 16:53:51.245 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Form seems to be submitted successfully (to be sure, use log_response and check file). Now continuing to retrieve target page.
2024-04-03 16:53:51.245 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Executing page-request with a POST to url: https://app.smartoilgauge.com/ajax/main_ajax.php?action=get_tanks_list&tank_id=0 with headers: {'X-Requested-With': 'XMLHttpRequest', 'Content-Type': 'application/x-www-form-urlencoded'}.
2024-04-03 16:53:51.245 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_headers written to file: page_request_headers.txt
2024-04-03 16:53:51.246 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # request_body written to file: page_request_body.txt
2024-04-03 16:53:51.268 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # Response status code received: 200
2024-04-03 16:53:51.268 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_headers written to file: page_response_headers.txt
2024-04-03 16:53:51.269 DEBUG (MainThread) [custom_components.multiscrape.http] Scraper_noname_0 # response_body written to file: page_response_body.txt
2024-04-03 16:53:51.269 DEBUG (MainThread) [custom_components.multiscrape.coordinator] Finished fetching multiscrape data in 1.552 seconds (success: True)
2024-04-03 16:53:51.270 DEBUG (MainThread) [custom_components.multiscrape.service] Scraper_noname_0 # Setting up multiscrape configuration level services
2024-04-03 16:53:51.271 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # smartoiltank # Setting up sensor
2024-04-03 16:53:51.271 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # smartoiltank # Start scraping to update sensor
2024-04-03 16:53:51.272 DEBUG (MainThread) [custom_components.multiscrape.form] Scraper_noname_0 # Exception occurred while scraping, will try to resubmit the form next interval.
2024-04-03 16:53:51.272 DEBUG (MainThread) [custom_components.multiscrape.sensor] Scraper_noname_0 # smartoiltank # On-error, set value to None
2024-04-03 16:53:51.272 DEBUG (MainThread) [custom_components.multiscrape.entity] Scraper_noname_0 # smartoiltank # Start scraping attributes
2024-04-03 16:53:51.272 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # smartoiltank# tank_name # Applying value_template only.
2024-04-03 16:53:51.272 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # smartoiltank# last_updated_time # Applying value_template only.
2024-04-03 16:53:51.272 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # smartoiltank# last_updated_timestamp # Applying value_template only.
2024-04-03 16:53:51.272 DEBUG (MainThread) [custom_components.multiscrape.scraper] Scraper_noname_0 # smartoiltank# battery # Applying value_template only.
2024-04-03 16:53:51.272 DEBUG (MainThread) [custom_components.multiscrape.entity] Scraper_noname_0 # smartoiltank # Updated sensor and attributes, now adding to HA
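
The form steps visible in the second log (find the input fields, merge them with the `input` values from the config, fall back to the page URL when the form has no action) can be sketched roughly as follows. This is an illustration only: multiscrape parses the page with BeautifulSoup and honours the `select` option, while this sketch uses the standard-library `html.parser` and collects every input on the page. The names `InputCollector` and `prepare_form_submit`, the sample HTML, and the nonce value are all assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class InputCollector(HTMLParser):
    """Collect <input> name/value pairs and the <form> action and method."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self.action = None
        self.method = "get"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and attrs.get("name"):
            # Inputs without a value attribute are recorded as None.
            self.fields[attrs["name"]] = attrs.get("value")


def prepare_form_submit(page_url, html, config_input):
    """Return (submit_url, method, data) for submitting the form found in html."""
    collector = InputCollector()
    collector.feed(html)
    # Values from the configuration override what the page provides.
    data = {**collector.fields, **config_input}
    # A form without an action submits back to the page it came from.
    submit_url = urljoin(page_url, collector.action) if collector.action else page_url
    return submit_url, collector.method, data


# Made-up page that mimics the field names seen in the log.
SAMPLE_PAGE = """
<form method="post">
  <input type="text" name="username">
  <input type="password" name="user_pass">
  <input type="checkbox" name="remember">
  <input type="hidden" name="ccf_nonce" value="example-nonce">
</form>
"""

submit_url, method, data = prepare_form_submit(
    "https://app.smartoilgauge.com/login.php",
    SAMPLE_PAGE,
    {"username": "xxxx", "user_pass": "xxxx"},
)
print(submit_url, method, data)
```
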

danieldotnl commented 6 months ago

> i do believe this confirms that the issue is with the payload being supplied to the form_page.

Yes indeed. Http data shouldn't be shared between the form submit and the scrape request. I will work on a fix.
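
A minimal sketch of that separation, for readers following along. It is not the actual fix from the repository, and the names `RequestConfig` and `build_configs` are hypothetical. The idea is that the scrape request and the form-page request each get their own settings object, so the scrape payload cannot reach the login page. Whether headers should also be kept separate is a design decision this sketch does not try to settle; here nothing is shared.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RequestConfig:
    """Settings for one kind of HTTP request (hypothetical, for illustration)."""

    resource: str
    method: str = "GET"
    payload: Optional[str] = None
    headers: dict = field(default_factory=dict)


def build_configs(conf):
    """Build independent configs for the scrape request and the form page."""
    scrape = RequestConfig(
        resource=conf["resource"],
        method=conf.get("method", "GET"),
        payload=conf.get("payload"),
        headers=dict(conf.get("headers", {})),
    )
    # The form page is fetched with a plain GET and no body. Nothing is
    # inherited from the scrape request settings.
    form_page = RequestConfig(resource=conf["form_submit"]["resource"])
    return scrape, form_page


conf = {
    "resource": "https://app.smartoilgauge.com/ajax/main_ajax.php",
    "method": "POST",
    "payload": "action=get_tanks_list&tank_id=0",
    "form_submit": {"resource": "https://app.smartoilgauge.com/login.php"},
}
scrape, form_page = build_configs(conf)
print(scrape.payload)     # payload stays on the scrape request
print(form_page.payload)  # the form page request has no body
```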

danieldotnl commented 5 months ago

Fixed in v7.0.2! Let me know if it works for you.

sphen13 commented 5 months ago

indeed it works! thank you!

glkx commented 1 month ago

I was running into a similar issue. After a lot of head scratching I found the solution in the release notes. I'd like to suggest including the change in the readme file :) It might help some future folks solve their problem faster. Thanks for all the updates!