This PR fixes session loading and the "extended_web_scraper" (quotes) example.
Session loading was broken because:
- At a given stage, `ContextHttp.save()` computes a global key `{run_id}:session:{key}` by passing a local key to `make_key()`.
- At the following stage, `ContextHttp.load_session()` retrieves the global key from the previous stage but treats it as if it were a local key and calls `make_key()` on it again.
- The resulting malformed global key does not exist in the store, so the session is never found and a new one is created.
Concretely, this means a successful authentication in the `login` step of a crawler persists, but the following steps can never retrieve it: the session and its cookies are looked up under the wrong key.
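A minimal sketch of the double-`make_key()` bug described above (function names and the in-memory store are simplified stand-ins for the real `ContextHttp` implementation):

```python
# make_key() turns a local key into a global "{run_id}:session:{key}".
def make_key(run_id: str, key: str) -> str:
    return f"{run_id}:session:{key}"

store = {}  # stand-in for the real session storage backend

def save(run_id: str, key: str, session: dict) -> None:
    # save() correctly stores the session under the global key
    store[make_key(run_id, key)] = session

def load_session_buggy(run_id: str, global_key: str):
    # Bug: the caller already holds a global key, but it is treated as
    # local and run through make_key() a second time, producing a key
    # like "run1:session:run1:session:auth" that was never written.
    return store.get(make_key(run_id, global_key))

def load_session_fixed(run_id: str, global_key: str):
    # Fix: look up the global key as-is
    return store.get(global_key)

save("run1", "auth", {"cookie": "abc"})
global_key = make_key("run1", "auth")  # "run1:session:auth"
assert load_session_buggy("run1", global_key) is None        # lookup misses
assert load_session_fixed("run1", global_key) == {"cookie": "abc"}
```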
This bug was silently affecting the "quotes" example, aka `extended_web_scraper`.
In addition, authentication was not succeeding in the first place: the POST request sent by `quotes:login` was missing a hidden input from the login form.
As a result, the website was crawled and scraped, but the scraped pages lacked the "(Goodreads page)" links that only appear when you are logged in.
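A hedged sketch of the login fix: before POSTing credentials, collect every hidden `<input>` from the login form (e.g. a CSRF token) and include it in the form data. The field names and HTML below are illustrative, not taken from the actual site:

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs of hidden inputs in a login page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden":
            self.fields[a.get("name")] = a.get("value", "")

# Illustrative login page; the real form layout may differ.
login_page = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="XYZ123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

parser = HiddenInputCollector()
parser.feed(login_page)

# Merge hidden fields into the POST payload alongside the credentials.
form_data = {"username": "user", "password": "pass", **parser.fields}
assert form_data["csrf_token"] == "XYZ123"
```

Without the hidden field, the server rejects the POST even with valid credentials, which is why the login appeared to succeed structurally but never produced an authenticated session.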