Open andreawwenyi opened 4 years ago
Take https://theinitium.com/article/20200329-culture-chan-chuan-xing-photography-taiwan/ as an example, I think it goes like this:
This probably means we need a custom spider to crawl it.
@pm5 Thanks!
Sounds like a similar scheme to appledaily, so we might be able to use the same login_discover_spider. However, have you been able to successfully send a POST request with the credentials to https://api.theinitium.com/api/v1/auth/login/?language=zh-hant? I see that there's a request payload with keys "email" and "password" accompanying the POST request, but I always get a 401 when doing:

```python
r = requests.post(
    "https://api.theinitium.com/api/v1/auth/login/?language=zh-hant",
    json={"email": <email>, "password": <pwd>},
)
```
@andrea-w-wang from what I see they use basic access authentication (an `Authorization: Basic <token>` header) in all requests to the API, plus an `X-Client-Name: Web` header, presumably for their own apps. So something like:
```python
r = requests.post(
    "https://api.theinitium.com/api/v1/auth/login/?language=zh-hant",
    headers={
        "User-Agent": "Mozilla/5.0",
        "X-Client-Name": "Web",
        "Authorization": f"Basic {credential}",
    },
    json={"email": username, "password": password},
)
```
I guess their basic authorization token is derived from the username, password, some salt, etc. I haven't cracked it yet, so we probably have to save the credentials instead of doing it the normal way with `auth` arguments and `Session` objects. That could mean manually getting new credentials when they expire. They don't really check the User-Agent header, but adding one could probably keep us from popping up on their monitoring service as some Scrapy spider.
BTW, in general, if you find an HTTP request you want to learn about in the browser devtools, you can do a "Copy as cURL" to get a command that reproduces the request in curl, save it to a shell script, and tweak the command until you find a "minimal" set of arguments that still gets you the desired response. I use Firefox, but I remember the Chrome devtools also have this feature.
> I guess their basic authorization uses information from username, password, some salt, etc. Haven't cracked it yet, so we probably have to save the credentials instead of doing it the normal way with `auth` arguments and `Session` objects.
... I forgot that basic authentication is almost plaintext. `echo <credential> | base64 -D` shows that they use the username "anonymous" and, as the password, a long encrypted string that doesn't seem to change between login attempts.
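The decoding step can be reproduced in Python as well. The token below is an illustrative stand-in that I constructed for the example, not the site's real credential:

```python
import base64

# Build an illustrative Basic token encoding "anonymous:<opaque-password>".
credential = base64.b64encode(b"anonymous:3a91f0c2d7...").decode()

# Equivalent of `echo <credential> | base64 -D` (-D is the macOS flag;
# GNU base64 uses -d): a Basic token is just "username:password" in base64.
decoded = base64.b64decode(credential).decode()
username, password = decoded.split(":", 1)
print(username)  # anonymous
```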
The Initium is one website on our tracking list, but it requires user login to see the full content of its publications. We have login credentials; however, looking at the Network tab during login, I didn't see any form data or cookies containing those credentials, so I wasn't sure how to do the login with a Scrapy spider.