disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

the initium log in issue #96


andreawwenyi commented 4 years ago

The Initium is one of the websites on our tracking list; however, it requires user login to see the full content of its publications. We have login credentials, but by looking at the Network tab during login I didn't see any form data, cookies, etc. that contain the credentials, so I was not sure how to do the login with a Scrapy spider.

pm5 commented 4 years ago

Take https://theinitium.com/article/20200329-culture-chan-chuan-xing-photography-taiwan/ as an example; I think it goes like this:

  1. A POST request to https://api.theinitium.com/api/v1/auth/login/?language=zh-hant carries the login credentials, followed by
  2. A GET request to https://api.theinitium.com/api/v1/article/detail/?language=zh-hant&slug=20200329-culture-chan-chuan-xing-photography-taiwan whose response is the article contents in JSON.

This probably means we need a custom spider to crawl it.
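
If that's right, a rough sketch of the flow with requests might look like the following. The payload keys and the cookie handling are guesses from the devtools capture, not verified:

import requests

session = requests.Session()

# Step 1: log in. Guessing the payload keys from the captured request; this
# assumes the login response sets session cookies that the Session will
# carry over to later requests.
session.post(
    "https://api.theinitium.com/api/v1/auth/login/?language=zh-hant",
    json={"email": "user@example.com", "password": "secret"},
)

# Step 2: fetch the article detail; the response body is JSON containing
# the article contents.
r = session.get(
    "https://api.theinitium.com/api/v1/article/detail/",
    params={
        "language": "zh-hant",
        "slug": "20200329-culture-chan-chuan-xing-photography-taiwan",
    },
)
article = r.json()

If the login state is carried in a header rather than cookies, a plain Session won't be enough, so this needs checking against the real site.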

andreawwenyi commented 4 years ago

@pm5 Thanks! Sounds like a similar scheme to appledaily's, so we might be able to use the same login_discover_spider. However, have you been able to successfully send a POST request with the credentials to https://api.theinitium.com/api/v1/auth/login/?language=zh-hant? I see that there's a request payload with keys "email" and "password" accompanying the POST request, but I always get a 401 when doing r = requests.post("https://api.theinitium.com/api/v1/auth/login/?language=zh-hant", json={"email":<email>, "password":<pwd>}).

pm5 commented 4 years ago

@andrea-w-wang from what I see they use basic access authentication (the Authorization: Basic <token> header) in all requests to the API, plus an X-Client-Name: Web header, presumably for their own apps. So something like:

import requests

r = requests.post(
    "https://api.theinitium.com/api/v1/auth/login/?language=zh-hant",
    headers={
        "User-Agent": "Mozilla/5.0",             # not actually checked, but keeps us low-profile
        "X-Client-Name": "Web",                  # header their own web client sends
        "Authorization": f"Basic {credential}",  # Basic token captured from the browser
    },
    json={"email": username, "password": password},
)

I guess their Basic authorization token is derived from the username, password, some salt, etc. I haven't cracked it yet, so we probably have to save the credentials instead of doing it the normal way with auth arguments and Session objects. That could mean we have to get new credentials manually when they expire. They don't really check the User-Agent header, but adding one could keep us from popping up on their monitoring service as some Scrapy spider.
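
A minimal sketch of what "saving the credentials" could look like: capture the token from the browser once, keep it in a file, and attach it to every request. The file path and key name here are made up:

import json
import pathlib

import requests

# Hypothetical location and key name for the manually captured Basic token;
# update the file by hand whenever the token expires.
CREDENTIAL_FILE = pathlib.Path("secrets/theinitium.json")

def auth_headers():
    token = json.loads(CREDENTIAL_FILE.read_text())["basic_token"]
    return {
        "User-Agent": "Mozilla/5.0",
        "X-Client-Name": "Web",
        "Authorization": f"Basic {token}",
    }

r = requests.post(
    "https://api.theinitium.com/api/v1/auth/login/?language=zh-hant",
    headers=auth_headers(),
    json={"email": "user@example.com", "password": "secret"},
)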

pm5 commented 4 years ago

BTW, in general, if you find an HTTP request you want to learn about in the browser devtools, you can do a "Copy as cURL" to get a command that reproduces the request in curl, save it to a shell script, and tweak the command until you find a "minimal" set of arguments that still gets you the desired response. I use Firefox, but I remember the Chrome devtools also have this feature.

pm5 commented 4 years ago

> I guess their Basic authorization token is derived from the username, password, some salt, etc. I haven't cracked it yet, so we probably have to save the credentials instead of doing it the normal way with auth arguments and Session objects.

... I forgot that Basic authentication is almost plaintext. echo <credential> | base64 -D shows that they use the username "anonymous" and, as the password, a long encrypted-looking string that doesn't seem to change between login attempts.
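
For the record, the same check in Python; the token below is a stand-in built from a made-up password, not the real one:

import base64

# Build a stand-in token the same way theirs appears to be built:
# base64("anonymous:<long encrypted-looking string>").
token = base64.b64encode(b"anonymous:some-long-crypted-text").decode("ascii")

# Decoding it back splits into the username and the password part.
decoded = base64.b64decode(token).decode("utf-8")
username, _, password = decoded.partition(":")
print(username)  # anonymous
print(password)  # some-long-crypted-text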