The-Eye-Team / WallhavenScraper

:sunrise: Scraper for wallhaven.cc
MIT License

Reusing session cookies to scrape NSFW pictures #4

Closed riptl closed 5 years ago

riptl commented 5 years ago

NSFW image pages require a login to view on WallHaven. We tried implementing a login using http.CookieJar as well as serializing cookies by hand to no avail.

This is an example of a protected URL: https://alpha.wallhaven.cc/wallpaper/193

Note that the actual image file is available! What we need is the HTML page containing the metadata.

All that's needed to view the image page is the correct cookie header, consisting of multiple cookies including a session token. The cookies are set in each server response.
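To make the "serialize by hand" idea concrete, here is a minimal sketch that folds the Set-Cookie values of successive responses into one Cookie header. The helper names and the cookie names in `main` are made up for illustration; they are not taken from the scraper or from WallHaven.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"strings"
)

// mergeSetCookies parses raw Set-Cookie header values and folds them into
// store (cookie name -> value). A later response overwrites earlier values.
func mergeSetCookies(store map[string]string, setCookieLines []string) {
	resp := http.Response{Header: http.Header{"Set-Cookie": setCookieLines}}
	for _, c := range resp.Cookies() {
		store[c.Name] = c.Value
	}
}

// cookieHeader serializes store as a Cookie request header value.
// Names are sorted only to make the output deterministic.
func cookieHeader(store map[string]string) string {
	names := make([]string, 0, len(store))
	for name := range store {
		names = append(names, name)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, name := range names {
		parts = append(parts, name+"="+store[name])
	}
	return strings.Join(parts, "; ")
}

func main() {
	store := map[string]string{}
	// Hypothetical cookies from a first response.
	mergeSetCookies(store, []string{
		"session_token=abc; Path=/; HttpOnly",
		"csrf=111; Path=/",
	})
	// A second response rotates the (hypothetical) session token.
	mergeSetCookies(store, []string{"session_token=def; Path=/; HttpOnly"})
	fmt.Println(cookieHeader(store))
}
```

With a real response you would pass `resp.Header["Set-Cookie"]` to `mergeSetCookies` and set the result with `req.Header.Set("Cookie", cookieHeader(store))` before the next request.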

The -u and -p flags are used for authentication.

The master branch uses http.CookieJar for cookies. The cookie-test branch reads the cookies externally and serializes them back together before requests.
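For anyone picking this up, the jar-based approach looks roughly like the sketch below. It is not the code from master: the login path, the form field names (`username`, `password`) and the `session` cookie are assumptions, and `fakeSite` is a local stand-in server so the example runs offline. A real login form may also require extra hidden fields, which this sketch does not handle.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
	"net/url"
)

// newSessionClient returns a client whose jar stores cookies from every
// response and replays the matching ones on later requests.
func newSessionClient() (*http.Client, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, err
	}
	return &http.Client{Jar: jar}, nil
}

// login posts the credentials as a form. The field names are assumptions.
func login(c *http.Client, loginURL, user, pass string) error {
	resp, err := c.PostForm(loginURL, url.Values{
		"username": {user},
		"password": {pass},
	})
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("login failed: %s", resp.Status)
	}
	return nil
}

// fetchPage returns the status code and body of pageURL.
func fetchPage(c *http.Client, pageURL string) (int, string, error) {
	resp, err := c.Get(pageURL)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

// fakeSite is a local stand-in for the real site, used only for this demo.
func fakeSite() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/auth/login", func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "session", Value: "s3cret", Path: "/"})
	})
	mux.HandleFunc("/wallpaper/193", func(w http.ResponseWriter, r *http.Request) {
		if c, err := r.Cookie("session"); err != nil || c.Value != "s3cret" {
			http.Error(w, "login required", http.StatusForbidden)
			return
		}
		fmt.Fprint(w, "<html>metadata</html>")
	})
	return httptest.NewServer(mux)
}

func main() {
	site := fakeSite()
	defer site.Close()

	client, err := newSessionClient()
	if err != nil {
		panic(err)
	}
	if err := login(client, site.URL+"/auth/login", "user", "pass"); err != nil {
		panic(err)
	}
	status, body, err := fetchPage(client, site.URL+"/wallpaper/193")
	fmt.Println(status, body, err)
}
```

The point of the jar is that `login` and `fetchPage` never touch cookies explicitly; if the protected page still rejects the request, dumping what the jar holds for that URL (`client.Jar.Cookies(u)`) is a quick way to see which cookie is missing.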

If you can get login + viewing NSFW to work on either of the branches, let us know asap and please file a Pull Request or contact https://the-eye.eu


This is an urgent issue: in under 5 hours, WallHaven will completely switch their site structure, which will make crawling much harder.

Any pull requests that bring us closer to fixing this are highly welcome!

nektro commented 5 years ago

I made a gist that grabs directly from the CDN https://gist.github.com/nektro/3a4c25eb66cb0abf24b84c0239acddbb

Example: https://alpha.wallhaven.cc/wallpaper/193 maps to https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-193.jpg
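The mapping in that example can be sketched as a small helper. This is not the gist's code; `cdnURL` is a hypothetical name, and the sketch assumes every image is a `.jpg`, which the single example above does not guarantee.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strconv"
)

// cdnURL derives the full-size image URL from a wallpaper page URL by
// taking the numeric ID at the end of the path.
// Assumption: the file extension is always .jpg.
func cdnURL(pageURL string) (string, error) {
	u, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	id := path.Base(u.Path)
	if _, err := strconv.ParseUint(id, 10, 64); err != nil {
		return "", fmt.Errorf("no numeric wallpaper ID in %q", pageURL)
	}
	return "https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-" + id + ".jpg", nil
}

func main() {
	img, err := cdnURL("https://alpha.wallhaven.cc/wallpaper/193")
	if err != nil {
		panic(err)
	}
	fmt.Println(img)
}
```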

CorentinB commented 5 years ago

https://gist.github.com/nektro/3a4c25eb66cb0abf24b84c0239acddbb

That's already what we do. The problem is something else; please look at the issue and the code.

riptl commented 5 years ago

@nektro Sorry, I forgot to mention: this is strictly about the HTML page containing the metadata! Thanks for looking into this.

riptl commented 5 years ago

Cookies copied from Chrome work; the --cookie flag is a workaround: https://github.com/CorentinB/WallhavenScraper/commit/e2215b6f2b9a713f5949a99f0e213bacf29ea0c8
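In rough terms, a workaround of this kind can look like the sketch below: take the raw Cookie header from a logged-in browser session and send it verbatim. This is not the code from the linked commit; `newRequestWithCookie` is a hypothetical helper, and the example prints the request instead of sending it so that it runs offline.

```go
package main

import (
	"flag"
	"fmt"
	"net/http"
)

// newRequestWithCookie builds a GET request that carries rawCookie verbatim
// in its Cookie header, bypassing any cookie jar.
func newRequestWithCookie(pageURL, rawCookie string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, pageURL, nil)
	if err != nil {
		return nil, err
	}
	if rawCookie != "" {
		req.Header.Set("Cookie", rawCookie)
	}
	return req, nil
}

func main() {
	// Go's flag package accepts both -cookie and --cookie.
	cookie := flag.String("cookie", "", "raw Cookie header value copied from a logged-in browser session")
	flag.Parse()

	req, err := newRequestWithCookie("https://alpha.wallhaven.cc/wallpaper/193", *cookie)
	if err != nil {
		panic(err)
	}
	// Printed rather than sent, so the sketch needs no network access.
	fmt.Printf("%s %s\nCookie: %s\n", req.Method, req.URL, req.Header.Get("Cookie"))
}
```

The value to pass is whatever the browser's developer tools show as the Cookie request header for the protected page; it stops working once the session expires.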

riptl commented 5 years ago

Workaround working ...