ArchiveTeam / NewsGrabber

Grabbing all news.

Handling of paywalled sites #91

Open klslbmmytw opened 7 years ago

How are they handled? To what degree can they be bypassed without getting in trouble?

Some sites have paywalls that rely on JS, which can either "fail open" or "fail closed". For example, if JS is disabled on dn.se, all articles render fine; if it's enabled, "locked" articles are hidden (after a short delay while the JS loads).

Example of "fail open": https://www.dn.se/nyheter/nyheter-hem/infrastrukturministern-kommer-traffa-generaldirektor-snarast/
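A "fail open" paywall of this kind can be captured by any client that simply never executes JavaScript (wget, curl, or a few lines of Python). A minimal sketch of the idea — the function names and the sample page below are hypothetical, not taken from dn.se:

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Rebuild an HTML document while dropping <script> elements entirely.

    A client like this never runs the paywall JS, so a "fail open" site
    effectively serves the full article text in its initial HTML response.
    """

    def __init__(self):
        super().__init__()
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            attr_str = "".join(f' {k}="{v}"' for k, v in attrs if v is not None)
            self.out.append(f"<{tag}{attr_str}>")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)


def strip_scripts(html: str) -> str:
    parser = ScriptStripper()
    parser.feed(html)
    return "".join(parser.out)


# Hypothetical page: the article body is present in the raw HTML,
# and only a script (which we never run) would hide it.
page = (
    "<html><body>"
    "<article>Full article text served to every client.</article>"
    "<script>hidePaywalledContent();</script>"
    "</body></html>"
)
print(strip_scripts(page))
```

The point is that no circumvention code is needed for this class of site: the grab just has to avoid executing the page's scripts, which a plain HTTP fetch already does.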

I presume the first type is okay to scrape, since you're never required to run JS. Would bypassing the latter type be illegal, or taint the results?

Some sites have server-side paywalls, but you can get one month/day/week for free by registering an account without providing payment info. Are these okay to scrape?

Some sites have server-side paywalls where you have to provide payment info to register an account, but it can be blatantly fake (card number 1234123412341234, phone number 123123, asdasd goes in the other fields). Are these okay to scrape?

Some sites have server-side paywalls where you have to provide payment info to register an account, and they perform rudimentary validation (a Luhn check on the card number, the date of birth has to be a valid date). Are these okay?
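For reference, the Luhn check mentioned here is just a checksum over the digits, so validation of this kind only proves a number is well-formed, not that a real card exists. A minimal sketch:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    # Walk from the rightmost digit; double every second digit,
    # subtracting 9 when the doubled value exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("1234123412341234"))  # → False (the "blatantly fake" number above)
print(luhn_valid("4242424242424242"))  # → True (a well-formed but non-issued test number)
```

So the difference between this case and the previous one is only whether the made-up number has to pass this arithmetic, not whether it corresponds to an actual account.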

Some sites require valid (i.e. functioning) payment info and verify it with a small $0.01 transaction. Is it okay to register accounts with real info and then use them for scraping?

Some sites have poor login security. Is it okay to obtain a list of logins and use them? This is blatantly illegal and hard to decentralize, but would information scraped through methods like this (e.g. via Tor or a proxy, submitted anonymously) be accepted?

A lot of news sites have paywalls, and it seems a shame not to scrape them. Setting aside technical feasibility, what would be acceptable to scrape? Also, some news sites provide PDF downloads of their paper issues. Is there any project to scrape these?