Are possible

qwo commented 9 years ago

This is interesting. I thought that when I click through the terms on the site, I'm not allowed to scrape it. Or am I wrong?

What's the how? :) I'm interested!

bschoenfeld commented 9 years ago

Hmm. I didn't see anything about not scraping the site, though I wouldn't be surprised if that language is in there. They definitely made it difficult. If this use is indeed prohibited, I think I'm going to have to advocate civil disobedience. I'm definitely operating within the bounds of the spirit of the law here. I'm not scraping all data from the site and creating a separate database. I'm simply automating the search process as it was intended to be used. This tool was specifically requested by journalists to help them do their jobs. It seems that they have been asking for the tool and the state refuses to deliver. Hence the disclaimer on their front page, "Statewide searches are not possible."

bschoenfeld commented 9 years ago

@stanzheng, please post the terms where it says this is not allowed. I'll get the word out to the users. We won't be fighting this alone. Journalists and open data activists of Virginia will do most of the fighting for us. Now that they've gotten this tool, the state will probably have to pry it from their grasp.

ttavenner commented 9 years ago

I clicked through a bit looking for TOS and I didn't see anything. I did see a Disclaimer, but that mainly covers their liability. They do mention that they monitor for "illegal" activities but don't explicitly define those activities. I also see on the main page these two points:

Please contact us regarding copyright status before publishing or reselling any documents or any images contained on the webpages for this site.
The OES databases are intended for use by the general public. Due to limitations of equipment and bandwidth, they are not intended to be a source for bulk downloads of OES data. Individuals, companies, IP addresses, or blocks of IP addresses who, in effect, deny service to the general public by generating unusually high numbers of daily database accesses (searches, pages, or hits), whether generated manually or in an automated fashion, may be denied access to these servers without notice.

When scraping is prosecuted, it is generally under copyright law, but what constitutes copyrightable data is vague and confusing. It all centers on whether the data can be considered a creative work. Also, even if the data did fall under copyright, modifying it in significant ways could cause it to be considered a derivative work and therefore make it eligible for its own copyright.

bschoenfeld commented 9 years ago

Awesome @ttavenner. Thanks for the input. Sounds like our biggest concern is the usual issue of getting blocked. Each search hammers their server for over two minutes, so it could be something they notice soon. At the same time, our process is completely tied up for that time as well, which is making our server cost pretty high. Today, we have 6 processes running at a prohibitively high cost of about $200 / month.
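One way to lower the risk of getting blocked is to space out the per-court requests so a statewide search never bursts against their server. The following is only a minimal sketch of that idea, not the project's actual code: `Throttle`, `search_all_courts`, and the `search_one` callback are hypothetical names, and the right interval is an assumption that would need tuning against the site's stated limits.

```python
import time
from typing import Callable, Iterable, List, Optional


class Throttle:
    """Enforce a minimum interval between consecutive calls (hypothetical helper)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock    # injectable so the throttle can be tested without waiting
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call, if any

    def wait(self) -> float:
        """Sleep until min_interval has passed since the last call; return the delay used."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


def search_all_courts(
    courts: Iterable[str],
    search_one: Callable[[str], object],
    throttle: Throttle,
) -> List[object]:
    """Run search_one for each court, pausing between requests.

    search_one is a placeholder for whatever performs a single court search.
    """
    results = []
    for court in courts:
        throttle.wait()
        results.append(search_one(court))
    return results
```

Spacing requests out makes each statewide search take longer, so it trades more process time (and server cost on our side) for a lighter load on theirs.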

ttavenner commented 9 years ago

It might be good to see if @waldoj has any advice on better ways to acquire the data. Transforming it and hosting it locally would almost certainly be less expensive than $200/month.

waldoj commented 9 years ago

Yeah, I think scraping would be fine, legally. But getting blocked is a real concern. I suspect that part of why this functionality doesn't exist on the official site is because of the computational demand of a statewide search. It wouldn't surprise me at all if you found yourself blocked. If your service is causing a significant hit on their servers, perhaps even creating downtime, they'd be well justified in doing so.

Code4HR / va-circuit-court-search

Are possible #1