OCHA-DAP / Data-Team

A place for tracking data team issues

Finish the 'Building a Scraper Strategy' document. #35

Closed luiscape closed 10 years ago

luiscape commented 10 years ago

A significant part of our data collection process relies on scrapers built by ScraperWiki. We need a scraper strategy to manage the building of new scrapers, the maintenance of old scrapers, and the day-to-day management of scrapers.

JavierTeran commented 10 years ago

Copy/paste of thoughts from Luis:

Hi David,

I was thinking about that yesterday. Due to sparse documentation, it took me a few hours to understand how the ScraperWiki platform works. Now that I understand it a little better, it seems to be quite a powerful platform for building scrapers because: (a) it runs an Ubuntu virtual environment for every "tool", i.e. scraper; (b) there are a few tools that work quite well already, such as the export tool and the "push to a CKAN instance" tool. My impression is that you can do a lot with data there: scrape, collect, transform, store, and push somewhere.
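For illustration only, here is a minimal sketch of the kind of scraper that might run inside one of those per-tool environments. The URL, field names, and page structure are hypothetical; I'm assuming the scraperwiki Python library, whose sqlite.save() is a common way for such scrapers to store rows in a local SQLite database:

```python
# Illustrative sketch, not an actual ScraperWiki tool. The target URL and
# field names are placeholders.
import requests
from bs4 import BeautifulSoup
import scraperwiki

URL = "https://example.org/refugee-figures"  # hypothetical source page

response = requests.get(URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

for row in soup.select("table tr")[1:]:  # skip the header row
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if len(cells) < 2:
        continue
    record = {"country": cells[0], "value": cells[1]}
    # Save (upsert) the record into the tool's local SQLite store.
    scraperwiki.sqlite.save(unique_keys=["country"], data=record)
```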

As for (a), the advantage I see is that it allows people to write scrapers in virtually any language. There is an onus on our end to create standards for such scrapers and to debug those that are not working, but at least in principle, if we work with highly capable people like Andrew R., we could have them build scrapers there and monitor them with a friendly interface. That is, anyone from the data team, from Javier to Sam, would be able to schedule and run the scrapers and potentially push the data to CKAN.

With that said, especially about point (b) above, I would like to explore ScraperWiki a little more. They do have the "push to a CKAN instance" tool in their platform, but I haven't tested it. It seems that we could edit that tool to push data to our instance. However, I have no idea how it can be adapted.
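As a rough sketch of what adapting a "push to a CKAN instance" step involves, the standard CKAN action API exposes resource_create, which can be called with a plain HTTP POST. The instance URL, API key, and dataset id below are placeholders, not our real configuration:

```python
# Rough sketch of pushing a scraped file to a CKAN instance via the
# standard CKAN action API. All values below are placeholders.
import requests

CKAN_URL = "https://data.example.org"   # placeholder CKAN instance
API_KEY = "my-api-key"                  # placeholder API key
DATASET_ID = "example-dataset"          # placeholder dataset (package) id

with open("data.csv", "rb") as csv_file:
    response = requests.post(
        CKAN_URL + "/api/3/action/resource_create",
        headers={"Authorization": API_KEY},
        data={
            "package_id": DATASET_ID,
            "name": "Scraped data",
            "format": "CSV",
        },
        files={"upload": csv_file},
    )

response.raise_for_status()
print(response.json()["result"]["id"])  # id of the newly created resource
```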

If having a considerable number of scrapers is part of our strategy, I would seriously consider ScraperWiki's platform. There is another platform I've tested called Morph.io, but it is in its early days of development -- and it seems much more limited.

In short, right now I think it isn't a bad idea to consider them for collecting, transforming, and pushing uncurated data to CKAN.

I will let you know if I discover anything worth consideration about using their platform.

Best, // Luis

enw commented 10 years ago

Thx for sharing, Javier. (I'm getting these comments mailed to me...)

Both ScraperWiki and Morph use LXC / Linux Containers as their virtualization technology. LXC is really lightweight (and arguably safer) compared to older approaches to virtualization because it doesn't require a separate hypervisor: it runs a single kernel and isolates the containers in separate namespaces using cgroups and chroot.

The most interesting thing about Morph.io is that they're packaging the scrapers as a Heroku-like service. They hope to keep scrapers free for most users but you may want to reach out to them (or to the ScraperWiki folks) if you plan on using their components as part of HDX. -e


luiscape commented 10 years ago

@enw Indeed. I sent a few emails to Hanare, one of the core developers of Morph.io. My conclusion is that the service is in its early days of development. It is also much more limited than ScraperWiki's platform. The main difference is that SW actually gives you access to the virtual machine via SSH, whereas Morph only allows you to run a scraper directly from GitHub. In SW you can build a number of tools to display and export your data; on Morph you can't.
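To make the contrast concrete, this is a rough sketch of the Morph.io model as I understand it: a single scraper file in a GitHub repository that Morph runs for you, with results expected in a local data.sqlite database (which the scraperwiki library writes to by default). The URL and fields below are hypothetical:

```python
# scraper.py -- roughly what a Morph.io scraper looks like: one file in a
# GitHub repository, writing its results to a local SQLite database.
import requests
from bs4 import BeautifulSoup
import scraperwiki

html = requests.get("https://example.org/listing").text  # placeholder source
soup = BeautifulSoup(html, "html.parser")

for item in soup.select("li.record"):  # hypothetical page structure
    scraperwiki.sqlite.save(
        unique_keys=["name"],
        data={"name": item.get_text(strip=True)},
    )
```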

For those reasons and the fact that we are already using SW, I would think that continuing with SW is a great option. Let's see how Morph continues to be developed though.

I'll add all those comments to the 'Building a Scraper Strategy' document. Thanks for contributing!

luiscape commented 10 years ago

Closing as the scope of the task has changed.