estin / pomp

Screen scraping and web crawling framework
https://pomp.readthedocs.org

Comprehensive examples #6

Closed danielnaab closed 8 years ago

danielnaab commented 8 years ago

Pomp looks like a nice and simple design - I'm going to give it a try while migrating an existing Scrapy project to Python 3.

However, I would really like to see some more comprehensive examples in the documentation or the repository.

For instance, a larger project would:

These things can be rolled together by any competent Python dev, but I think demonstrating one or more ways to build a full-scale production deployment might help gain a few users.

estin commented 8 years ago

I agree.

I plan to build a demo project in a separate repo, running in Docker containers via docker-compose.

There would be:

This example would show how to use Pomp to build a robust distributed app.

Do you know which web resource would be the best target for that, one that legally allows scraping?

Sorry for my poor English

danielnaab commented 8 years ago

That sounds great. I'm using Django models also, but haven't worked out how to handle queuing yet... Another idea for an example is throttling speed by domain.
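One way the per-domain throttling idea could look is sketched below. This is not part of Pomp; `DomainThrottle`, `min_delay`, and the injectable `clock` and `sleep` parameters are hypothetical names chosen for illustration, and the sketch assumes a single-threaded crawler that calls `wait()` before each request.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain.

    Hypothetical helper, not a Pomp API. `clock` and `sleep` are
    injectable so the behaviour can be checked without real waiting.
    """

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # domain -> earliest time of the next request

    def wait(self, url):
        """Block until the domain of `url` may be hit; return seconds slept."""
        domain = urlparse(url).netloc
        now = self._clock()
        ready_at = self._next_allowed.get(domain, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        # The next request to this domain must wait min_delay past this one.
        self._next_allowed[domain] = max(now, ready_at) + self.min_delay
        return delay
```

A crawler would call `throttle.wait(url)` right before fetching `url`. Requests to different domains do not delay each other, because each domain keeps its own timestamp.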

Here are some data ideas:

I had an additional question after reading through the code: is using pomp.core.item.Item and pomp.core.item.Field required, or can the crawlers just return dictionaries instead? If not required, maybe they should go in contrib? (I was looking into using voluptuous or marshmallow for schema.)
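To make the dictionary-plus-schema idea concrete, here is a minimal plain-Python stand-in for what voluptuous or marshmallow would do. It uses neither library's API; `validate` and `LISTING_SCHEMA` are hypothetical names. It also assumes nothing about whether Pomp accepts dictionaries, which is exactly the open question above.

```python
def validate(record, schema):
    """Return a validated copy of `record`, or raise ValueError.

    `schema` maps each required key to the type its value must have.
    """
    missing = schema.keys() - record.keys()
    extra = record.keys() - schema.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    result = {}
    for key, expected_type in schema.items():
        value = record[key]
        if not isinstance(value, expected_type):
            raise ValueError(f"{key!r} should be {expected_type.__name__}")
        result[key] = value
    return result


# Hypothetical schema for a scraped listing.
LISTING_SCHEMA = {"title": str, "price": float}
```

A crawler returning plain dictionaries could pass each one through `validate(item, LISTING_SCHEMA)` before handing it to a pipeline.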

estin commented 8 years ago

> maybe they should go in contrib? (I was looking into using voluptuous or marshmallow for schema.)

Yes! It should be placed in contrib and should not restrict users. Thanks!

pomp.core.item has been moved to pomp.contrib.item, without backward compatibility.

The Item interface is necessary for developing pluggable pipelines like pomp.contrib.pipelines.CsvPipeline, where the order of fields is required.
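A rough illustration of why a CSV-style pipeline needs an agreed field order. This is not the actual `pomp.contrib.item` or `CsvPipeline` code; `OrderedItem`, `Listing`, and `write_csv` are hypothetical names, and the sketch only assumes that each item class declares its fields in a fixed order.

```python
import csv
import io


class OrderedItem:
    """Hypothetical item base class: `fields` declares the column order."""

    fields = ()

    def __init__(self, **values):
        unknown = set(values) - set(self.fields)
        if unknown:
            raise KeyError(f"unknown fields: {sorted(unknown)}")
        self.values = values

    def as_row(self):
        # Emit values in declared order, not in keyword-argument order.
        return [self.values.get(name, "") for name in self.fields]


class Listing(OrderedItem):
    fields = ("title", "price", "url")


def write_csv(item_class, items, stream):
    """Write a header row from `item_class.fields`, then one row per item."""
    writer = csv.writer(stream)
    writer.writerow(item_class.fields)
    for item in items:
        writer.writerow(item.as_row())


buf = io.StringIO()
write_csv(Listing, [Listing(url="http://example.com/1", title="Bike", price="50")], buf)
```

Because the columns come from the class declaration, every row lines up with the header regardless of the order in which a crawler filled in the values. A bare dictionary gives a generic pipeline no such declared order to rely on.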

estin commented 8 years ago

A big example: Craigslist crawler. I will publish a screencast soon.

But this example is much more about how to build a cluster of web crawlers and less about pomp features and its internals.

estin commented 8 years ago

Screencast (without audio) of the Craigslist crawler.

I don't know when I will write a more detailed description and publish it on reddit or other resources.

danielnaab commented 8 years ago

Awesome, thanks!