crawlab-team / crawlab

Distributed web crawler admin platform for spider management, regardless of language or framework.
https://www.crawlab.cn
BSD 3-Clause "New" or "Revised" License

Seaweedfs / Gocolly integration #652

Closed ghost closed 3 years ago

ghost commented 4 years ago

Hi all,

Hope you are all well!

I was just wondering whether it is possible to integrate these two awesome tools into Crawlab; it would be great for storing millions of static objects and for scraping with Golang. A friend and I already did that with https://github.com/lucmichalski/peaks-tires, but we lack horizontal scaling and a crawl management interface. That is why, and how, we found Crawlab.

Thanks for your insights and feedback on the topic.

Cheers, X

hantmac commented 4 years ago

@tikazyq How about integrating gocolly into Crawlab?

tikazyq commented 4 years ago

@x0rzkov First of all, thanks for opening up this issue.

I have long been aware of Colly; it's a great web crawler framework written in Go. As you may be aware, we haven't included Golang in our runtime environment, because Golang spiders can be packaged as executables, uploaded to Crawlab, and run with a shell command. However, an editing and compiling environment for Golang (i.e. where you can edit Go files, compile, and run as you go) is still not figured out. Not sure if this is what you'd like.
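For illustration, a minimal sketch of such a packaged spider, assuming Colly and a placeholder start URL; the binary is built once and Crawlab only needs the shell command to run it:

```go
// main.go -- build with `go build -o spider .` and upload the binary to
// Crawlab; the task's execute command is then just `./spider`.
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Print every page title matched by a CSS selector.
	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println("title:", e.Text)
	})

	if err := c.Visit("https://example.com"); err != nil { // placeholder start URL
		log.Fatal(err)
	}
}
```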

As for SeaweedFS, this is the first time I've heard of that system, so I'm not quite sure how it would be used. Could you let us know how you would like to use it? I guess this has to do with the spider development part, not spider management.

Hope this would help.

tikazyq commented 4 years ago

@tikazyq How about integrating gocolly into Crawlab?

Let's hear what he wants.

ghost commented 4 years ago

We would like to scrape media and data from several sources in a distributed way.

To give a real-world case, please refer to the following:

So, we are scraping lots of data from several websites; for each website, we have created a Go plugin (*.so).

For crawling, we are using gocolly (code example here), and we extract various media from the HTML content (for the admin output, please click here). That's where SeaweedFS is useful, as it can store millions of these media files behind an API, so we could back the media scraping with distributed storage. (That's point one.)

Using gocolly, we define the CSS selectors to scrape the metadata, and we use the Qor framework for the admin.
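To make the two pieces above concrete, here is a rough sketch of how they might fit together: a Colly OnHTML callback with a CSS selector picks up image URLs, and each downloaded file is pushed to a SeaweedFS filer over its HTTP API. The filer address (localhost:8888), the /media path, and the start URL are assumptions, not part of the peaks-tires code:

```go
package main

import (
	"bytes"
	"log"
	"mime/multipart"
	"net/http"
	"strings"

	"github.com/gocolly/colly/v2"
)

// uploadToSeaweed POSTs a file to a SeaweedFS filer as multipart form data.
// The filer address and target directory are assumptions about the deployment.
func uploadToSeaweed(filerURL, name string, data []byte) error {
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	part, err := w.CreateFormFile("file", name)
	if err != nil {
		return err
	}
	if _, err := part.Write(data); err != nil {
		return err
	}
	if err := w.Close(); err != nil {
		return err
	}
	resp, err := http.Post(filerURL+"/"+name, w.FormDataContentType(), &buf)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	c := colly.NewCollector()

	// CSS selector: queue every <img src=...> found on the page.
	c.OnHTML("img[src]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("src"))
	})

	// Store downloaded images in SeaweedFS; skip non-image responses.
	c.OnResponse(func(r *colly.Response) {
		if !strings.HasPrefix(r.Headers.Get("Content-Type"), "image/") {
			return
		}
		if err := uploadToSeaweed("http://localhost:8888/media", r.FileName(), r.Body); err != nil {
			log.Println("upload failed:", err)
		}
	})

	if err := c.Visit("https://example.com"); err != nil { // placeholder start URL
		log.Fatal(err)
	}
}
```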

Using gocolly via Golang plugins could be interesting for Crawlab: if you can compile the Go plugins, it would help keep the interesting features of gocolly like the queue, cloning a collector, and all the tricks we did for peaks-tires.
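For the plugin mechanics specifically, a sketch of what loading such a site plugin could look like, assuming each site crawler is built with `go build -buildmode=plugin` and exports a hypothetical `Crawl` function (the plugin name and symbol are illustrative only):

```go
// Build each site-specific crawler as a plugin, e.g.:
//   go build -buildmode=plugin -o tires.so ./sites/tires
package main

import (
	"log"
	"plugin"
)

func main() {
	p, err := plugin.Open("tires.so")
	if err != nil {
		log.Fatal(err)
	}

	// Look up the exported symbol and assert it to the expected signature.
	sym, err := p.Lookup("Crawl")
	if err != nil {
		log.Fatal(err)
	}
	crawl, ok := sym.(func(startURL string) error)
	if !ok {
		log.Fatal("unexpected signature for Crawl")
	}

	if err := crawl("https://example.com"); err != nil { // placeholder start URL
		log.Fatal(err)
	}
}
```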

To be clear, what we seek today is to scale what we have done for peaks-tires horizontally, and to use an admin interface to manage the next level of complexity.

Hopefully, it is clear enough.

Cheers, X

tikazyq commented 4 years ago

We cannot login to your admin interface.

ghost commented 4 years ago

peaks/peaks

tikazyq commented 4 years ago

peaks/peaks

Your idea is clearer to me now. What we are trying to do is make it easier for users to develop and manage crawlers. Your requirement is to integrate Colly into Crawlab, which is currently entirely feasible and only requires a bit of tweaking.

You may find the documentation (Chinese) helpful. You can use Google Translate if you can't read Chinese; we do plan to make it international.

Basically, the only thing you need to do is read the environment variables passed from Crawlab and save your results to the MongoDB database that Crawlab is operating on. You can also skip this part and handle storage however you want.
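For reference, a rough sketch of that step in Go, assuming the task/collection environment variable names used by Crawlab 0.x (CRAWLAB_TASK_ID, CRAWLAB_COLLECTION); the MongoDB URI variable and the database name below are placeholders, so check the documentation above for your version's exact settings:

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	// Environment variables injected by Crawlab for each task run;
	// CRAWLAB_MONGO_URI is a placeholder -- use whatever connection
	// settings your deployment exposes.
	taskID := os.Getenv("CRAWLAB_TASK_ID")
	collection := os.Getenv("CRAWLAB_COLLECTION")
	uri := os.Getenv("CRAWLAB_MONGO_URI")

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client, err := mongo.Connect(ctx, options.Client().ApplyURI(uri))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// Tag every scraped item with the task ID so Crawlab can associate
	// results with the task that produced them.
	coll := client.Database("crawlab_test").Collection(collection) // placeholder DB name
	_, err = coll.InsertOne(ctx, bson.M{
		"task_id": taskID,
		"url":     "https://example.com", // placeholder scraped fields
		"title":   "example",
	})
	if err != nil {
		log.Fatal(err)
	}
}
```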

You can also join us on Telegram to discuss further. https://t.me/crawlabgroup

ghost commented 4 years ago

Thanks, I will fork crawlab and try some experiments. :-)

tikazyq commented 4 years ago

@x0rzkov Great, looking forward to your results

tikazyq commented 3 years ago

SeaweedFS has been integrated in https://github.com/crawlab-team/crawlab/releases/tag/v0.6.0-beta.20210803