elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

Downloading Files #77

Closed: s0kil closed this 2 years ago

s0kil commented 4 years ago

Are there any examples of saving multiple files? For example, saving multiple images for each Crawly request. So far I have only come across the WriteToFile pipeline, which seems to be meant for saving scraped data into a single file (CSV, JSON, etc.).

s0kil commented 4 years ago

If you do not mind, could you also mention how to stream large files to disk?

Ziinc commented 4 years ago

https://stackoverflow.com/questions/30267943/elixir-download-a-file-image-from-a-url

Use a custom pipeline to manage the downloading. In your spider, scrape the media URLs and pass them along in the item under a nested map key, then pattern match on that key in the pipeline.

https://hexdocs.pm/crawly/basic_concepts.html#custom-item-pipelines

Crawly processes items sequentially within each worker, so for long downloads you might want to offload the work to a queue or an async Task; a sketch of such a pipeline follows.
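A minimal sketch of that idea, assuming HTTPoison as the HTTP client and an item that carries its media URLs under a :media key (the module name, the :media key, and the :dest option are illustrative, not part of Crawly):

```elixir
defmodule MyApp.Pipelines.DownloadMedia do
  @moduledoc "Illustrative pipeline: downloads every URL under the item's :media key."
  @behaviour Crawly.Pipeline

  require Logger

  def run(item, state, opts \\ [])

  def run(%{media: urls} = item, state, opts) do
    dest = Keyword.get(opts, :dest, "/tmp")

    Enum.each(urls, fn url ->
      case HTTPoison.get(url) do
        {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
          File.write!(Path.join(dest, Path.basename(url)), body)

        other ->
          Logger.warn("Could not download #{url}: #{inspect(other)}")
      end
    end)

    {item, state}
  end

  # Items without a :media key pass through untouched.
  def run(item, state, _opts), do: {item, state}
end
```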

oltarasenko commented 4 years ago

@s0kil I think @Ziinc gave a good answer, a pipeline is a good way to go! Alternatively, in my own projects I download media directly from the parse_item callback, roughly as in the sketch below. Crawly is itself a queue management system, so technically your worker will just spend a bit more time downloading the image, that's it.
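A rough illustration of the parse_item approach (the Floki selectors, HTTPoison client, destination folder, and the assumption that image URLs are absolute are all mine, not prescribed by Crawly):

```elixir
defmodule MyApp.ImageSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url, do: "https://example.com"

  @impl Crawly.Spider
  def init, do: [start_urls: ["https://example.com/gallery"]]

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    # Assumes the pages use absolute image URLs.
    image_urls = Floki.attribute(document, "img", "src")

    # The downloads block only this worker; other workers keep crawling.
    Enum.each(image_urls, fn url ->
      with {:ok, %HTTPoison.Response{status_code: 200, body: body}} <- HTTPoison.get(url) do
        File.write!(Path.join("/tmp/images", Path.basename(url)), body)
      end
    end)

    %Crawly.ParsedItem{
      items: [%{url: response.request_url, images: image_urls}],
      requests: []
    }
  end
end
```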

@Ziinc shall we create a pipeline capable of auto-downloading images? e.g. {Crawly.Pipelines.DownloadMedia, field: :image, dest: "/folder_name"}?
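If such a pipeline existed, configuring it could look something like this (Crawly.Pipelines.DownloadMedia is the proposal above, not a shipped module; JSONEncoder is an existing pipeline shown for context):

```elixir
# config/config.exs -- hypothetical, DownloadMedia does not exist yet
import Config

config :crawly,
  pipelines: [
    {Crawly.Pipelines.DownloadMedia, field: :image, dest: "/tmp/images"},
    Crawly.Pipelines.JSONEncoder
  ]
```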

s0kil commented 4 years ago

@oltarasenko Will downloading a larger file in parse_item block the spider from continuing to crawl and parse?

oltarasenko commented 4 years ago

No, it does not block Crawly itself, only the one worker that is doing the download; all other workers remain operational. (Compare this with Scrapy, where non-reactor-based downloads block the world; Crawly operates without problems.)
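For context, the number of workers per domain is configurable, so a slow download ties up only one of them (concurrent_requests_per_domain is a documented Crawly setting; the value here is just an example):

```elixir
# config/config.exs
import Config

# Up to 8 workers fetch pages for each domain in parallel;
# a long download occupies only one of them.
config :crawly,
  concurrent_requests_per_domain: 8
```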

s0kil commented 4 years ago

Is it too much to ask for an example project, such as https://github.com/oltarasenko/crawly-spider-example, that saves each blog post into an individual file?

Ziinc commented 4 years ago

@oltarasenko sounds like a good idea, I'll think a bit more about the API and update here. I should have time for it in the coming weeks.

@s0kil I think it would be more appropriate to have a how-to article in the docs. There are some inherent issues with having many example repos, such as maintenance and keeping them in sync.

Ziinc commented 4 years ago

@s0kil could you give some info on how you are currently handling file downloads?

michaltrzcinka commented 4 years ago

shall we create a pipeline capable of auto-downloading images? e.g. {Crawly.Pipelines.DownloadMedia, field: :image, dest: "/folder_name"}?

Just a heads-up: I've started working on such a generic pipeline today.

s0kil commented 4 years ago

@Ziinc I have not been able to get it working yet.

Ziinc commented 4 years ago

@oltarasenko I will implement a generic supervised task execution process, as mentioned in https://github.com/oltarasenko/crawly/pull/88#issuecomment-626103255, for pipelines to hook into.
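For reference, a pipeline could already hand a download off to a supervised task along these lines (MyApp.DownloadSupervisor is an assumed name; the Task.Supervisor would need to be started in your application's supervision tree):

```elixir
# In your application's supervision tree:
#   children = [{Task.Supervisor, name: MyApp.DownloadSupervisor}]

# Inside a pipeline's run/3, fire and forget the download so the
# crawl worker is not blocked:
Task.Supervisor.start_child(MyApp.DownloadSupervisor, fn ->
  case HTTPoison.get(url) do
    {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
      File.write!(Path.join("/tmp/downloads", Path.basename(url)), body)

    _error ->
      :ok
  end
end)
```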