If you do not mind, could you also mention streaming large files to disk?
https://stackoverflow.com/questions/30267943/elixir-download-a-file-image-from-a-url
Use a custom pipeline to manage the downloading. In your spider, scrape the media URLs and pass them along as a nested map key, then pattern match on that key in the pipeline.
https://hexdocs.pm/crawly/basic_concepts.html#custom-item-pipelines
Crawly processes items sequentially, so for long downloads you might want to offload them to a queue or hand them off to an async Task, as sketched below.
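A minimal sketch of such a pipeline, assuming the item carries its URLs under a :media key and that HTTPoison is a dependency (the module name, the :media field, and the :dest option are illustrative, not an existing Crawly API):

```elixir
# Hypothetical custom pipeline: downloads every URL stored under the
# item's :media key and writes it to disk. Adjust the field name and
# destination folder to match your item shape.
defmodule MyApp.Pipelines.DownloadMedia do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, opts \\ []) do
    dest = Keyword.get(opts, :dest, "/tmp/downloads")
    File.mkdir_p!(dest)

    item
    |> Map.get(:media, [])
    |> Enum.each(fn url ->
      # For very large files, prefer streaming the body to disk
      # (e.g. HTTPoison's async responses) instead of holding it in memory.
      %HTTPoison.Response{body: body} = HTTPoison.get!(url)
      File.write!(Path.join(dest, Path.basename(url)), body)
    end)

    {item, state}
  end
end
```

Registered under the :pipelines key of the :crawly config, it would run once for every scraped item.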
@s0kil I think @Ziinc gave a good answer; a pipeline is a good way to go! Otherwise, in my own projects I download media directly from the parse_item callback. Crawly is itself a queue management system, so technically your worker will just spend a bit more time downloading the image, that's it.
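For comparison, a rough sketch of the parse_item approach; the spider name, selector, destination folder, and the Floki/HTTPoison calls are illustrative assumptions, and it presumes the image URLs are absolute:

```elixir
defmodule MyApp.ImageSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://example.com/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    image_urls =
      response.body
      |> Floki.parse_document!()
      |> Floki.attribute("img", "src")

    File.mkdir_p!("/tmp/images")

    # The downloads happen inside this worker; the other workers keep crawling.
    Enum.each(image_urls, fn url ->
      %HTTPoison.Response{body: body} = HTTPoison.get!(url)
      File.write!(Path.join("/tmp/images", Path.basename(url)), body)
    end)

    %Crawly.ParsedItem{items: [%{images: image_urls}], requests: []}
  end
end
```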
@Ziinc shall we create a pipeline capable of auto-downloading images? e.g. {Crawly.Pipelines.DownloadMedia, field: :image, dest: "/folder_name"}?
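For illustration only (no such pipeline exists yet), the proposed tuple would slot into the usual pipeline configuration like this; paths and the :image field are placeholders:

```elixir
# config/config.exs — Crawly.Pipelines.DownloadMedia is the proposed,
# not-yet-existing pipeline; the other entries are existing Crawly pipelines.
import Config

config :crawly,
  pipelines: [
    {Crawly.Pipelines.DownloadMedia, field: :image, dest: "/tmp/media"},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"}
  ]
```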
@oltarasenko Will downloading a large file in parse_item block the spider from continuing to crawl and parse?
No, it does not block Crawly itself, only the one worker that is doing the download; all other workers remain operational. (Compared with Scrapy, where non-reactor-based downloads block the world, Crawly operates without problems.)
Is it too much to ask for an example project, such as https://github.com/oltarasenko/crawly-spider-example, that saves each blog post into an individual file?
@oltarasenko sounds like a good idea, I'll think a bit more about the API and update here. I should have time for it in the coming weeks.
@s0kil I think it would be more appropriate to have a how-to article in the docs. There are some inherent issues with having many example repos, such as the maintenance burden and keeping them in sync.
@s0kil could you give some info on how you are working around the downloading of files now?
shall we create a pipeline capable of auto-downloading images? e.g. {Crawly.Pipelines.DownloadMedia, field: :image, dest: "/folder_name"}?
Just a heads-up - I've started working on such a generic pipeline today.
@Ziinc I could not get it working yet.
@oltarasenko I will implement a generic supervised task execution process, as mentioned in https://github.com/oltarasenko/crawly/pull/88#issuecomment-626103255, for pipelines to hook into.
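Assuming that builds on something like Elixir's Task.Supervisor, a rough sketch of how a pipeline could hand off a download without blocking (the supervisor and module names are made up):

```elixir
# Hypothetical helper: fire-and-forget downloads under a Task.Supervisor.
# Assumes {Task.Supervisor, name: MyApp.DownloadSupervisor} is started in
# the application's supervision tree and HTTPoison is available.
defmodule MyApp.MediaDownloader do
  def download_async(url, dest \\ "/tmp/media") do
    Task.Supervisor.start_child(MyApp.DownloadSupervisor, fn ->
      File.mkdir_p!(dest)
      %HTTPoison.Response{body: body} = HTTPoison.get!(url)
      File.write!(Path.join(dest, Path.basename(url)), body)
    end)
  end
end
```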
Are there any examples of saving multiple files? For example, saving multiple images for each Crawly request. So far I have only come across the WriteToFile pipeline, which seems to be meant for saving data into a single file (CSV, JSON, etc.).