do-me / fast-instagram-scraper

A fast Instagram Scraper based on Torpy.

Media from last iteration is not downloaded #2

Closed gschievelbein closed 3 years ago

gschievelbein commented 3 years ago

When --save_media is True, once --max_posts is reached, the program exits without downloading the media.

do-me commented 3 years ago

Thanks for reporting. It looks like once --max_posts is reached, the main process exits and kills the media download thread. I'll look into it. For now, a quick and dirty fix would be to add time.sleep(600) at line 263 to keep the main process alive for an additional 10 minutes.
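
A minimal sketch of that workaround (the line number refers to the current script; the 10-minute grace period is arbitrary):

```python
import time

# ... metadata loop finishes around line 263 ...
# Keep the main process alive so the media download thread is not killed
# when the script would otherwise exit; 600 s = 10 minutes of grace period.
time.sleep(600)
```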

gschievelbein commented 3 years ago

I think the best solution would be a separate module that downloads the media from the resulting csv/json produced by the crawler. Simply waiting a bit longer works well for a small number of posts, but with a larger number of posts (~100k) the media download falls behind quite a lot. If we are interested in a list of location ids or hashtags, it doesn't make sense to wait for the media download to finish before crawling the next location's metadata. A separate module also makes it easier to preprocess the csv/json file and remove unnecessary posts such as spam before downloading the media, so we don't waste requests on useless data.
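
Roughly what I have in mind, as a sketch only; the column names "display_url" and "shortcode" are assumptions about the crawler's export format:

```python
import csv
import os

import requests


def download_media(csv_path: str, out_dir: str = "media") -> None:
    """Download images listed in the crawler's CSV export (sketch only)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = row.get("display_url")  # assumed column name
            if not url:
                continue  # rows removed during preprocessing (e.g. spam) are skipped
            target = os.path.join(out_dir, f"{row['shortcode']}.jpg")  # assumed column name
            resp = requests.get(url, timeout=30)
            if resp.ok:
                with open(target, "wb") as img:
                    img.write(resp.content)


if __name__ == "__main__":
    download_media("posts.csv")
```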

do-me commented 3 years ago

I thought about this in the beginning. I already have a script that does this, but I wanted a convenient way to do it all in one. I never really tested large-scale media scraping since I'm personally only interested in metadata.

A separate module only works if you don't wait too long, as the image URLs scraped from the metadata expire after a certain time. So the recommended way to scrape 100k+ posts including images would probably be to scrape smaller batches and download the pics immediately.

I'll clean up the code and publish it here when ready.

gschievelbein commented 3 years ago

Yes, I'm also a bit worried about the timeout. I wrote a separate script to download images from the csv file: https://github.com/gschievelbein/fast-instagram-scraper/blob/img-crawler/app/instagram_image_scraper.py It creates a folder named after the csv file and checks that folder for already downloaded images.

do-me commented 3 years ago

The only thing I could implement to make sure that really every picture is downloaded is to simply wait for the image download to finish before mining the next page of metadata. As Fast Instagram Scraper is focused on speed and efficiency, I don't really like the idea of slowing it down that drastically. So in the end it's up to you to do either of these:

  1. Simply test for your use case whether everything is downloaded or whether too many threads are spawned and block your system. In that case you could tweak any of the wait flags. This is probably the most elegant solution, but it slows everything down.
  2. Don't download the pics with the main script: first download the metadata, then download the pics separately with a different script. However, after a certain time the image links are no longer valid, so I'd recommend mining metadata in batches and downloading the media after every batch (see the sketch below). This seems like an okay-ish compromise to me.
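
A sketch of that batch compromise, assuming two hypothetical helpers: scrape_metadata_batch stands in for the scraper's metadata loop, and download_media for a separate image downloader; the "display_url" and "shortcode" fields are likewise assumptions:

```python
import requests


def scrape_metadata_batch(hashtag: str, batch_size: int) -> list[dict]:
    """Hypothetical stand-in for the scraper's metadata loop; returns post dicts."""
    return []


def download_media(posts: list[dict]) -> None:
    """Fetch each image right away, while its signed URL is still valid."""
    for post in posts:
        resp = requests.get(post["display_url"], timeout=30)  # assumed field name
        if resp.ok:
            with open(f"{post['shortcode']}.jpg", "wb") as f:  # assumed field name
                f.write(resp.content)


def scrape_in_batches(hashtag: str, total_posts: int, batch_size: int = 500) -> None:
    scraped = 0
    while scraped < total_posts:
        batch = scrape_metadata_batch(hashtag, batch_size)
        if not batch:
            break  # no more posts available
        download_media(batch)  # download before the image links expire
        scraped += len(batch)
```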

You can find the image download script on my blog; it conveniently downloads from json files.

As I cannot change anything in the code base right now, I'm closing this issue.