gschievelbein closed this issue 3 years ago
Thanks for reporting. It seems that when --max_posts is reached, the main process exits and kills the media download thread.
I'll look into it. For now, a quick and dirty fix would be to add time.sleep(600)
at line 263 to keep the main process alive for an additional 10 minutes.
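The sleep hack works because a daemon thread is killed the moment the main thread exits; keeping the main process alive gives the download thread time to finish. A minimal sketch of the idea, assuming the media download runs on a daemon thread (the worker here is a stand-in, not the scraper's actual code) — joining the thread achieves the same thing without a fixed 10-minute delay:

```python
import threading
import time

results = []

def download_worker():
    # Stand-in for the media download thread.
    time.sleep(0.2)
    results.append("done")

t = threading.Thread(target=download_worker, daemon=True)
t.start()

# Without this, the daemon thread dies as soon as the main thread exits,
# which is what happens when --max_posts is reached.
t.join()  # cleaner than a fixed time.sleep(600)
```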
I think the best solution would be a separate module that downloads the media given the resulting CSV/JSON from the crawler. Simply waiting a bit longer works well when there is a small number of posts, but with a larger number (~100k), the media download falls behind quite a lot. If we are interested in a list of location IDs or hashtags, it doesn't make sense to wait for the media download to finish before crawling the next location's metadata. A separate module also makes it easier to preprocess the CSV/JSON file and remove unnecessary posts such as spam before downloading the media, so we don't waste requests on useless data.
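Such a module could be sketched roughly like this. The column names `display_url` and `likes` are assumptions about the crawler's CSV layout (the real file may differ), and the spam filter here is just a minimal likes threshold to illustrate the preprocessing step:

```python
import csv
from pathlib import Path
from urllib.request import urlretrieve

def media_urls(csv_path, min_likes=0):
    """Yield media URLs from a crawler CSV, skipping likely-spam posts.

    'display_url' and 'likes' are assumed column names, not confirmed
    against the actual crawler output.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if int(row.get("likes", 0)) >= min_likes:
                yield row["display_url"]

def download_all(csv_path, out_dir, min_likes=0):
    # Download every surviving URL into out_dir, numbered sequentially.
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i, url in enumerate(media_urls(csv_path, min_likes)):
        urlretrieve(url, out / f"{i}.jpg")
```

Filtering before downloading is what saves the wasted requests: spam rows never reach the network at all.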
I thought about it in the beginning. I have a script doing this already but wanted to create a convenient way to do all in one. Never really tested large-scale media scraping as personally I'm only interested in metadata.
A separate module works only if you don't wait too long, since the image URLs scraped from the metadata expire after a certain time. So the recommended way to scrape 100k+ posts including images would probably be to scrape smaller batches and download the pictures immediately.
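The batch-then-download loop could be built on a small chunking helper; `scrape_posts` and `download_media` below are hypothetical names standing in for the scraper's actual functions:

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# Hypothetical usage: download each batch while its URLs are still valid.
# for batch in batched(scrape_posts(tag), 500):
#     download_media(batch)
```

Because `batched` is lazy, media for one batch can be fetched before the next page of metadata is even requested, keeping the URL age bounded by the batch size.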
I'll clean up the code and publish it here when ready.
Yes, I'm also a bit worried about the timeout. I wrote a separate script to download images from the CSV file: https://github.com/gschievelbein/fast-instagram-scraper/blob/img-crawler/app/instagram_image_scraper.py It creates a folder named after the CSV file and checks that folder for already downloaded images.
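The resume logic in that script boils down to something like this sketch (a simplification, not the linked script's actual code — the filename-from-URL rule is an assumption):

```python
from pathlib import Path

def pending_downloads(csv_path, urls):
    """Return (url, target_path) pairs that still need downloading.

    Images go into a folder named after the CSV file; anything already
    present on disk is skipped, so the script can resume after a crash.
    """
    out_dir = Path(csv_path).with_suffix("")  # posts.csv -> posts/
    out_dir.mkdir(exist_ok=True)
    todo = []
    for url in urls:
        # Assumed naming rule: last path segment, query string stripped.
        name = url.rsplit("/", 1)[-1].split("?")[0]
        target = out_dir / name
        if not target.exists():
            todo.append((url, target))
    return todo
```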
The only thing I could implement to make sure that really every picture is downloaded is to wait for the image download to finish before mining the next page of metadata. As Fast Instagram Scraper is focused on speed and efficiency, I don't really like the idea of slowing it down that drastically. So in the end it's up to you to do either of those:

- wait
- flags. This is probably already the most elegant solution, but it slows everything down. You can find the image download script on my blog; it conveniently downloads from JSON files.
As I cannot change anything in the code base, I'm closing this issue for now.
When --save_media is True, once --max_posts is reached, the program exits without downloading the media.