althonos / InstaLooter

Another API-less Instagram pictures and videos downloader.
GNU General Public License v3.0

Instagram limitations \ Time Filtering \ Large Instagram Sets #206

Open irvinm opened 6 years ago

irvinm commented 6 years ago

Library version

Environment

Error description - runtime

I see there have been several recent issues discussing the new limitations imposed by Instagram, which effectively prevent this application from working on large sets of data. One of the recommendations was to use the "-t" option to limit the number of images downloaded in a session. It appears that simply loading all of the image info counts towards the limit, so even with the "-t" option no photos can be downloaded beyond the imposed limit.

On a side note, "-t" itself worked fine for me: I originally ran "-t 2017-12-31:2017-01-01" and only got 2017 photos with that job, then tried again with "-t 2016-12-31:2016-01-01" and only got 2016 photos, until I hit the 1000-photo limit.
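In case it helps anyone splitting a large account the same way, here is a rough sketch of generating those per-year "-t" windows (newest:oldest, matching the runs above). The account name, output directory, and exact CLI form in the comment are placeholders from memory, not taken from the docs:

```python
# Sketch: build per-year time filters like the ones used above.
def yearly_windows(first_year, last_year):
    """Yield "-t" filter strings of the form 'YYYY-12-31:YYYY-01-01', newest year first."""
    for year in range(last_year, first_year - 1, -1):
        yield "{}-12-31:{}-01-01".format(year, year)

# Each window is then passed to a separate run, e.g.
#   instalooter user SOME_ACCOUNT ./photos -t 2017-12-31:2017-01-01
print(list(yearly_windows(2016, 2017)))
# ['2017-12-31:2017-01-01', '2016-12-31:2016-01-01']
```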

In my case, assuming a "page" is about 50 items (or so I saw in the code), trying to grab older photos from an account would consistently give me the "Query rate exceeded" error around page 20 (which would be 1000 photos), and 0 photos would be downloaded. (I was able to successfully process those original photos in previous runs.)

One solution I see would be to have an option (if even possible) to start the page count at a certain number, or to build the same queue and ignore the first "x" pages in it. Then, if I run it once and it successfully processes 20 pages, I could wait for the cool-off period (I am seeing 30 minutes) and restart the process at page 19 or so. This would remove from InstaLooter the burden of tracking/recording the progress and retrying over time from the last known location.
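For what it's worth, the "ignore the first x pages in the queue" part can be sketched outside the library with a plain iterator wrapper. This is only a sketch: `pages` stands in for whatever page iterator InstaLooter builds internally, not an actual attribute or API name:

```python
import itertools

def skip_pages(pages, already_done):
    """Ignore the first `already_done` pages of a page iterator and yield the rest.

    Note: the skipped pages are still fetched lazily as the iterator advances
    (so they still count towards Instagram's limit); they are simply not
    processed/downloaded again.
    """
    return itertools.islice(pages, already_done, None)
```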

Reproducible test case

Expected behaviour

Allow an option to start a new run at a different starting page than 1, or have InstaLooter save its progress and continually retry (with a configurable retry timeout) so that progress is maintained and previously processed photos are not redone.

Actual behaviour

The application cannot get beyond around 1000 items, with or without the time filter option.

dustryder commented 5 years ago

This behaviour can be worked around by removing the exception raise on line 76 of pages.py and replacing it with:

```python
# wait out the rate limit instead of aborting; 1800 s = 30 minutes
# (make sure `time` is imported in that module)
print("Query rate exceeded (wait before next run)")
time.sleep(1800)
```
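If you would rather not patch the library, roughly the same wait-and-retry behaviour can be wrapped around a whole run from the outside. This is only a sketch: `run_once()` is a placeholder for however you invoke the download (CLI call, script, etc.), not an InstaLooter function, and the broad `except` assumes the rate limit surfaces as an exception or abort:

```python
import time

def run_with_cooldown(run_once, max_attempts=5, cooldown=1800):
    """Call `run_once()` until it succeeds, sleeping 30 minutes after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            run_once()
            return
        except Exception as exc:  # assumed: rate-limit failure raises here
            print("Attempt {} failed ({}); sleeping {} s".format(attempt, exc, cooldown))
            time.sleep(cooldown)
```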
distantstar commented 5 years ago

Honestly, if you don't mind downloading at a slower rate, you can iterate more slowly and stay under the query rate limit. Right now it iterates at an interval of 2 seconds per page. Extend that beyond 2 seconds and you can download sets of 1000 files or more.
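If you prefer not to touch the interval inside the package, the same slowdown can be imposed from the outside by sleeping between pages yourself. A minimal sketch, where `pages` is again just a placeholder for whatever page iterator is in use:

```python
import time

def throttle(pages, delay=10.0):
    """Yield pages from any iterator, waiting `delay` seconds between them
    (i.e. well above the default ~2 s per page mentioned above)."""
    for page in pages:
        yield page
        time.sleep(delay)
```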

jpegcoma commented 5 years ago

@distantstar, what interval do you think will work? I tried several, like 5 or 7 seconds, and I also tried changing accounts in the batch sections, but I still get "Query rate exceeded" for some of the files. Also, I'm not going deep; I only need the first photos, but I still have this issue.

distantstar commented 5 years ago

> @distantstar, what interval do you think will work? I tried several, like 5 or 7 seconds, and I also tried changing accounts in the batch sections, but I still get "Query rate exceeded" for some of the files. Also, I'm not going deep; I only need the first photos, but I still have this issue.

@jpegcoma I've been limiting the concurrent downloads to 50 with an interval of 300 seconds (5 minutes). I have a feeling a larger value may be better, but since that's fewer than 200 calls, a 5-minute interval may be good enough; if you still have issues, increase the interval to something like 600. You can find more information on Instagram Rate Limiting here, and here.

Pay particular attention to this part of the above URL:

To avoid rate limiting:

- Spread out queries evenly between two time intervals to avoid sending traffic in spikes.
- Use filters to limit the data response size and avoid calls that request overlapping data.
- Use the rate limiting header to dynamically balance your call volume.
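Put into code, "spread out queries evenly" just means pacing calls at a fixed interval rather than firing them back to back. A minimal sketch, where the 200-calls-per-hour figure is only illustrative (taken from the numbers above), not a documented Instagram constant:

```python
import time

def pace_evenly(calls, calls_per_hour=200):
    """Run callables no faster than `calls_per_hour`, spacing them evenly."""
    interval = 3600.0 / calls_per_hour  # 18 s between calls at 200/hour
    for call in calls:
        started = time.monotonic()
        call()
        elapsed = time.monotonic() - started
        if elapsed < interval:
            time.sleep(interval - elapsed)
```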

It's a much slower way of downloading the files, but at least in the end you get them, as opposed to not getting them at all. :)