Add download_carousel

stefco commented 3 years ago

Description

Commits fb0c968 and b51f725:

Adds a download_carousel method for Posts which allows you to download all media on carousel posts, i.e. posts with multiple images/videos, as raised in #105. Since this is a batch operation, you specify an output directory and a function for calculating the filename for each output image instead of specifying a single output filename. See the method documentation for details.
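To illustrate the output-directory-plus-filename-function idea, here is a minimal sketch that does not depend on instascrape. The names `plan_carousel_outputs` and `default_fname` are hypothetical, and the `(index, url)` callback signature is an assumption; the method docstring describes the real interface.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def default_fname(index: int, url: str) -> str:
    """Hypothetical naming rule: zero-padded position plus the URL's extension."""
    # Drop any query string, then take the suffix of the last path segment
    suffix = Path(url.split("?")[0]).suffix or ".jpg"
    return f"{index:02d}{suffix}"


def plan_carousel_outputs(
    urls: List[str],
    outdir: str,
    fname_func: Callable[[int, str], str] = default_fname,
) -> List[Tuple[str, Path]]:
    """Pair every media URL with the path it would be written to."""
    return [(url, Path(outdir) / fname_func(i, url)) for i, url in enumerate(urls)]


# Placeholder URLs, not real Instagram media links
plan = plan_carousel_outputs(
    ["https://example.com/a.jpg?x=1", "https://example.com/b.mp4"],
    "out",
)
```

Passing a different `fname_func` changes the naming scheme without touching the download loop, which is the reason for taking a function instead of a single filename.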

Also added a couple of supporting methods, though only one of them, parse_carousel_urls, is public; this method simply returns the video and image URLs for each item in the carousel, or None if the post is not a carousel. Again, see the docstring for details.

Also added the beginnings of a demo Jupyter notebook.

Fixes #105

Commit e88032e:

Post.get_recent_comments would raise a KeyError when using Selenium or a requests.Session object to scrape a Post due to slight differences in the structure of the resulting json_dict. I added an except block to handle this and try the alternative json_dict schema.
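The fallback pattern described above looks roughly like the sketch below. The key paths (`"primary"`, `"alternate"`, `"items"`) and the function name are placeholders, since the actual json_dict layouts are not shown in this thread.

```python
from typing import Any, Dict, List


def extract_comments(json_dict: Dict[str, Any]) -> List[Any]:
    """Hypothetical helper: try one schema, fall back to the other on KeyError."""
    try:
        # Layout assumed for the default scrape
        return json_dict["primary"]["comments"]
    except KeyError:
        # Layout assumed for Selenium / requests.Session scrapes
        return json_dict["alternate"]["items"]["comments"]
```

If neither layout matches, the second lookup still raises KeyError, so a genuinely unknown schema is not silently swallowed.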

Fixes #124

Commit f443435:

Add Profile.iter_posts to get a lazy iterator over posts, and reimplement Profile.get_posts (with the same API) using iter_posts.
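A minimal sketch of the lazy-iterator-plus-eager-wrapper arrangement, using a hypothetical stand-in class that yields shortcodes instead of scraping. The `amount` parameter is an assumption about the get_posts API.

```python
from itertools import islice
from typing import Iterator, List, Optional


class ProfileSketch:
    """Hypothetical stand-in for Profile; it holds shortcodes instead of scraping."""

    def __init__(self, shortcodes: List[str]) -> None:
        self._shortcodes = shortcodes
        self.fetched = 0  # counts how many posts were actually produced

    def iter_posts(self) -> Iterator[str]:
        # Generator: each post is produced only when the caller asks for it
        for code in self._shortcodes:
            self.fetched += 1
            yield code

    def get_posts(self, amount: Optional[int] = None) -> List[str]:
        # Eager wrapper: same results, built by draining the lazy iterator
        return list(islice(self.iter_posts(), amount))
```

Because get_posts only consumes as many items as requested, callers that need the first few posts do not pay for the rest.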

Fixes #127

Checklist

[x] I followed the guidelines in our Contributing document
[x] I added an explanation of my changes
[ ] I have written new tests for my changes, as applicable
[ ] I successfully ran tests with my changes locally

Additional notes (optional)

I have not written automated tests yet, but will do so soon.

kyrlon commented 2 years ago

I attempted to run the newly added download_carousel method on the following Google Instagram post, but ran into a TypeError.

from instascrape import *
from pathlib import Path

def insta_scrape(ig_links):
    for link in ig_links:
        post = link.split("?")[0] if "copy_link" in link else link
        post_folder = Path(post.split("/")[-2])
        post_folder.mkdir(parents=True, exist_ok=True)

        google_post = Post(post)
        google_post.download_carousel(str(post_folder), allow_non_carousel=True)

if __name__ == "__main__":
    link_list = list()   
    ig_link = "https://www.instagram.com/p/CXuAeZ1ltCa/?utm_source=ig_web_copy_link"
    link_list.append(ig_link)
    insta_scrape(link_list)

Running the code above, I get the following traceback:

$ py insta_scrape.py 
Traceback (most recent call last):
  File "insta_scrape.py", line 18, in <module>
    insta_scrape(link_list)
  File "insta_scrape.py", line 11, in insta_scrape
    google_post.download_carousel(str(post_folder), allow_non_carousel=True)
  File "...\insta_download\py_venv\lib\site-packages\instascrape\scrapers\post.py", line 213, in download_carousel  
    urls = self.parse_carousel_urls()
  File "..\insta_download\py_venv\lib\site-packages\instascrape\scrapers\post.py", line 157, in parse_carousel_urls
    is_videos = self._filter_get(self.flat_json_dict, self._IS_VIDEO_KEYS)
  File "..\insta_download\py_venv\lib\site-packages\instascrape\scrapers\post.py", line 133, in _filter_get        
    return [(k, dic[k]) for k in keys if k in dic]
  File "..\insta_download\py_venv\lib\site-packages\instascrape\scrapers\post.py", line 133, in <listcomp>
    return [(k, dic[k]) for k in keys if k in dic]
TypeError: argument of type 'NoneType' is not iterable
(py_venv)

Stepping through with the debugger, I noticed that self.flat_json_dict is None when parse_carousel_urls runs. I am not sure whether anyone else has come across this error.
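The failure can be reproduced without instascrape. This is a minimal sketch, assuming _filter_get amounts to the list comprehension shown in the traceback; the key name `"is_video"` is a placeholder.

```python
def filter_get(dic, keys):
    # Same expression as the line shown in the traceback
    return [(k, dic[k]) for k in keys if k in dic]


try:
    # `k in None` is what fails: None does not support membership tests
    filter_get(None, ["is_video"])
    raised = False
except TypeError:
    raised = True
```

With a real dictionary the same function returns only the keys that are present, so the error points at the None input, not at the comprehension itself.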

kyrlon commented 2 years ago

I had forgotten to call the scrape method before download_carousel. With the added google_post.scrape() line, the function now works as intended.

from pathlib import Path

from instascrape import Post


def insta_scrape(ig_links):
    for link in ig_links:
        # Strip the query string that "copy link" share URLs carry
        post = link.split("?")[0] if "copy_link" in link else link
        # Name the output folder after the post shortcode (second-to-last URL segment)
        post_folder = Path(post.split("/")[-2])
        post_folder.mkdir(parents=True, exist_ok=True)

        google_post = Post(post)
        # Without scrape(), flat_json_dict is still None and download_carousel raises TypeError
        google_post.scrape()
        google_post.download_carousel(str(post_folder), allow_non_carousel=True)


if __name__ == "__main__":
    ig_link = "https://www.instagram.com/p/CXuAeZ1ltCa/?utm_source=ig_web_copy_link"
    insta_scrape([ig_link])

chris-greening / instascrape

Add download_carousel #125
