chris-greening / instascrape

Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically
https://chris-greening.github.io/instascrape/
MIT License

Function get_posts always returns 12 posts #79

Closed Matteoo98 closed 3 years ago

Matteoo98 commented 3 years ago

Hi, sorry for bothering you again. As you can see from the title, every time I run get_posts on an Instagram profile it returns only 12 posts. My code is:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from instascrape import Profile

    sessionid = '********************'

    lista = []
    # headers = {"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
    #            "cookie": f"sessionid={os.environ.get('sessionid')};"}
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
        "cookie": f"sessionid={sessionid};"}
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    target = 'https://www.instagram.com/molteni_matteo/'
    # target = 'https://www.instagram.com/lubecreopratolacasteldisangro/'
    insta_profile = Profile(target)
    insta_profile.scrape(headers=headers)
    insta_profile.url = target
    list_post = insta_profile.get_posts(webdriver=driver, amount=13)
    print(insta_profile.followers)
    print("Numero post : " + str(len(list_post)))
    for profile_post in list_post:
        profile_post.scrape(headers=headers)
        lst = get_image_urls(profile_post)  # helper defined elsewhere in my project
        # for y in lst:
        #     print("LINK INSIDE POST : ", y)
        # html = profile_post.embed()
        # soup = BeautifulSoup(html, "html.parser")
        # href = None
        # for a in soup.find_all('a', href=True):
        #     href = a['href']
        #     break
        # sep = '?'
        # href = href.split(sep, 1)[0]
        # href = href + "?__a=1"
        # print('href : ', href)
        # response = requests.get(href, headers=headers).json()
        # json_data = json.loads(response.text)
        # print(response)
        # print(response.get('edge_sidecar_to_children'))
        post_dict = profile_post.to_dict(metadata=False)
        post_dict['images_links'] = lst
        lista.append(post_dict)
    for x in lista:
        print(x)
    print('fine')

Even if I pass None as the amount, or leave amount out entirely, this call still returns only 12 posts:

    list_post = insta_profile.get_posts(webdriver=driver, amount=13)

How can I solve this? And sorry to disturb you again.

tarob0ba commented 3 years ago

Hm... this is a tough one. While get_recent_posts can only return up to 12, you're using plain get_posts, so that shouldn't be a problem. I'll see if I can replicate this. Time to use the webdriver for the first time. šŸ˜„
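
For context, here's roughly how the two calls differ (just a sketch assuming the current instascrape API; the profile URL and session ID are placeholders):

    from selenium import webdriver
    from instascrape import Profile

    headers = {"User-Agent": "Mozilla/5.0", "cookie": "sessionid=<your session id>;"}  # placeholders
    profile = Profile("https://www.instagram.com/example_user/")  # placeholder URL
    profile.scrape(headers=headers)  # static request; the page JSON only embeds the most recent posts

    recent = profile.get_recent_posts()          # parsed from that JSON, so capped at 12
    driver = webdriver.Chrome()
    older = profile.get_posts(webdriver=driver)  # drives a real browser and scrolls to load more posts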

By the way, you're not disturbing anyone here. šŸ™‚

ping: @chris-greening

Matteoo98 commented 3 years ago

Thank you very much, I hope you'll find a solution because I really need this library for a project. Let me know as soon as you can if you find a solution to this problem šŸ™.

tarob0ba commented 3 years ago

I hate to say this, but I'm stuck right now: I don't have an Instagram account, so I can't confirm that a change "works" since I get an InstagramLoginRedirectError 100% of the time. I'll keep looking into the source, but you'll probably have to wait for Chris to help. Sorry for any inconvenience, but I'll keep trying.

Matteoo98 commented 3 years ago

In fact, I created a fake Instagram account for this purpose, because without an account the library doesn't work. Anyway, thanks for your help and don't worry, I'll wait for @chris-greening's help.

chris-greening commented 3 years ago

No disturbance at all!

You'll have to log in manually once the webdriver opens. get_posts has a login_first parameter that, if you pass True, will take you to the login page for 60 seconds prior to scraping. Try changing your get_posts call to this:

    list_post = insta_profile.get_posts(webdriver=driver, amount=13, login_first=True)

I also recommend adding a time.sleep for ~10 seconds between each scrape if you're able! I know it's inconvenient but Instagram started blocking me after a couple hundred scrapes if I didn't pause for a bit between each one.
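
In the loop from your snippet, that would look something like this (just a sketch; ~10 seconds is a rough number, not something I've tuned):

    import time

    for profile_post in list_post:
        profile_post.scrape(headers=headers)
        time.sleep(10)  # pause between requests so Instagram is less likely to block you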

Matteoo98 commented 3 years ago

This could be a real problem for me because I need the software to run autonomously and I can't log in manually... Is there a way to avoid the manual login? Or, if it's mandatory, how can I pass a username and password to the driver so it logs in automatically? Isn't it possible to use the header with the session ID, like in the scrape method? I've read that with Selenium I can fill in the login form by sending the username and password and clicking the login button. What do you think about adding this possibility to the library?

    driver.get(getdriver)
    driver.find_element_by_xpath("//input[@name='username']").send_keys(username)
    driver.find_element_by_xpath("//input[@name='password']").send_keys(password)
    driver.find_element_by_xpath("//button[contains(.,'Log in')]").click()

If I could pass a username and password to the get_posts method, and it contained the lines of code above, maybe this could work?

chris-greening commented 3 years ago

I've thought about implementing an automatic login script but the problem with simply getting the elements and sending the keys is that Instagram will know pretty easily that you're botting. They will likely require some sort of 2FA to verify you're a human using the platform and I've even heard of people's accounts getting temporarily suspended.

Unfortunately get_posts needs more than just the session ID because it's opening a browser and that browser needs to get authenticated with a unique session ID from Instagram which can only be acquired through logging in.

There are a couple things you can try though!

Unfortunately, I haven't really experimented with automated logins, so most of this is speculation. Webdriver support, like instascrape itself, is young, so it could be a while (if ever) before I support this out of the box. Definitely check out InstaPy though; it's a significantly more mature library and I'm pretty sure it has some form of login support that you might be able to adapt.

Let me know if you figure it out! I'll try to do some more research on my own, but I don't know if I'll have the time to put together a full solution.

Matteoo98 commented 3 years ago

Ok, thanks for your help. I think the problem is going to be bigger than my ability, but thank you for everything šŸ’Æ

Matteoo98 commented 3 years ago

Hi @chris-greening, after a lot of work I finally achieved what I was after: I was able to use the get_posts method to get all the posts on a page, not just 12. What I did was use the Cookie-Editor extension (https://chrome.google.com/webstore/detail/cookie-editor/hlkenndednhfkekhgcdicdfddnkalmdm) to export the cookies, as JSON, from a browser session where I was already logged in to Instagram (this could also be done with pickle). I saved the JSON in the project directory and then loaded those cookies into the driver that get_posts uses, so it runs in a session where I'm already authenticated. This is the code:

    import json

    driver.get('https://www.instagram.com')
    with open('C:\\Users\\matti\\git\\ProgettoLube\\ProgettoLube\\WebInspector\\cookie_instagram.json', 'r', newline='') as input_data:
        cookies = json.load(input_data)
    for i in cookies:
        driver.add_cookie(i)  # inject the logged-in session's cookies into the webdriver
    list_post = insta_profile.get_posts(webdriver=driver)

In the end the method returned a complete list of all the posts from an Instagram profile! What do you think? In my opinion this could be built into the library so the whole process is automatic, with no manual login. P.S. Done this way, Instagram doesn't get angry or suspicious.
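
Since I mentioned pickle: one way to skip the browser extension entirely (a rough sketch I haven't fully tested; the file name and the one-time manual login are assumptions) would be to save the cookies yourself after logging in once and then reload them on every later run:

    import pickle
    from selenium import webdriver

    COOKIE_FILE = 'instagram_cookies.pkl'  # hypothetical path

    def save_cookies(driver):
        # Run once: log in manually in the opened window, then persist the session cookies.
        driver.get('https://www.instagram.com')
        input('Log in in the browser window, then press Enter...')
        with open(COOKIE_FILE, 'wb') as f:
            pickle.dump(driver.get_cookies(), f)

    def load_cookies(driver):
        # Every later run: visit the site first, then restore the saved session cookies.
        driver.get('https://www.instagram.com')
        with open(COOKIE_FILE, 'rb') as f:
            for cookie in pickle.load(f):
                driver.add_cookie(cookie)

    driver = webdriver.Chrome()
    load_cookies(driver)  # or save_cookies(driver) the very first time
    # list_post = insta_profile.get_posts(webdriver=driver)

The first run still needs a manual login, but every run after that can reuse the saved session.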

chris-greening commented 3 years ago

Thanks so much for posting your findings, I'm glad you were able to find a solution!

I'm certainly gonna look into this, as it would open up new possibilities for the library. The only thing I would be worried about is Instagram checking the IP of the machine you're scraping from and realizing two machines in wildly different locations are using the same cookies; let me know if you run into problems with that! I'll do some digging myself.

Again, awesome job finding a solution!

alireza2281 commented 2 years ago

Hello, and thanks for your great work! I was just wondering if there have been any further updates on this problem, or any other (more automated) solutions besides downloading the login cookies and adding them to the driver in order to fetch all the posts.