chris-greening / instascrape

Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically
https://chris-greening.github.io/instascrape/
MIT License

Function get_posts always returns 12 posts #79

Closed Matteoo98 closed 3 years ago

Matteoo98 commented 3 years ago

Hi, sorry for bothering you again. As you can see from the title, every time I run get_posts on an Instagram profile it returns only 12 posts. My code is:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from instascrape import Profile

    sessionid = '********************'

    lista = []
    # headers = {"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
    #            "cookie": f"sessionid={os.environ.get('sessionid')};"}
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
        "cookie": f"sessionid={sessionid};"}
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    target = 'https://www.instagram.com/molteni_matteo/'
    # target = 'https://www.instagram.com/lubecreopratolacasteldisangro/'
    insta_profile = Profile(target)
    insta_profile.scrape(headers=headers)
    insta_profile.url = target
    list_post = insta_profile.get_posts(webdriver=driver, amount=13)
    print(insta_profile.followers)
    print("Numero post : " + str(len(list_post)))
    for profile_post in list_post:
        profile_post.scrape(headers=headers)
        lst = get_image_urls(profile_post)  # helper defined elsewhere in my project
        # for y in lst:
        #     print("LINK INSIDE POST : ", y)
        # html = profile_post.embed()
        # soup = BeautifulSoup(html, "html.parser")
        # href = None
        # for a in soup.find_all('a', href=True):
        #     href = a['href']
        #     break
        # sep = '?'
        # href = href.split(sep, 1)[0]
        # href = href + "?__a=1"
        # print('href : ', href)
        # response = requests.get(href, headers=headers).json()
        # json_data = json.loads(response.text)
        # print(response)
        # print(response.get('edge_sidecar_to_children'))
        post_dict = profile_post.to_dict(metadata=False)
        post_dict['images_links'] = lst
        lista.append(post_dict)
    for x in lista:
        print(x)
    print('fine')

Even if I pass None as the amount, or leave amount out entirely, this call still returns only 12 posts:

    list_post = insta_profile.get_posts(webdriver=driver, amount=13)

How can I solve this? And sorry to disturb you again.

tarob0ba commented 3 years ago

Hm... this is a tough one. While get_recent_posts can only return up to 12, you're using plain get_posts, so that shouldn't be a problem. I'll see if I can replicate this. Time to use the webdriver for the first time. šŸ˜„
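
For context, here's roughly how the two calls differ (just a sketch assuming the current instascrape API; the profile URL and session ID are placeholders):

    from selenium import webdriver
    from instascrape import Profile

    headers = {"User-Agent": "Mozilla/5.0", "cookie": "sessionid=<your session id>;"}  # placeholders
    profile = Profile("https://www.instagram.com/example_user/")  # placeholder URL
    profile.scrape(headers=headers)  # static request; the page JSON only embeds the most recent posts

    recent = profile.get_recent_posts()          # parsed from that JSON, so capped at 12
    driver = webdriver.Chrome()
    older = profile.get_posts(webdriver=driver)  # drives a real browser and scrolls to load more posts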

By the way, you're not disturbing anyone here. šŸ™‚

ping: @chris-greening

Matteoo98 commented 3 years ago

Thank you very much, I hope you'll find a solution because I really need this library for a project. Let me know as soon as you can if you find a solution to this problem šŸ™.

tarob0ba commented 3 years ago

I hate to say this, but I'm stuck right now: I don't have an Instagram account, so I can't confirm that a change "works" since I get an InstagramLoginRedirectError 100% of the time. I'll keep looking into the source, but you'll probably have to wait for Chris to help. Sorry for any inconvenience, but I'll keep trying.

Matteoo98 commented 3 years ago

In fact, I created a fake Instagram account for this purpose, because without an account the library doesn't work. Anyway, thanks for your help and don't worry, I'll wait for @chris-greening's help.

chris-greening commented 3 years ago

No disturbance at all!

You'll have to log in manually once the webdriver opens. get_posts has a login_first parameter that, if you pass True, will take you to the login page for 60 seconds prior to scraping. Try changing your get_posts call to this:

    list_post = insta_profile.get_posts(webdriver=driver, amount=13, login_first=True)

I also recommend adding a time.sleep for ~10 seconds between each scrape if you're able! I know it's inconvenient but Instagram started blocking me after a couple hundred scrapes if I didn't pause for a bit between each one.
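
In the loop from your snippet, that would look something like this (just a sketch; ~10 seconds is a rough number, not something I've tuned):

    import time

    for profile_post in list_post:
        profile_post.scrape(headers=headers)
        time.sleep(10)  # pause between requests so Instagram is less likely to block you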

Matteoo98 commented 3 years ago

This could be a real problem for me because I need the software to run autonomously and I can't log in manually... Is there a way to avoid the manual login? Or, if it's mandatory, how can I pass a username and password to the driver so it logs in automatically? Isn't it possible to use the header with the session ID, like in the scrape method? I've read that with Selenium I can fill in the login form by sending the username and password and clicking the login button. What do you think about adding this possibility to the library?

    driver.get(getdriver)
    driver.find_element_by_xpath("//input[@name='username']").send_keys(username)
    driver.find_element_by_xpath("//input[@name='password']").send_keys(password)
    driver.find_element_by_xpath("//button[contains(.,'Log in')]").click()

If I could pass a username and password to the get_posts method, and it contained the lines of code above, maybe this could work?

chris-greening commented 3 years ago

I've thought about implementing an automatic login script but the problem with simply getting the elements and sending the keys is that Instagram will know pretty easily that you're botting. They will likely require some sort of 2FA to verify you're a human using the platform and I've even heard of people's accounts getting temporarily suspended.

Unfortunately get_posts needs more than just the session ID because it's opening a browser and that browser needs to get authenticated with a unique session ID from Instagram which can only be acquired through logging in.

There are a couple things you can try though!

Unfortunately, I haven't really experimented with automated logins, so most of this is speculation. Webdriver support, like instascrape itself, is young, so it could be a while (if ever) before I support this out of the box. Definitely check out InstaPy though; it's a significantly more mature library and I'm pretty sure it has some form of login support that you might be able to adapt.

Let me know if you figure it out! I'll try to do some more research on my own, but I don't know if I'll have the time to put together a full solution.

Matteoo98 commented 3 years ago

Ok, thanks for your help. I think the problem is going to be bigger than my ability, but thank you for everything šŸ’Æ

Matteoo98 commented 3 years ago

Hi @chris-greening, after a lot of work I finally achieved what I was after: I was able to use the get_posts method to get all the posts on a page, not just 12. What I did was use the Cookie-Editor extension (https://chrome.google.com/webstore/detail/cookie-editor/hlkenndednhfkekhgcdicdfddnkalmdm) to export the cookies, as JSON, from a browser session where I was already logged in to Instagram (this could also be done with pickle). I saved the JSON in the project directory and then loaded those cookies into the driver that get_posts uses, so it runs in a session where I'm already authenticated. This is the code:

    import json

    driver.get('https://www.instagram.com')
    with open('C:\\Users\\matti\\git\\ProgettoLube\\ProgettoLube\\WebInspector\\cookie_instagram.json', 'r', newline='') as input_data:
        cookies = json.load(input_data)
    for i in cookies:
        driver.add_cookie(i)  # inject the logged-in session's cookies into the webdriver
    list_post = insta_profile.get_posts(webdriver=driver)

In the end the method returned a complete list of all the posts from an Instagram profile! What do you think? In my opinion this could be built into the library so the whole process is automatic, with no manual login. P.S. Done this way, Instagram doesn't get angry or suspicious.
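
Since I mentioned pickle: one way to skip the browser extension entirely (a rough sketch I haven't fully tested; the file name and the one-time manual login are assumptions) would be to save the cookies yourself after logging in once and then reload them on every later run:

    import pickle
    from selenium import webdriver

    COOKIE_FILE = 'instagram_cookies.pkl'  # hypothetical path

    def save_cookies(driver):
        # Run once: log in manually in the opened window, then persist the session cookies.
        driver.get('https://www.instagram.com')
        input('Log in in the browser window, then press Enter...')
        with open(COOKIE_FILE, 'wb') as f:
            pickle.dump(driver.get_cookies(), f)

    def load_cookies(driver):
        # Every later run: visit the site first, then restore the saved session cookies.
        driver.get('https://www.instagram.com')
        with open(COOKIE_FILE, 'rb') as f:
            for cookie in pickle.load(f):
                driver.add_cookie(cookie)

    driver = webdriver.Chrome()
    load_cookies(driver)  # or save_cookies(driver) the very first time
    # list_post = insta_profile.get_posts(webdriver=driver)

The first run still needs a manual login, but every run after that can reuse the saved session.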

chris-greening commented 3 years ago

Thanks so much for posting your findings, I'm glad you were able to find a solution!

I'm certainly gonna look into this, as it would open up new possibilities for the library. The only thing I would be worried about is Instagram checking the IP of the machine you're scraping from and realizing two machines in wildly different locations are using the same cookies; let me know if you run into problems with that! I'll do some digging myself.

Again, awesome job finding a solution!

alireza2281 commented 2 years ago

Hello, and thanks for your great work! I was just wondering if there have been any further updates on this problem, or any other (more automated) solutions besides downloading the login cookies and adding them to the driver in order to fetch all the posts.