brutalsavage / facebook-post-scraper

Facebook Post Scraper 🕵️🖱️

Invalid Session ID (Large amount of posts) #32

Open pixml27 opened 4 years ago

pixml27 commented 4 years ago

First of all, thanks for this scraper! My problem is that when I download a large number of posts (> 4000), with 5-10 comments each, Chrome just crashes.

Initially, I got an "invalid session ID" error when expanding collapsed comments. Then I changed the code to open the comments during the scroll function, and the error started appearing there instead (invalid session ID again).

I read a lot of threads on Stack Overflow; they recommend adding various options to Chrome, and I tried them all. Many answers also suggest giving Chrome more memory (when using Docker), but I just run the script directly. It seems to me this problem is related to memory: Chrome closes because of too many images, media, etc. Can you help me somehow? Have you run into this, and have you tested the script on large amounts of data?

[Screenshot 2020-08-20 at 01:09:02]
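(For long runs like the one described above, one common mitigation is to scrape in batches and restart the driver between batches, so Chrome's memory footprint never accumulates across the whole run. A minimal sketch, assuming each post can be revisited via its own permalink URL; collect_post is a hypothetical stand-in for the scraper's extraction logic:)

from selenium import webdriver
from selenium.common.exceptions import WebDriverException

def scrape_in_batches(post_urls, batch_size=200):
    results = []
    for start in range(0, len(post_urls), batch_size):
        driver = webdriver.Chrome()  # fresh Chrome session for every batch
        try:
            for url in post_urls[start:start + batch_size]:
                driver.get(url)
                results.append(collect_post(driver))  # hypothetical extraction helper
        except WebDriverException:
            pass  # session died mid-batch; the next batch gets a new one
        finally:
            driver.quit()  # release Chrome's memory before continuing
    return results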

brutalsavage commented 4 years ago

Unfortunately I haven't tested with a large number of posts. Chrome does have a lot of memory issues. If anyone has a solution, feel free to comment.

webcoderz commented 3 years ago

here's some options, including how to run it headless, which should help:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")  # linux only
chrome_options.add_argument("--headless")
chrome_options.headless = True  # also works

driver = webdriver.Chrome(options=chrome_options)

brutalsavage commented 3 years ago

Running it headless looks like a potential solution.

You should add chrome_options.add_argument("--disable-gpu") if you are running on Windows.

source: https://stackoverflow.com/questions/53657215/running-selenium-with-headless-chrome-webdriver
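(Putting the two platform notes together, the flags can be applied conditionally; a sketch using sys.platform:)

import sys
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
if sys.platform.startswith("win"):
    chrome_options.add_argument("--disable-gpu")  # Windows-specific workaround
elif sys.platform.startswith("linux"):
    chrome_options.add_argument("--no-sandbox")   # Linux-only, per the comment above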

pixml27 commented 3 years ago

Thanks for the advice, but I tried it all in different combinations. Here are all the options I found related to this problem, but none of them work:

from selenium.webdriver.chrome.options import Options

option = Options()
option.add_argument("--no-sandbox")
option.add_argument("--enable-automation")
# option.add_argument("start-maximized")
option.add_argument("--disable-extensions")
option.headless = True
option.add_argument("--disable-dev-shm-usage")

# Pass 1 to allow notifications and 2 to block them
option.add_experimental_option("prefs", {
    "profile.default_content_setting_values.notifications": 1,
    "profile.managed_default_content_settings.images": 2,  # block images to save memory
    "disk-cache-size": 16000
})
option.add_experimental_option("excludeSwitches", ["enable-automation"])
option.add_experimental_option("useAutomationExtension", False)

P.S. I'm running on Linux.

P.P.S. I tried emulating scrolling down the page in Chrome using an RPA platform (basically just pressing the down key, without human intervention). Facebook stopped sending new posts (or updating the page) somewhere after 300 scrolls (or Chrome stopped loading them), but Chrome did not crash. So the problem may be Facebook's protection, but then why does Chrome crash in the scraper?
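(One way to detect the stall described in the P.P.S. is to compare the page height between scrolls and stop once it no longer grows. A minimal sketch, assuming driver is an active session on the profile page:)

import time

def scroll_until_stale(driver, max_scrolls=400, pause=2.0):
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give Facebook time to load the next chunk of posts
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded: end of feed, or rate limiting kicked in
        last_height = new_height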