JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

Facebook Scraper fails (on user and posts) #960

Closed nikbpetrov closed 1 year ago

nikbpetrov commented 1 year ago

Describe the bug

Facebook Scraper currently fails. I've tested retrieving a user (which raises an error during HTML parsing) and posts (the generator yields nothing).

I suspect the issue is bigger, possibly due to changes to FB profile pages, but it could also be a connection issue.

How to reproduce

import snscrape.modules.facebook as snfb
import logging
logging.basicConfig(level=logging.DEBUG)

user = snfb.FacebookUserScraper("zuck")._get_entity()

This returns an error.

Similarly,

import snscrape.modules.facebook as snfb
import logging
logging.basicConfig(level=logging.DEBUG)

for i, post in enumerate(snfb.FacebookUserScraper("zuck").get_items()):
    print(i, post)

This yields nothing except snscrape's log output, which is the same as for the command above and is pasted below.

Expected behaviour

Successfully returning User or Post information.

Screenshots and recordings

No response

Operating system

Windows 10 Pro, Version: 21H2, OS build: 19044.2965

Python version: output of python3 --version

3.10.11

snscrape version: output of snscrape --version

0.6.2.20230320

Scraper

facebook-user

How are you using snscrape?

Module (import snscrape.modules.something in Python code)

Backtrace

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[1], line 5
      2 import logging
      3 logging.basicConfig(level=logging.DEBUG)
----> 5 user = snfb.FacebookUserScraper("zuck")._get_entity()
      6 user

File d:\Anaconda3\envs\smscraping2\lib\site-packages\snscrape\modules\facebook.py:237, in FacebookUserScraper._get_entity(self)
    234     return
    236 handleDiv = handleDivPattern.search(r.text)
--> 237 handle = handlePattern.search(handleDiv.group(0))
    238 kwargs['username'] = handle.group(1)
    240 nameVerifiedMarkup = nameVerifiedMarkupPattern.search(r.text)

AttributeError: 'NoneType' object has no attribute 'group'
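
The traceback above shows `handleDivPattern.search(r.text)` returning `None`, after which `.group(0)` raises. A minimal sketch of that failure mode and one way to guard against it follows. The two patterns and the `extract_handle` helper are hypothetical stand-ins for illustration; they are not the patterns or code used in snscrape.

```python
import re

# Hypothetical patterns for illustration only; NOT the ones defined in snscrape.
HANDLE_DIV_PATTERN = re.compile(r'<div id="handle">.*?</div>', re.DOTALL)
HANDLE_PATTERN = re.compile(r'@([\w.]+)')


def extract_handle(html):
    """Return the handle found in html, or None if the expected markup is absent."""
    handle_div = HANDLE_DIV_PATTERN.search(html)
    if handle_div is None:
        # re.search() returns None when nothing matches; calling .group() on
        # that None is what produces the AttributeError in the traceback.
        return None
    handle = HANDLE_PATTERN.search(handle_div.group(0))
    return handle.group(1) if handle else None
```

With a guard like this, a page that lacks the expected markup (for example a logged-out view) yields `None` instead of an unhandled exception, which the caller can then log or report.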

Log output

INFO:snscrape.modules.facebook:Retrieving initial data
INFO:snscrape.base:Retrieving https://www.facebook.com/zuck/
DEBUG:snscrape.base:... with headers: {'User-Agent': 'Mozilla/5.0 (X11; Linux i686; rv:78.0) Gecko/20100101 Firefox/78.0', 'Accept-Language': 'en-US,en;q=0.5'}
DEBUG:snscrape.base:... with environmentSettings: {'proxies': OrderedDict(), 'stream': False, 'verify': True, 'cert': None}
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.facebook.com:443
DEBUG:snscrape.base:Connected to: ('157.240.9.35', 443)
DEBUG:snscrape.base:Connection cipher: ('TLS_CHACHA20_POLY1305_SHA256', 'TLSv1.3', 256)
DEBUG:urllib3.connectionpool:https://www.facebook.com:443 "GET /zuck/ HTTP/1.1" 200 None
INFO:snscrape.base:Retrieved https://www.facebook.com/zuck/: 200
DEBUG:snscrape.base:... with response headers: {'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'critical-ch': 'Sec-CH-UA-Model', 'accept-ch-lifetime': '4838400', 'accept-ch': 'Sec-CH-Prefers-Color-Scheme,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Platform-Version', 'Link': '<https://www.facebook.com/zuck>; rel="canonical"', 'report-to': '{"max_age":86400,"endpoints":[{"url":"https:\\/\\/www.facebook.com\\/browser_reporting\\/?minimize=0"}],"group":"coep_report"}, {"max_age":259200,"endpoints":[{"url":"https:\\/\\/www.facebook.com\\/ajax\\/comet_error_reports\\/?device_level=unknown"}]}', 'x-fb-rlafr': '0', 'content-security-policy': "default-src data: blob: 'self' https://*.fbsbx.com 'unsafe-inline' *.facebook.com *.fbcdn.net 'unsafe-eval';script-src *.facebook.com *.fbcdn.net 'unsafe-inline' blob: data: 'self' 'unsafe-eval';style-src *.fbcdn.net data: *.facebook.com 'unsafe-inline';connect-src *.facebook.com facebook.com *.fbcdn.net wss://*.facebook.com:* wss://*.fbcdn.net attachment.fbsbx.com blob: *.cdninstagram.com 'self' http://localhost:3103 wss://gateway.facebook.com wss://edge-chat.facebook.com wss://snaptu-d.facebook.com wss://kaios-d.facebook.com/ *.fbsbx.com;font-src data: *.facebook.com *.fbcdn.net *.fbsbx.com;img-src *.fbcdn.net *.facebook.com data: https://*.fbsbx.com facebook.com *.cdninstagram.com fbsbx.com fbcdn.net blob: android-webview-video-poster: *.oculuscdn.com;media-src *.cdninstagram.com blob: *.fbcdn.net *.fbsbx.com www.facebook.com *.facebook.com data:;frame-src *.facebook.com *.fbsbx.com fbsbx.com data: *.fbcdn.net;worker-src blob: *.facebook.com data:;block-all-mixed-content;upgrade-insecure-requests;report-uri https://www.facebook.com/csp/reporting/?m=c&minimize=0;", 'document-policy': 'force-load-at-top', 'permissions-policy': 'accelerometer=(), ambient-light-sensor=(), bluetooth=(), gyroscope=(), hid=(), idle-detection=(), magnetometer=(), midi=(), payment=(), screen-wake-lock=(), serial=(), usb=()', 
'cross-origin-resource-policy': 'same-origin', 'cross-origin-embedder-policy-report-only': 'require-corp;report-to="coep_report"', 'cross-origin-opener-policy': 'same-origin-allow-popups', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, must-revalidate', 'Expires': 'Sat, 01 Jan 2000 00:00:00 GMT', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '0', 'X-Frame-Options': 'DENY', 'Strict-Transport-Security': 'max-age=15552000; preload', 'Content-Type': 'text/html; charset="utf-8"', 'X-FB-Debug': '/ORjyVwmFDIk8PHauQajIYbj3Fnm17bmXDYVU3xl4b51HHEm3cq8VFMf+UWR5N5/DkXR1MrgjAA/YnDY+iqPAA==', 'Date': 'Tue, 06 Jun 2023 09:19:15 GMT', 'Alt-Svc': 'h3=":443"; ma=86400', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive'}
DEBUG:snscrape.base:https://www.facebook.com/zuck/ retrieved successfully

Dump of locals

No response

Additional context

Obviously, the HTML parsing fails.

The problem seems to be downstream of the _initial_page() result, which is essentially response.text and soup. For me, both contain a logged-out user's view with a prompt to accept cookies. Even if cookies are accepted, I think little information is displayed (e.g. no posts).
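
One rough way to check whether a fetched page is a logged-out or cookie-consent view is to look for telltale phrases in the HTML. The sketch below assumes the marker strings listed in `LOGGED_OUT_MARKERS`; they are guesses chosen for illustration, not strings confirmed to appear on the real page, and `find_logged_out_markers` is a hypothetical helper that is not part of snscrape.

```python
# Marker phrases are assumptions for illustration; the real page may word these differently.
LOGGED_OUT_MARKERS = ("log in", "create new account", "allow the use of cookies")


def find_logged_out_markers(html, markers=LOGGED_OUT_MARKERS):
    """Return the markers that occur in html, compared case-insensitively."""
    lowered = html.lower()
    return [marker for marker in markers if marker in lowered]
```

Passing the response text to this helper and logging the result would show at a glance whether the scraper received a login wall rather than profile content.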

I'm not sure whether this is a connection problem or the result of changes to the FB platform. Either way, if you can provide hints on how I can help resolve this, I'd be happy to!

JustAnotherArchivist commented 1 year ago

#523