brutalsavage / facebook-post-scraper

Facebook Post Scraper 🕵️🖱️
GNU General Public License v3.0

Fixed _extract_post_id returning null values caused by multiple items in page with class _5pcq #21

Open ongzexuan opened 4 years ago

ongzexuan commented 4 years ago

_extract_post_id uses the class _5pcq to get the post ID. This sometimes fails when an item contains several elements with that class, some of which have an effectively empty href attribute. This PR proposes a fix: check that the href attribute contains a URL (beginning with '/') before extracting.
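
The proposed check can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: select_post_hrefs is a hypothetical helper name, and the hrefs in the example are shortened stand-ins for real values (query strings omitted).

```python
from typing import Iterable, List, Optional


def select_post_hrefs(hrefs: Iterable[Optional[str]]) -> List[str]:
    """Keep only hrefs that look like site-relative URLs (start with '/').

    Hypothetical helper for illustration: it drops missing hrefs and
    placeholders such as '#'.
    """
    return [href for href in hrefs if href and href.startswith("/")]


# Shortened stand-ins for the href values of the _5pcq anchors:
# one timestamp link to the post, one privacy indicator with href="#".
sample_hrefs = ["/TheStraitsTimes/posts/10157114673327115", "#"]

print(select_post_hrefs(sample_hrefs))
# ['/TheStraitsTimes/posts/10157114673327115']
```

With BeautifulSoup, the input could be built as (a.get("href") for a in item.find_all(class_="_5pcq")), so an anchor whose href is "#" never reaches the extraction step.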

Sample output from printing item.find_all(class_="_5pcq") is below. The last element in the list has '#' as its href attribute, which causes the resulting post_id to be '#'.

To replicate: python scraper.py -p TheStraitsTimes -l 1

<a class="_5pcq" href="/TheStraitsTimes/posts/10157114673327115?__xts__%5B0%5D=68.ARAQBh0K_NMj_mQAANUH_3XvHEDd3zLc83FLEcu4VcfDAdkM6z1PAP4Izat-cL4tQmNTMr_W875cfYO3vqYneCqXcjuRt9Q1tiYK64NKoaEUtHoyIyAjcZi6jHtUrCB60YZfPvwidqL6Aw6Vm7yIdE7amIjP-yTjI25iMi-EH7xYHzCLxG1U83eUuG-L4xX73BaqcA8MtjD6aeI-EFfelvwRVHDV5GlwwgN2cGDrcv5_--KTGPV8mNO9UFtcj4BdxBG45bb4QZrpTE-PxmdnjHAIjbauy89o3zXPRG8t5LsfThBfy5UYs0M3PcVsiJi8UJswS-_QJDDTwFMnozEp&amp;__tn__=-R" target=""><abbr class="_5ptz timestamp livetimestamp" data-shorten="1" data-utime="1591800911" title="Wednesday, June 10, 2020 at 7:55 AM"><span class="timestampContent" id="js_6">1 hr</span></abbr></a>

<a aria-label="Public" class="uiStreamPrivacy inlineBlock fbStreamPrivacy fbPrivacyAudienceIndicator _5pcq" data-hover="tooltip" data-tooltip-content="Public" href="#" role="img"><i class="lock img sp_sG2S1OTONin sx_b0665c"></i></a>

<a class="_5pcq" href="/TheStraitsTimes/posts/10157114743227115?__xts__%5B0%5D=68.ARCAUPaCRHFNlhpvP2W3jDKjTebqzmTZplSSOjw7Q6sLY5VjEDPitgFQ1kYPbbGkhEiMNdN4ZLR2BjaCFdWQe5V3pDbTZ73LXDRBsFjuGX_WX2BFnx0r1xDjP2OXiYNx9B1YJOEmnVbPJg6M817WmCRTUmSjsCECgHKDAaLin8z7bP3s0XjTqaEXxtmINF7Beqwi4lqMhx8D8HQG5rgZqFzCjMOpo8s_glZV36SHwX1z2fFLpF4iudosAK-005XvhBBIfs66n5UZe9AsQmvd0QsMbjfVQIN_JqGY4-mn8VjW8XjZRzBKFEUCir2efcX5bAitc0MVnQ2Fdn0Pdzyj&amp;__tn__=-R" target=""><abbr class="_5ptz timestamp livetimestamp" data-shorten="1" data-utime="1591803018" title="Wednesday, June 10, 2020 at 8:30 AM"><span class="timestampContent" id="js_9">1 hr</span></abbr></a>

<a aria-label="Public" class="uiStreamPrivacy inlineBlock fbStreamPrivacy fbPrivacyAudienceIndicator _5pcq" data-hover="tooltip" data-tooltip-content="Public" href="#" role="img"><i class="lock img sp_sG2S1OTONin sx_b0665c"></i></a>