apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

A few social handles are not matched #525

Open metalwarrior665 opened 4 years ago

metalwarrior665 commented 4 years ago

The following URLs are not matched:

https://www.linkedin.com/company/delegatus
https://www.facebook.com/pages/category/Lawyer---Law-Firm/Delegatus-services-juridiques-inc-131011223614905/
https://www.facebook.com/pages/KinEssor-Groupe-Conseil/208264345877578

Reproduce:

// Returns false: the pattern only accepts linkedin.com/in/ profile paths, not /company/ paths.
(new RegExp('(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:(?:[a-z]+\\.)?linkedin\\.com\\/in\\/)([a-z0-9\\-_%]{2,60})(?![a-z0-9\\-_%])(?:/)?')).test('https://www.linkedin.com/company/delegatus');

const FACEBOOK_RESERVED_PATHS = 'rsrc\\.php|apps|groups|events|l\\.php|friends|images|photo.php|chat|ajax|dyi|common|policies|login|recover|reg|help|security|messages|marketplace|pages|live|bookmarks|games|fundraisers|saved|gaming|salesgroups|jobs|people|ads|ad_campaign|weather|offers|recommendations|crisisresponse|onthisday|developers|settings|connect|business|plugins|intern|sharer';

const FACEBOOK_REGEX_STRING = `(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:www.)?(?:facebook.com|fb.com)\\/(?!(?:${FACEBOOK_RESERVED_PATHS})(?:[\\'\\"\\?\\.\\/]|$))(profile\\.php\\?id\\=[0-9]{3,20}|(?!profile\\.php)[a-z0-9\\.]{5,51})(?![a-z0-9\\.])(?:/)?`;

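// Returns false: "pages" is listed in FACEBOOK_RESERVED_PATHS, so the negative lookahead rejects the URL.
(new RegExp(FACEBOOK_REGEX_STRING)).test('https://www.facebook.com/pages/category/Lawyer---Law-Firm/Delegatus-services-juridiques-inc-131011223614905/');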
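
The two checks above can be combined into one script that also prints the first path segment of each URL. This may help show why nothing matches: the LinkedIn pattern only accepts `/in/` paths, and `pages` appears in `FACEBOOK_RESERVED_PATHS`. This is only a diagnostic sketch. The patterns are copied from the reproduction, while `firstPathSegment`, `urls`, and `report` are hypothetical names and not part of Crawlee.

```javascript
// Patterns copied verbatim from the reproduction above.
const LINKEDIN_PROFILE_REGEX = new RegExp('(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:(?:[a-z]+\\.)?linkedin\\.com\\/in\\/)([a-z0-9\\-_%]{2,60})(?![a-z0-9\\-_%])(?:/)?');

const FACEBOOK_RESERVED_PATHS = 'rsrc\\.php|apps|groups|events|l\\.php|friends|images|photo.php|chat|ajax|dyi|common|policies|login|recover|reg|help|security|messages|marketplace|pages|live|bookmarks|games|fundraisers|saved|gaming|salesgroups|jobs|people|ads|ad_campaign|weather|offers|recommendations|crisisresponse|onthisday|developers|settings|connect|business|plugins|intern|sharer';

const FACEBOOK_REGEX = new RegExp(`(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:www.)?(?:facebook.com|fb.com)\\/(?!(?:${FACEBOOK_RESERVED_PATHS})(?:[\\'\\"\\?\\.\\/]|$))(profile\\.php\\?id\\=[0-9]{3,20}|(?!profile\\.php)[a-z0-9\\.]{5,51})(?![a-z0-9\\.])(?:/)?`);

// Hypothetical helper (not part of Crawlee): first path segment of a URL.
const firstPathSegment = (url) => new URL(url).pathname.split('/')[1];

// The three URLs reported in this issue.
const urls = [
    'https://www.linkedin.com/company/delegatus',
    'https://www.facebook.com/pages/category/Lawyer---Law-Firm/Delegatus-services-juridiques-inc-131011223614905/',
    'https://www.facebook.com/pages/KinEssor-Groupe-Conseil/208264345877578',
];

// One row per URL: its first path segment and whether either pattern matches.
const report = urls.map((url) => ({
    url,
    firstSegment: firstPathSegment(url),
    matchesLinkedIn: LINKEDIN_PROFILE_REGEX.test(url),
    matchesFacebook: FACEBOOK_REGEX.test(url),
}));

console.log(report);
```

Neither pattern uses the `g` flag, so calling `.test()` repeatedly on the same `RegExp` object carries no `lastIndex` state between URLs.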

oklinov commented 1 month ago

Another one not matched: https://www.facebook.com/pages/U-Trumpety/394020690779011

// Returns null: match() converts the string to a RegExp, and "pages" is a reserved path.
'https://www.facebook.com/pages/U-Trumpety/394020690779011'.match(FACEBOOK_REGEX_STRING);
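
One possible direction, offered only as a sketch, is a separate pattern dedicated to `facebook.com/pages/...` URLs, leaving the reserved-path list untouched. The pattern and helper below are assumptions made for illustration: `FACEBOOK_PAGES_REGEX` and `parseFacebookPageUrl` are hypothetical names, not part of Crawlee, and the accepted URL shapes are inferred only from the examples reported in this issue.

```javascript
// Hypothetical pattern, NOT part of Crawlee: matches only facebook.com/pages/... URLs.
// Shapes assumed from the URLs in this issue:
//   /pages/<name>/<numeric id>
//   /pages/category/<category>/<name>/
const FACEBOOK_PAGES_REGEX = /(?<!\w)(?:https?:\/\/)?(?:www\.)?(?:facebook\.com|fb\.com)\/pages\/(?:category\/[a-z0-9\-_.%]+\/)?([a-z0-9\-_.%]+)(?:\/([0-9]{3,20}))?\/?/i;

// Hypothetical helper: returns the page name and numeric ID (null if absent),
// or null when the URL is not a /pages/ URL.
function parseFacebookPageUrl(url) {
    const match = url.match(FACEBOOK_PAGES_REGEX);
    if (!match) return null;
    return { name: match[1], id: match[2] || null };
}

console.log(parseFacebookPageUrl('https://www.facebook.com/pages/U-Trumpety/394020690779011'));
console.log(parseFacebookPageUrl('https://www.facebook.com/pages/KinEssor-Groupe-Conseil/208264345877578'));
console.log(parseFacebookPageUrl('https://www.facebook.com/pages/category/Lawyer---Law-Firm/Delegatus-services-juridiques-inc-131011223614905/'));
```

The `i` flag is used here because page names in the reported URLs contain uppercase letters. A page whose name is literally `category` would be misread by this sketch, so it is not a drop-in fix.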