Open yohikofox opened 2 years ago
I had a look at the website -- it has Cloudflare running which protects from bots/scraping.
You might consider trying:
const Puppeteer = require("puppeteer");
const { captureHeapSnapshot, findObjectsWithProperties } = require("puppeteer-heap-snapshot");
const randomUseragent = require("random-useragent"); // npm install random-useragent

const start = async () => {
  const browser = await Puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Replace the default headless user agent with a randomly picked one.
    const agent = randomUseragent.getRandom();
    await page.setUserAgent(`${agent}`);
    await page.goto("https://www.sarenza.com");
    // Capture the page's heap and list the objects that carry an `href` property.
    const heapSnapshot = await captureHeapSnapshot(await page.target());
    console.log("heapSnapshot:", findObjectsWithProperties(heapSnapshot, ["href"]));
  } finally {
    // Close the browser even if navigation or the snapshot fails.
    await browser.close();
  }
};

start().catch(console.error);
Remember that headless Puppeteer will, by default, let the requested server know that it's a headless version of Chrome (for example through its user agent string). Servers and services like Cloudflare can block this very easily.
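To illustrate the point above: headless Chrome's default user agent contains the token "HeadlessChrome", which is one of the simplest signals a server can check. The helper below is a hypothetical sketch (the name looksHeadless and the sample user agent strings are made up for illustration), and it is not a description of what Cloudflare actually does, which likely relies on many more signals.

```javascript
// Hypothetical helper: reports whether a user agent string advertises headless Chrome.
// This only mirrors the most basic check a server could run.
function looksHeadless(userAgent) {
  return /HeadlessChrome/i.test(userAgent);
}

// Sample strings for illustration only; real version numbers will differ.
const headlessUA =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/119.0.0.0 Safari/537.36";
const regularUA =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36";

console.log(looksHeadless(headlessUA)); // true
console.log(looksHeadless(regularUA)); // false
```

Overriding the user agent with page.setUserAgent, as in the snippet above, removes this particular signal but not necessarily the others.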
Sorry for the delay on the reply here - I haven't had much time to look into this issue. From what I can see, puppeteer-heap-snapshot simply does not know how to de-serialize the Location type from the heap snapshot. I'm happy to accept PRs if anyone wants to tackle understanding the datatype and de-serializing it. It's surprising that whatever this data is has its own data type as opposed to a primitive string or object.
You mean the Location DOM object? It looks like src/build-object.ts does not handle any DOM objects, so I expect that it will fail on any DOM reference in JS. Vue/React apps will not have this issue since they have a virtual DOM and cannot* reference DOM elements directly.
Calling .toString() on any unknown object is safe. It will give you [object NAME]. If that is not useful, then you have to have specific code for that object. You are also missing all of the built-in JS objects: TypedArray, Temporal, Date, Map, Set, etc. (maybe I missed some here).
* it's a simplification...
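As a rough illustration of the [object NAME] fallback on live JavaScript values (not on heap snapshot nodes, which the library would have to handle separately): some built-ins such as Date override toString, so the sketch below goes through Object.prototype.toString to get the tag form consistently. The function name describeUnknown is hypothetical and not part of puppeteer-heap-snapshot.

```javascript
// Hypothetical fallback: returns the "[object NAME]" tag for any value,
// even when the value's own toString() is overridden (e.g. Date).
function describeUnknown(value) {
  return Object.prototype.toString.call(value);
}

console.log(describeUnknown(new Map())); // "[object Map]"
console.log(describeUnknown(new Set())); // "[object Set]"
console.log(describeUnknown(new Date(0))); // "[object Date]"
console.log(describeUnknown(new Uint8Array(2))); // "[object Uint8Array]"
console.log(describeUnknown({})); // "[object Object]"
```

A de-serializer could use a tag like this as the default and add type-specific handling only for the objects where the tag is not useful enough.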
I agree w/ @dotnetCarpenter. I guess we can simply pass a list of blacklisted object names in the options so that they are omitted when compiling matched graph nodes. I'm not sure, but it sounds like some sort of Web API/globals list could be used as the default.
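A minimal sketch of that idea, under stated assumptions: the node shape ({ constructorName, properties }), the option name ignoredNames, and the default list are all hypothetical and do not reflect the library's actual internals or API.

```javascript
// Hypothetical default list; a real one might be derived from Web API globals.
const DEFAULT_IGNORED_NAMES = ["Location", "Window", "HTMLDocument"];

// Hypothetical filter: drops nodes whose constructor name is on the ignore list
// before they would be compiled into plain objects.
function omitIgnoredNodes(nodes, options = {}) {
  const ignored = new Set(options.ignoredNames ?? DEFAULT_IGNORED_NAMES);
  return nodes.filter((node) => !ignored.has(node.constructorName));
}

// Made-up nodes for illustration only.
const sampleNodes = [
  { constructorName: "Object", properties: { href: "/privacy/" } },
  { constructorName: "Location", properties: { href: "https://example.com/" } },
];

console.log(omitIgnoredNodes(sampleNodes).length); // 1
console.log(omitIgnoredNodes(sampleNodes, { ignoredNames: [] }).length); // 2
```

Passing the list through options keeps the default behavior safe while letting callers opt back in to specific types.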
So after a little tweak, the same code as above gives me a proper result on "https://polypane.app/css-specificity-calculator/" (just an example of an SPA that runs on top of Gatsby):
[
{ children: 'privacy policy', href: '/privacy/' },
{ children: 'disclaimer', href: '/disclaimer/' },
{ children: 'Integrations', href: '/integrations/' },
{ children: 'Testimonials', href: '/testimonials/' },
{ children: 'Download', href: '/download/' },
{ children: 'Integrations', href: '/integrations/' },
{ children: 'Docs', href: '/docs/' },
{ children: 'Site Quality', href: '/site-quality/' },
{ children: 'Accessibility', href: '/accessibility/' },
{
children: 'Accessibility Statement',
href: '/accessibility-statement/'
},
{ children: 'Disclaimer', href: '/disclaimer/' },
{ children: 'Legal', href: '/legal/' },
{ children: 'Home', href: '/' },
{ children: 'All free tools', href: '/resources/' },
{
children: 'Responsive design glossary',
href: '/responsive-design-glossary/'
},
{ children: 'Create Polypane workspace', href: '/create-workspace/' },
{ children: 'Color contrast checker', href: '/color-contrast/' },
{ children: 'For Marketers', href: '/marketers/' },
{ children: 'For Agencies', href: '/agencies/' },
{ children: 'For QA', href: '/quality-assurance/' },
{ children: 'Pricing', href: '/pricing/' },
{ children: 'privacy policy', href: '/privacy/' },
{ children: 'Privacy', href: '/privacy/' },
{ children: 'disclaimer', href: '/disclaimer/' },
/* more results */
]
Hi,
I am trying to capture hrefs from the website below, with the following code snippet:
I got this issue:
Thanks for replies.