adriancooney / puppeteer-heap-snapshot

API and CLI tool to fetch and query Chrome DevTools heap snapshots.
MIT License
1.35k stars 68 forks

Unknown or unsupported object with type 'Location' #1

Open yohikofox opened 2 years ago

yohikofox commented 2 years ago

Hi,

I am trying to capture hrefs from the website below:

https://www.sarenza.com

with the following code snippet:

const Puppeteer = require("puppeteer");
const { captureHeapSnapshot, findObjectsWithProperties } = require("puppeteer-heap-snapshot");

const start = async () => {
    const browser = await Puppeteer.launch();
    const page = await browser.newPage();

    await page.goto("https://www.sarenza.com");

    let heapSnapshot = await captureHeapSnapshot(await page.target());

    console.log('heapSnapshot:', findObjectsWithProperties(heapSnapshot, ['href']));
}

start();

I got this error:

(node:38964) UnhandledPromiseRejectionWarning: Error: Unknown or unsupported object with type 'Location'
    at compileGraphNodeObject (C:\ws\white-label\code\test\pupeteer-heap-snapshot\node_modules\puppeteer-heap-snapshot\dist\cjs\src\build-object.js:75:19) 
    at buildObjectFromNodeId (C:\ws\white-label\code\test\pupeteer-heap-snapshot\node_modules\puppeteer-heap-snapshot\dist\cjs\src\build-object.js:34:12)  
    at C:\ws\white-label\code\test\pupeteer-heap-snapshot\node_modules\puppeteer-heap-snapshot\dist\cjs\src\query.js:16:57
    at Array.map (<anonymous>)
    at findObjectsWithProperties (C:\ws\white-label\code\test\pupeteer-heap-snapshot\node_modules\puppeteer-heap-snapshot\dist\cjs\src\query.js:14:20)     
    at start (C:\ws\white-label\code\test\pupeteer-heap-snapshot\index.js:16:34)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
(Use `node --trace-warnings ...` to show where the warning was created)
(node:38964) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:38964) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Thanks for replies.

jmitchel3 commented 2 years ago

I had a look at the website. It runs Cloudflare, which protects against bots and scraping.

You might consider trying:

const Puppeteer = require("puppeteer");
const { captureHeapSnapshot, findObjectsWithProperties } = require("puppeteer-heap-snapshot");
const randomUseragent = require('random-useragent'); // npm install random-useragent

const start = async () => {
    const browser = await Puppeteer.launch();
    const page = await browser.newPage();
    const agent = randomUseragent.getRandom();
    await page.setUserAgent(`${agent}`);

    await page.goto("https://www.sarenza.com");

    let heapSnapshot = await captureHeapSnapshot(await page.target());

    console.log('heapSnapshot:', findObjectsWithProperties(heapSnapshot, ['href']));
}
start();

Remember that headless Puppeteer lets the requested server (i.e. the requested URL) know that it is a headless version of Chrome. Servers and services like Cloudflare can block this very easily.
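
To illustrate how a server could tell, here is a minimal sketch. It assumes the default headless user agent contains the token "HeadlessChrome", as it did in Chrome builds around the time of this thread. The helper names and the sample string are hypothetical and only for illustration.

```javascript
// Assumption: headless Chrome's default user agent contains "HeadlessChrome".
function looksHeadless(userAgent) {
  return userAgent.includes("HeadlessChrome");
}

// Returns the same string with the headless token replaced by "Chrome".
function maskHeadlessToken(userAgent) {
  return userAgent.replace("HeadlessChrome", "Chrome");
}

// Illustrative user agent string, not captured from a real session.
const sampleAgent =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) " +
  "HeadlessChrome/90.0.4430.0 Safari/537.36";

console.log(looksHeadless(sampleAgent));
console.log(looksHeadless(maskHeadlessToken(sampleAgent)));
```

In a Puppeteer script the masked string could be passed to page.setUserAgent(), with the original taken from browser.userAgent(). The user agent is probably only one of several signals a service like Cloudflare checks, so this alone may not be enough.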

adriancooney commented 2 years ago

Sorry for the delay in replying here. I haven't had much time to look into this issue. From what I can see, puppeteer-heap-snapshot simply does not know how to deserialize the Location type from the heap snapshot. I'm happy to accept PRs if anyone wants to tackle understanding the data type and deserializing it. It's surprising that this data has its own data type rather than being a primitive string or a plain object.

dotnetCarpenter commented 2 years ago

You mean the Location DOM object? It looks like src/build-object.ts does not handle any DOM objects, so I expect it will fail on any DOM reference in JS. Vue/React apps will not have this issue, since they use a virtual DOM and cannot* reference DOM elements directly.

Calling .toString() on any unknown object is safe. It will give you [object NAME]. If that is not useful, then you need specific code for that object. You are also missing the built-in JS objects: TypedArray, Temporal, Date, Map, Set, etc. (maybe I missed some here).

* This is a simplification...
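
The fallback suggested in this comment could look roughly like the sketch below. It is not the library's actual code: the function name, the node shape and the list of supported types are all assumptions made for illustration.

```javascript
// Hypothetical sketch: none of these names come from puppeteer-heap-snapshot.
// Types this sketch pretends to know how to rebuild (an assumption).
const SUPPORTED_TYPES = new Set(["Object", "Array"]);

// `node` has an assumed shape: { type: string, value: unknown }.
function compileNodeWithFallback(node) {
  if (SUPPORTED_TYPES.has(node.type)) {
    return node.value;
  }
  // Instead of throwing, return the same kind of placeholder that
  // Object.prototype.toString produces for host objects.
  return `[object ${node.type}]`;
}

console.log(compileNodeWithFallback({ type: "Object", value: { href: "/docs/" } }));
console.log(compileNodeWithFallback({ type: "Location", value: null })); // prints "[object Location]"
```

The trade-off is that a placeholder string hides the object's real contents, so a query on 'href' would return fewer useful results for such nodes but would no longer abort the whole search.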

Nedgeva commented 2 years ago

I agree with @dotnetCarpenter. I guess we can simply pass blacklisted object names in the options to omit them when compiling matched graph nodes. I'm not sure, but it sounds like some set of Web API globals could be used as the default list.

So after a small tweak, the same code as above gives me a proper result on "https://polypane.app/css-specificity-calculator/" (just an example of an SPA that runs on top of Gatsby):

[
  { children: 'privacy policy', href: '/privacy/' },
  { children: 'disclaimer', href: '/disclaimer/' },
  { children: 'Integrations', href: '/integrations/' },
  { children: 'Testimonials', href: '/testimonials/' },
  { children: 'Download', href: '/download/' },
  { children: 'Integrations', href: '/integrations/' },
  { children: 'Docs', href: '/docs/' },
  { children: 'Site Quality', href: '/site-quality/' },
  { children: 'Accessibility', href: '/accessibility/' },
  {
    children: 'Accessibility Statement',
    href: '/accessibility-statement/'
  },
  { children: 'Disclaimer', href: '/disclaimer/' },
  { children: 'Legal', href: '/legal/' },
  { children: 'Home', href: '/' },
  { children: 'All free tools', href: '/resources/' },
  {
    children: 'Responsive design glossary',
    href: '/responsive-design-glossary/'
  },
  { children: 'Create Polypane workspace', href: '/create-workspace/' },
  { children: 'Color contrast checker', href: '/color-contrast/' },
  { children: 'For Marketers', href: '/marketers/' },
  { children: 'For Agencies', href: '/agencies/' },
  { children: 'For QA', href: '/quality-assurance/' },
  { children: 'Pricing', href: '/pricing/' },
  { children: 'privacy policy', href: '/privacy/' },
  { children: 'Privacy', href: '/privacy/' },
  { children: 'disclaimer', href: '/disclaimer/' },
  /* more results */
]
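
The tweak itself is not shown in the thread. A minimal sketch of the blacklist idea, assuming matched nodes carry a type name, might look like this. The default list, the function name and the node shape are hypothetical and not part of puppeteer-heap-snapshot's API.

```javascript
// Hypothetical sketch: the default list and node shape are assumptions,
// not part of puppeteer-heap-snapshot's API.
const DEFAULT_IGNORED_TYPES = ["Location", "Window", "HTMLDocument"];

// `nodes` is an assumed array of { type: string, properties: object }.
function filterCompilableNodes(nodes, ignoredTypes = DEFAULT_IGNORED_TYPES) {
  const ignored = new Set(ignoredTypes);
  // Drop every node whose type name is on the ignore list.
  return nodes.filter((node) => !ignored.has(node.type));
}

// Illustrative data only, not taken from a real heap snapshot.
const matches = [
  { type: "Object", properties: { children: "Docs", href: "/docs/" } },
  { type: "Location", properties: { href: "https://example.com/" } },
];

const results = filterCompilableNodes(matches).map((node) => node.properties);
console.log(results);
```

Filtering before compilation means unsupported types never reach the code that throws, at the cost of silently skipping any match of an ignored type.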