PipedreamHQ / pipedream

Connect APIs, remarkably fast. Free for developers.
https://pipedream.com

Puppeteer / Playwright support #209

Closed dylburger closed 7 months ago

dylburger commented 4 years ago

Puppeteer provides a Node API to drive a Chrome headless browser. This allows you to programmatically visit sites, take screenshots, and more.

Pipedream currently doesn't support Puppeteer, but I'd like to run workflows that use it.

dylburger commented 4 years ago

Currently the Pipedream execution environment uses Node 10 with worker threads. The chrome-aws-lambda package that allows us to run Puppeteer with a small Chrome binary uses a dependency that requires process.umask() to be available, and it's not in the worker threads API for Node 10.

We've asked the Node team to consider backporting the stub of process.umask() they implemented for Node 11+ to Node 10, as well (see PR here), and we'll follow up as they get back to us.
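
To make the failure mode above concrete, here is a minimal sketch of a defensive wrapper around process.umask(). The helper name safeUmask and the fallback value are hypothetical and do not come from chrome-aws-lambda or from Node itself; the sketch only illustrates how code could tolerate a runtime where process.umask() is missing or throws.

```javascript
// Hypothetical helper, for illustration only (not part of any library).
// Returns the current umask when the runtime exposes it, otherwise a fallback.
function safeUmask(proc = process, fallback = 0o022) {
  // Some runtimes (per the comment above, Node 10 worker threads) lack umask.
  if (typeof proc.umask !== 'function') {
    return fallback;
  }
  try {
    return proc.umask();
  } catch (err) {
    // If the call exists but is not permitted, fall back as well.
    return fallback;
  }
}
```

A dependency that called something like safeUmask() instead of process.umask() directly would not crash in such an environment, although it would be working with an assumed mask rather than the real one.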

lucasverra commented 4 years ago

What does that mean in less technical terms? And what is the timing for when this feature might actually be feasible?

dylburger commented 4 years ago

Hi @lucasverra , I'm not sure on the timing. I desperately want Puppeteer support myself, so I'm working to make sure we find a solution.

Now that you've commented on this thread, you should get its updates, so I'll make sure to keep you notified as we make progress.

lucasverra commented 4 years ago

Standing By

soyezcloud commented 4 years ago

just saw on PH this : https://puppet-master.sh/ , maybe a solution?

dylburger commented 4 years ago

@sauvegardezvous that's a very good idea. I think we could support driving a remote instance of Chrome before we support it locally. I'm looking into https://puppet-master.sh/ and https://docs.browserless.io/ , as well.

dylburger commented 4 years ago

@sauvegardezvous @lucasverra I added support for Puppet Master and Browserless, and built a couple of example workflows showing you how to use them to take a screenshot and save it to Amazon S3 (just an example):

https://pipedream.com/@dylburger/take-a-screenshot-with-puppet-master-save-to-s3-p_A2CNPD/readme https://pipedream.com/@dylburger/take-a-screenshot-with-browserless-save-to-s3-p_n1C2y6/readme

For Puppet Master, I added each of the parameters that you can pass to the screenshot API as form params for the Take a Screenshot action, so you can, for example, set fullPage to true to take a full page screenshot.

Browserless provides a remote browser that you can connect to via websocket, so it gives you more programmatic control if you need to do more advanced work with Puppeteer. It's a paid service, though, so you'll need to sign up for an API key.

I'm keeping this ticket open to track native Puppeteer support on Pipedream, just wanted to share those options in the meantime.

chrigi commented 3 years ago

https://puppet-master.sh/ seems to no longer exist, at least the website/service. The repo is still there: https://github.com/saasify-sh/puppet-master

Does the Pipedream integration support a self-hosted version? If not, I guess the integration could be removed.

dylburger commented 3 years ago

@chrigi confirmed. I removed the Puppet Master app from our integrations. Have you tried Browserless? That's what we use internally.

bloycey commented 3 years ago

Puppeteer support would be amazing. Will follow along here for updates.

chrigi commented 3 years ago

I was mainly looking for an easy way to scrape an SPA, but later noticed that even if Puppet Master were still around, it seems it only supported taking a screenshot (I think?). I didn't want to pay for Browserless, since it's not that important to me. Luckily, I noticed the hydration function was easy to find, so I could just extract and execute it on Pipedream and parse the resulting data structure. No Puppeteer necessary for the moment.
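
For readers wondering what that kind of workaround can look like, here is a rough sketch of a simpler variant of the same idea: pulling embedded JSON state out of the page's HTML without a browser. It assumes the site ships its initial state as JSON inside a script tag with a known id; the function name extractEmbeddedState and the id __STATE__ are made up for illustration, and real sites will differ (the case described above involved executing a hydration function rather than parsing JSON).

```javascript
// Illustrative sketch: assumes the page embeds its state as JSON in
// <script id="__STATE__">...</script>. Both names are hypothetical.
function extractEmbeddedState(html, scriptId = '__STATE__') {
  // Escape the id so it is safe to place inside a regular expression.
  const safeId = scriptId.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(
    `<script[^>]*id=["']${safeId}["'][^>]*>([\\s\\S]*?)</script>`,
    'i'
  );
  const match = html.match(pattern);
  if (!match) {
    return null; // No matching script tag found.
  }
  try {
    return JSON.parse(match[1]);
  } catch {
    return null; // The tag was present but did not contain valid JSON.
  }
}
```

In a workflow step you would fetch the page HTML with any HTTP client and pass the response body to this function.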

justinr1234 commented 3 years ago

Can confirm I need this as well. Was trying to find a workaround, but haven't been able to come up with anything.

ashutoshsaboo commented 3 years ago

Hello, can we add support for Playwright - https://github.com/microsoft/playwright - too? There's a different package for running it on AWS Lambda - https://github.com/JupiterOne/playwright-aws-lambda . The readme mentions that it's based on chrome-aws-lambda, but I'm not sure if it needs the same process.umask for Node 10 that you mention above.

@dylburger If this can work without it, can we support this instead?

dylburger commented 3 years ago

@ashutoshsaboo I tested again and neither the native Playwright package, nor playwright-aws-lambda, work out of the box on Pipedream. When we tackle this larger issue, we'll investigate support for Playwright, as well (I've updated the issue title to reflect that).

gaelollivier commented 3 years ago

I managed to make it work using https://www.browserless.io and puppeteer-core (puppeteer fails to install because it tries to download chromium):

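const puppeteer = require('puppeteer-core');

let browser;

try {
  // Connect to the remote Chrome instance hosted by Browserless
  browser = await puppeteer.connect({
    browserWSEndpoint: 'wss://chrome.browserless.io',
  });
  const page = await browser.newPage();

  await page.goto('https://www.example.com');

  // Runs in the page context and returns the inner HTML of the first <h1>
  const res = await page.evaluate(() => document.querySelector('h1').innerHTML);

  return res;
} catch (error) {
  // Any connection or navigation failure yields null
  return null;
} finally {
  // Always release the remote browser session
  if (browser) {
    await browser.close();
  }
}

sjn001tvh commented 2 years ago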

Thanks Dylan for pointing me to this thread. I appreciate that I'm not the only one who would like to have a browserless solution to open a webhook to a browser. In developing one of the workflow steps, I need to create a UI to input and manipulate information before I submit a project record into ClickUp. Using Dropbox as a trigger mechanism, I use some of the steps to extract data from ClickUp and process it into an HTML file containing dropdown lists that I can select from on my phone with minimal typing. I disdain typing text on my phone with my index finger. Brief entries are acceptable, but project descriptions are not, and project types quickly become inconsistent when they are entered by hand rather than picked from a list, which causes extra work that shouldn't be necessary.

I was hoping that Puppeteer would be able to solve this problem because it was free. I guess I was just being cheap. I've already signed up for browserless.io on the pay-as-you-go plan. The rates aren't that bad for me, given that I will probably use the service more during the initial testing phase than I will once the workflow has been completed.

Rutledge commented 2 years ago

+1 for puppeteer native support. Hopefully more than a pipedream :)

sirloinofbeef commented 2 years ago

Following

Mind-Reader commented 2 years ago

+1

ruie commented 2 years ago

+1

tomrob765 commented 2 years ago

+1 (!)

Can't wait to do this straight from Pipedream! My workaround is using the API from this site: https://htmlcsstoimage.com/ - which works quite well to their credit

ctrlaltdylan commented 1 year ago

Just an update on the playwright/puppeteer support. We have recently increased our memory limits up to 2GB per workflow, which opens up the possibility of installing puppeteer and the large Chromium binary it's packaged with.

However, we're running into issues loading the Chromium binary at the time of the step's execution. But, we have one constraint solved.

rubentorresbonet commented 1 year ago

Hey, any update regarding running the binaries? When trying Playwright, there is an error: Executable doesn't exist at /home/sbx_user1051/.cache/ms-playwright/chromium-1019/chrome-linux/chrome

dylburger commented 1 year ago

@rubentorresbonet Not yet. We are likely to provide the Chrome binary ourselves since it requires some special tweaks to work in our execution environment. This isn’t on our immediate backlog but we’ll update you here when we do that work.

Did you check out Browserless or other remote browser services? Curious if that would work for your use case.

restyler commented 1 year ago

Hey guys, I have recorded a short video on how Pipedream can be used to run a headless Chrome API. ScrapeNinja API, which is used in the video, is not exactly a generic Puppeteer instance. It has two API endpoints, /scrape and /scrape-js, where /scrape-js is essentially a "dumbed down" version of headless Chrome that is more convenient to use because it targets the specific use case of web scraping. If your use case is not web scraping related, ScrapeNinja is probably not for you and the video will not be relevant.

Building low code web scraper to extract Hackernews titles to Google Sheets, via Pipedream: https://www.youtube.com/watch?v=uBC752CWTew

cshape commented 1 year ago

Following! It's nice that you can scrape with other methods but I've got some stuff built with Puppeteer and it would be great to just copy/paste into Pipedream.

eyalway2cu commented 1 year ago

Would also love to have Puppeteer support in Pipedream :).

commenting to get updates :)

parisetflorian commented 11 months ago

+1

sam-frampton commented 9 months ago

+1

ctrlaltdylan commented 7 months ago

Pipedream now supports Playwright and Puppeteer natively in your Node.js code steps and pre-built actions! No 3rd party APIs are required to issue commands to Chromium remotely.

Beyond retrieving HTML content, you can perform actions like clicking on elements, rendering JavaScript on SPA sites, generating screenshots, and capturing PDFs.

Simply import @pipedream/browsers within your Node.js code running on Pipedream to install and launch a Puppeteer or Playwright browser.

Or just use the pre-built actions for Puppeteer & Playwright to get HTML content, take a screenshot or generate a PDF for a webpage.

Here’s an example running a Puppeteer instance to scrape HTML from a webpage:

import { puppeteer } from '@pipedream/browsers';

export default defineComponent({
  async run({steps, $}) {
    const browser = await puppeteer.browser();

    // Interact with the web page programmatically
    const page = await browser.newPage();

    await page.goto('https://pipedream.com/');
    const title = await page.title();
    const content = await page.content();

    await browser.close();

    return { title, content };
  },
})

Read the docs here: https://pipedream.com/docs/code/nodejs/browser-automation/
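
As a rough sketch of the screenshot and PDF use cases mentioned above, the helper below takes an already-launched Puppeteer browser (for example, the object returned by puppeteer.browser() in the example above) and captures both for a given URL. The function name capturePage is hypothetical; page.screenshot(), page.pdf() and page.close() are standard Puppeteer page methods, and the options shown are just one reasonable choice.

```javascript
// Sketch only: `browser` is assumed to be a Puppeteer browser instance,
// e.g. the result of `await puppeteer.browser()` from @pipedream/browsers.
async function capturePage(browser, url) {
  const page = await browser.newPage();
  try {
    await page.goto(url);
    // Capture the full scrollable page rather than just the viewport.
    const screenshot = await page.screenshot({ fullPage: true });
    // Render the same page as an A4 PDF.
    const pdf = await page.pdf({ format: 'A4' });
    return { screenshot, pdf };
  } finally {
    // Close the tab even if navigation or rendering fails.
    await page.close();
  }
}
```

Inside a Pipedream step, you would call this from run() with the launched browser, then close the browser once you are done with the results.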