JupiterOne / playwright-aws-lambda

Support for running Microsoft's Playwright on AWS Lambda and Google Cloud Functions
MIT License
392 stars 53 forks

Struggling on AWS Lambda. #69

Closed: imprisonedmind closed this 11 months ago

imprisonedmind commented 1 year ago

I have a small Express API that scrapes an Instagram post for its image URL (ImgURL) and description text (descriptionText). I moved it over to AWS Lambda after originally trying to run it as a Vercel serverless function. The function runs but takes longer than 10 seconds, which times out on the free version. I keep timing out; any help is appreciated.

const playwright = require('playwright-aws-lambda');

exports.handler = async (event) => {
  const link = event.link;
  // Reject a missing link as well as a non-Instagram one.
  if (typeof link !== 'string' || !link.includes('instagram')) {
    throw new Error('Not an Instagram link');
  }

  let browser;
  try {
    browser = await playwright.launchChromium({headless: true});
    const context = await browser.newContext();

    const page = await context.newPage();
    await page.goto(link);

    const imgClass = 'img.x5yr21d.xu96u03.x10l6tqk.x13vifvy.x87ps6o.xh8yej3';
    const descClass = 'h1._aacl._aaco._aacu._aacx._aad7._aade';

    const image = await page.waitForSelector(imgClass);
    const imageUrl = await image.getAttribute('src');

    const desc = await page.waitForSelector(descClass);
    const descText = await desc.innerText();

    return {
      statusCode: 200,
      body: JSON.stringify({imageUrl, descText}),
    };
  } catch (error) {
    console.error(error);

    return {
      statusCode: 400,
      body: JSON.stringify({error: error.message}),
    };
  } finally {
    // Close the browser even when scraping fails, so it is not left running.
    if (browser) {
      await browser.close();
    }
  }
};


Packages are at:

  "dependencies": {
    "playwright-aws-lambda": "^0.10.0",
    "playwright-core": "^1.32.0"
  }

AWS Lambda: Node v16.x, 1600 MB memory, x86_64

ashwaq06 commented 1 year ago

I am new to open source contribution and have experience with various AWS services. I would be glad to work on this issue if it can be assigned to me. Based on the information you provided, your Lambda function appears to have enough memory allocated (1600 MB) and is using a recent Node.js runtime (v16.x). However, you're using an older version of the playwright-aws-lambda package (v0.10.0), which may not have all the latest optimizations for running Playwright on AWS Lambda.

I recommend upgrading to the latest version of playwright-aws-lambda (v1.0.0 as of March 2023) and enabling the REUSE_BROWSER option to reuse the browser context across multiple function invocations. This can help reduce the startup time of your function and improve its overall performance.
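The reuse idea above can be sketched independently of any package option. The following is a minimal sketch, not the package's own mechanism: it assumes Lambda keeps module-scope variables alive between warm invocations, the helper name `getBrowser` is hypothetical, and it does not rely on a `REUSE_BROWSER` setting existing.

```javascript
// Hypothetical helper: cache one browser per Lambda container in module scope.
let cachedBrowser = null; // browser kept from a previous (warm) invocation
let pendingLaunch = null; // launch currently in flight, if any

async function getBrowser(launch) {
  // Reuse the cached browser only while it is still connected.
  if (cachedBrowser && cachedBrowser.isConnected()) {
    return cachedBrowser;
  }
  if (!pendingLaunch) {
    pendingLaunch = launch().then(
      (browser) => {
        cachedBrowser = browser;
        pendingLaunch = null;
        return browser;
      },
      (error) => {
        pendingLaunch = null; // allow a retry on the next call
        throw error;
      }
    );
  }
  return pendingLaunch;
}

// Example handler: closes the per-request context but keeps the browser.
async function handler(event) {
  // Required lazily so the helper above can be exercised without the package.
  const playwright = require('playwright-aws-lambda');
  const browser = await getBrowser(() =>
    playwright.launchChromium({ headless: true })
  );
  const context = await browser.newContext();
  try {
    const page = await context.newPage();
    await page.goto(event.link);
    return {
      statusCode: 200,
      body: JSON.stringify({ title: await page.title() }),
    };
  } finally {
    await context.close();
  }
}
```

Only warm invocations benefit from this; a cold start still pays the full browser launch cost.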

imprisonedmind commented 1 year ago

So I changed up my configuration a bit and got a minimal version working. It seems I am timing out when trying to scrape the image src of an Instagram post; getting other information works fine.

Here is the repo: https://github.com/imprisonedmind/insta-scrape-api/tree/full-version

ashwaq06 commented 1 year ago

The timeout may be related to the performance of the scraping operation itself. Instagram pages can be quite complex, so the scrape may take longer than expected to complete.

One approach to addressing this issue would be to optimize the scraping operation itself. For example, you could try to narrow down the elements you are searching for to make the scraping process more efficient.
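One way to make the scraping step cheaper, in the spirit of the suggestion above, is to skip downloads the scraper never reads. This is a minimal sketch using Playwright's `page.route`; the helper names and the list of blocked resource types are assumptions, and whether Instagram still renders the needed elements with these types blocked is untested here.

```javascript
// Assumed list of resource types the scraper does not need to download.
// Reading an <img> element's src attribute does not require fetching the image.
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

// Hypothetical helper: decide whether a request of this type should be aborted.
function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Hypothetical helper: install the filter on a Playwright page before page.goto().
async function blockHeavyResources(page) {
  await page.route('**/*', (route) => {
    const type = route.request().resourceType();
    return shouldBlock(type) ? route.abort() : route.continue();
  });
}
```

Calling `await blockHeavyResources(page)` before `page.goto(link)` would apply the filter to every request the page makes.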

Another approach would be to increase the timeout value for your AWS Lambda function. Depending on your specific requirements, you may be able to increase the timeout value to give the scraping operation more time to complete. You can do this by going to the AWS Lambda console, selecting your function, and increasing the timeout value in the function configuration.
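Alongside a larger function timeout, the handler can keep Playwright's own timeout below the time Lambda has left, so a slow page produces a handled error instead of a hard function timeout. This is a minimal sketch: the helper names and the 2-second reserve are assumptions, while `getRemainingTimeInMillis()` is the method the Lambda Node.js runtime provides on the context object.

```javascript
// Hypothetical helper: derive a Playwright timeout from the time Lambda has left.
// reserveMs is an assumed safety margin for cleanup and building the response.
function playwrightTimeoutMs(lambdaContext, reserveMs = 2000) {
  const remaining = lambdaContext.getRemainingTimeInMillis();
  // Playwright treats a timeout of 0 as "no timeout", so never return less than 1.
  return Math.max(remaining - reserveMs, 1);
}

// Hypothetical helper: apply the budget to all waits and navigations on a page.
function applyTimeoutBudget(page, lambdaContext) {
  page.setDefaultTimeout(playwrightTimeoutMs(lambdaContext));
}
```

This needs the handler to accept the second argument, as in `exports.handler = async (event, context) => { ... }`.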

joshuabaker commented 1 year ago

@imprisonedmind Did you manage to get something working with this? I’ve got Pro on Vercel and am also getting timeouts.

shuhankuang commented 11 months ago

@ashwaq06 Where can I find version 1.0.0? It seems that there are no newer versions available.

imprisonedmind commented 11 months ago

joshuabaker wrote: "@imprisonedmind Did you manage to get something working with this? I’ve got Pro on Vercel and am also getting timeouts."

No, I went another route. I would consider running the environment in Bun to see if that helps with speed.