cheeriojs / cheerio

The fast, flexible, and elegant library for parsing and manipulating HTML and XML.
https://cheerio.js.org
MIT License

Memory efficient cheerio.load #1343

Open JonathanMontane opened 5 years ago

JonathanMontane commented 5 years ago

Hi,

I am a software engineer at Algolia and we love your library. However, we've run into a pickle with certain documents we are trying to process: they fail to load because of the memory consumption of cheerio.load.

After doing some analysis, it seems that cheerio.load transforms a 5MB file into a 150MB to 500MB in-memory representation. That's a 30x to 100x increase in size.
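For reference, the multipliers follow directly from the figures quoted above. The sketch below is only the arithmetic; `expansionRatio` is a hypothetical helper name, not something cheerio provides.

```javascript
// Hypothetical helper: how many times larger the heap footprint is
// than the input document.
const MB = 1024 * 1024;
const expansionRatio = (inputBytes, heapBytes) =>
  Math.round(heapBytes / inputBytes);

const low = expansionRatio(5 * MB, 150 * MB);  // 30
const high = expansionRatio(5 * MB, 500 * MB); // 100
console.log(low, high);
```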

It would be awesome for us to have a more memory-efficient parser. I have looked into how the htmlparser2 library is used and it seems to me that it could be possible to have a more efficient representation of the elements, but I am not 100% sure how.

Could this type of constraint be something you consider for a future release? Thank you!

Code snippet used for measurements:

// Requires the --expose-gc Node flag; global.gc is undefined without it.
const cheerio = require('cheerio');

// Build a flat document with `siblings` identical <div> children.
const generateWideFile = (siblings) => {
  const elts = `<div>Some Element</div>`.repeat(siblings);
  return `<html><body>${elts}</body></html>`;
};

const testMemoryPressure = () => {
  // const sizes = [100, 200, 300, 400, 500];
  // const sizes = [1000, 2000, 3000, 4000, 5000];
  // const sizes = [10000, 20000, 30000, 40000, 50000];
  const sizes = [100000, 200000, 300000, 400000, 500000];
  global.gc();
  // Baseline heap usage before any document is generated or parsed.
  const base = process.memoryUsage().heapUsed;

  const results = sizes.map((size) => {
    global.gc();
    const html = generateWideFile(size);
    const $ = cheerio.load(html, {
      //_useHtmlParser2: true,
      decodeEntities: true,
      normalizeWhitespace: false,
      xmlMode: false,
    });
    global.gc();
    // Measured while both `html` and `$` are still reachable.
    const usedSize = process.memoryUsage().heapUsed;
    // Use `$` after the measurement so it cannot be collected before it.
    $.html();
    const memory = usedSize - base;
    return { memory, size: html.length, ratio: Math.round(memory / html.length) };
  });

  console.log(results);
};

testMemoryPressure();

cheerio version: 1.0.0-rc.3
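The measurement pattern in the snippet can be factored into a small reusable helper. This is a minimal sketch using only Node built-ins; `measureRetainedHeap` is a hypothetical name, and the array of plain objects merely stands in for a parsed document so that the example runs without cheerio installed.

```javascript
// Hypothetical helper (not part of cheerio): measures heap growth while the
// value returned by `build` is still reachable.
const measureRetainedHeap = (build) => {
  // global.gc only exists when Node is started with --expose-gc;
  // fall back to a no-op so the helper still runs without the flag.
  const collect = typeof global.gc === 'function' ? global.gc : () => {};

  collect();
  const before = process.memoryUsage().heapUsed;
  const value = build();
  collect();
  const after = process.memoryUsage().heapUsed;

  // Returning `value` keeps it alive across the second measurement.
  return { value, bytes: after - before, gcAvailable: collect === global.gc };
};

// Usage: a stand-in workload of 100000 small objects.
const result = measureRetainedHeap(() =>
  Array.from({ length: 100000 }, (_, i) => ({ id: i, text: 'Some Element' }))
);
console.log(result.bytes, result.gcAvailable);
```

Without --expose-gc the numbers are noisier, since uncollected garbage is counted too; the `gcAvailable` field makes that visible in the output.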

5saviahv commented 3 years ago

This issue has gone untouched for so long ... maybe it is related to #263 and that V8 bug in general?

myfreeer commented 2 years ago

Maybe related to https://github.com/cheeriojs/cheerio/pull/1960