apify / crawlee

Crawleeβ€”A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.2k stars 641 forks source link

Add an option to turn off `ow` in production #1470

Open mnmkng opened 2 years ago

mnmkng commented 2 years ago

Describe the feature Add either an env var, configuration option or some other way of turning off ow validations to improve performance. Maybe we could just turn it off if NODE_ENV=production

Motivation The validations are sometimes pretty slow.

Constraints Since we only validate public APIs, it shouldn't have any impact on stack traces in production.

B4nan commented 2 years ago

Maybe we could replace ow with something faster, I see @vladfrangu just used https://github.com/sapphiredev/shapeshift in the new memory storage, it claims to be blazing fast :P

@vladfrangu how does it compare to ow, do you think the migration would be easy'ish?

vladfrangu commented 2 years ago

Well you can see from apify/apify-ts#100 how the code looks. Its quite a speed up compared to ow (~1.2mil ops/second vs ow's ~250k ops/sec on my machine validating this type of object:

const dataType = s.object({
  number: s.number,
  negNumber: s.number,
  maxNumber: s.number,
  string: s.string,
  longString: s.string,
  boolean: s.boolean,
  deeplyNested: s.object({
    foo: s.string,
    num: s.number,
    bool: s.boolean,
  }),
});

).

Migration should be relatively straight forward and I can do that if you'd like. I also plan on PRing a change to shapeshift today to go from the shorthands to longer method names (right now things like greaterThan are typed as gt which, while its super fast to write, its confusing to read) and might even get it released today.

Of course, any validation is always gonna end up slower than no validation, so perhaps we could still consider a process env argument for it.. but that's your call. πŸ‘

mnmkng commented 2 years ago

There's this very magical validation in Request objects. Could you please try to recreate it with the new package and see how it performs against ow? Both with and without the commented optimization. https://github.com/apify/apify-ts/blob/22bb7cfadea8fd4a20086df2fe464ef4d573481d/packages/core/src/request.ts#L113-L128

I would still allow turning it off in production somehow. I imagine we could get some nice savings in high concurrency Cheerio scrapes.

B4nan commented 2 years ago

Let's measure, I am not against it, especially if we would migrate away from ow so we would have to touch all the places anyway.

Btw are we sure its a good idea to disable the validation for node_env=prod? This will have effect on the scrapers too, right? Those are developed directly on platform, so the validation does make sense to me there.

mnmkng commented 2 years ago

There's this ow shim to reduce bundle size. It would also be faster, most likely, you never know with proxies.

mnmkng commented 2 years ago

Btw are we sure its a good idea to disable the validation for node_env=prod? This will have effect on the scrapers too, right? Those are developed directly on platform, so the validation does make sense to me there.

That's interesting. Web Scraper has Development mode, so we could override the env var there. The other ones don't have that though.

vladfrangu commented 2 years ago

There's this very magical validation in Request objects. Could you please try to recreate it with the new package and see how it performs against ow? Both with and without the commented optimization.

https://github.com/apify/apify-ts/blob/22bb7cfadea8fd4a20086df2fe464ef4d573481d/packages/core/src/request.ts#L113-L128

I would still allow turning it off in production somehow. I imagine we could get some nice savings in high concurrency Cheerio scrapes.

Tried it out with this code: https://gist.github.com/vladfrangu/eb4e94ae0f3b568da47423236d0f7963

The speedup is there, but it definitely looks like it can be improved for the default object strategy in shapeshift... I have an idea for it, but I don't know if it'll actually speed anything up or not so I'll have to experiment with it (basically what you suggested where instead of checking the entire schema it checks every required field and if anything else is missing just exit early) πŸ˜…

image

vladfrangu commented 2 years ago

@mnmkng using your idea for validating objects, I applied it in shapeshift (PR https://github.com/sapphiredev/shapeshift/pull/101), and the speed up for the same input object and predicate as in the gist above is quite significant:

image

Thanks a lot for the idea/push! :D

mnmkng commented 2 years ago

Haha, nice πŸ˜€

mnmkng commented 2 years ago

Btw, the times are for one validation? Ouch, that's slow. Even for the fastest one πŸ˜…

vladfrangu commented 2 years ago

Do keep in mind those tests were done without proper benchmarking preparations (warming up functions by calling them repeatedly, etc). Sometimes the validation was as fast as 0.1ms. Still much faster than ow, and as fast as the faster checks we have in Request

mnmkng commented 2 years ago

Yeah, I was not complaining about the speed of the library πŸ˜‡ More making a further argument that we should probably turn off the validations in production.

matjaeck commented 1 year ago

Why is this additional validation tool needed in a TypeScript project?