assaf / node-replay

When API testing slows you down: record and replay HTTP responses like a boss
http://documentup.com/assaf/node-replay
MIT License

Improvement/Question: Lower memory requirements for large fixtures folders of single host #150

Closed · bisko closed this 4 years ago

bisko commented 6 years ago

In one of the projects I'm working on, I'm using Node Replay to allow for offline testing of an API (what better way to make it extremely fast). In our setup the API is accessible through a single host, i.e. api.domain.com. Having to test a lot of cases means that we run the scripts over many different scenarios.

If I'm understanding the current process of saving fixtures and then using them correctly, it should be as follows:

Capturing requests

Using requests cache

Now this is OK as long as the host folders are small. When the host folder grows to several hundred megabytes in size (ours is ~10+ GB), loading all the requests from the folder causes Node.js to run out of memory.

The fix proposed in this PR slightly changes how requests are saved and loaded.

What's new?

Instead of using a randomly generated file name, I updated the code to use a reproducible name based on the request's hash (see the getFileUidFromRequest method). This way each request is saved in its own file. When loading requests, instead of loading all the saved requests for a host, the library loads just the fixture for the request currently being made, again based on the request's hash.

This way the library only loads the requests it actually needs from the filesystem, instead of everything it can find in the host folder. A minimal sketch of the idea is shown below.
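For illustration, here is a minimal sketch of that idea. The function name getFixturePath and the fields being hashed (method, URL, body) are my own assumptions; the PR's actual getFileUidFromRequest may combine different fields.

```js
const crypto = require('crypto');
const path = require('path');

// Derive a reproducible fixture path from the request itself, so both
// capture and replay can compute it directly, without scanning the
// whole host folder.
function getFixturePath(fixturesDir, request) {
  const hash = crypto
    .createHash('sha256')
    .update(request.method)
    .update(request.url)
    .update(request.body || '')
    .digest('hex');
  // One file per unique request, e.g. fixtures/api.domain.com/3f5a9c...
  return path.join(fixturesDir, hash);
}
```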

Drawbacks

This method does introduce some drawbacks. A couple of them:

Comments, questions and suggestions are very welcome, so we can find a middle ground where this issue is resolved for both small and huge caches. Thanks! :)

assaf commented 6 years ago

It’s modular: instead of changing how the default catalog works, make a new catalog that uses request hashes, and then users can decide which catalog they want to use (or even both) based on their specific use case.
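A rough sketch of that suggestion, with hypothetical names only; this is not node-replay's actual catalog API. A hash-based catalog could sit alongside the default one and reuse the getFixturePath helper from the sketch above to look up a single fixture on demand:

```js
const fs = require('fs');

// Hypothetical catalog shape, for illustration only.
class HashCatalog {
  constructor(fixturesDir) {
    this.fixturesDir = fixturesDir;
  }

  // Load only the fixture recorded for this request, if any, instead of
  // reading every fixture for the host into memory.
  find(request) {
    const file = getFixturePath(this.fixturesDir, request);
    return fs.existsSync(file) ? fs.readFileSync(file, 'utf8') : null;
  }
}
```

Users with small fixture sets could keep the default catalog and its partial matching, while users with very large single-host folders could opt into the hash-based lookup.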

Many APIs include a signature/timestamp/nonce in the request, so you can’t necessarily tell what the URL would be and hash it. The default catalog can do partial matching based on regular expressions, but for that it needs to load all request matches into memory, which means it’s only good for fixtures on the order of MBs.
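To make that concrete, a tiny example (the URL and parameters are made up): two logically identical calls hash differently as soon as the request carries a varying parameter, which is exactly where regex-based partial matching is still needed.

```js
const crypto = require('crypto');

const sha = (url) => crypto.createHash('sha256').update(url).digest('hex');

// Semantically the same request, but the timestamp and nonce differ on
// every call, so an exact hash never matches the recorded fixture.
console.log(sha('https://api.domain.com/items?id=42&ts=1500000000&nonce=abc'));
console.log(sha('https://api.domain.com/items?id=42&ts=1500000123&nonce=xyz'));
// => two different digests
```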