diego-treitos / nginx_cache_multipurge

31 stars 12 forks source link

C based search helper #2

Closed iclukas closed 4 years ago

iclukas commented 4 years ago

Hi! I’ve deployed your script to an image server. It works fine, but we ran into a little trouble as with 50k cache files each wildcard purge operation took ~3-4 seconds. I’ve written a simple C program to speed up the process. It’s now down to 0.4-0.5 seconds in our case.

I can create a PR, but I’m not sure if this makes sense. Maybe a better idea would be a seperate project and some form of integration. What do you think?

diego-treitos commented 4 years ago

Hi @iclukas thanks a lot for coming back to the project with this information and your offer.

I wonder why there is such a big difference in time between your C program and the grep instruction. I don't know how did you integrate it, but I guess that you replaced the grep call with your own command.

If you are willing to share the code feel free to create a PR. I can look into integrating it.

iclukas commented 4 years ago

I made two optimisations, grep needs to scan every file in whole, except for the few matches. My program just looks at the first 512 bytes. Also, Regex is a bit overkill, so I’m just doing a simple byte by byte comparison.

I don't know how did you integrate it, but I guess that you replaced the grep call with your own command.

You are right and of course this should be optional. I’d suggest a new parameter to tell your script if the helper is present and where to find it.

The remaining issue is that the program needs to be compiled and put somewhere in the path. When I’m done with the Makefile, i’ll create the PR.

Any suggestions for a witty name? nginx_cache_keyfinder maybe?

diego-treitos commented 4 years ago

Actually thanks to the parameter -m1 the grep command stops reading after the first match so it should return after it finds the key. Maybe there is a bug in grep or it does not handle this very well (like reading files in big chunks).

The script is designed to be executed in runtime so checking if the program is present would require to check for every request, which is not optimal. On the other hand I don't think the wildcard purging should be stressed with many concurrent calls (although it depends on its use, of course).

Probably the best way to do it would be to add an option as a comment to switch from grep to your lightning_keyfinder so when anybody installs this, the change can be made by uncommenting the option. What do you think?

I believe nginx_cache_keyfinder is a very beautiful name :smile:

iclukas commented 4 years ago

Actually thanks to the parameter -m1 the grep command stops reading after the first match so it should return after it finds the key. Maybe there is a bug in grep or it does not handle this very well (like reading files in big chunks).

That’s true for the hits. But any miss (so 99.9% of the cache files) is scanned completely.

The script is designed to be executed in runtime so checking if the program is present would require to check for every request, which is not optimal. On the other hand I don't think the wildcard purging should be stressed with many concurrent calls (although it depends on its use, of course).

I agree that this would be bad design and it's better to make it configurable.

I just dug through the key/uri stuff and noticed that your original code can find keys like this: prefix*anything*suffix

Since I don’t need suffixes, my code doesn’t take this into account. This adds quite some complexity. 😐

diego-treitos commented 4 years ago

That’s true for the hits. But any miss (so 99.9% of the cache files) is scanned completely.

Well... this is embarrassing... I really didn't think about that. You are totally right. Actually in some quick tests, using something like "grep -Rasm1 '^KEY: '" ... cmp_cache_path ..." | grep '"... safe_grep_param ..."' | cut -d ':' -f1 | xargs -r rm -f" already executes more than twice as fast. Because first I get all the keys with the filenames and then I filter them.

I just dug through the key/uri stuff and noticed that your original code can find keys like this: prefix*anything*suffix

Since I don’t need suffixes, my code doesn’t take this into account. This adds quite some complexity. neutral_face

Honestly I coded this in 1 day or two and I did not touch it for months so I don't quite remember, but for what I see, the regular expressions are not checked unless the last character is a *. But yes, after passing that "barrier" you can pass pretty much anything that is accepted by grep.

Anyway that is an "undocumented feature" ;) so using just suffixes should be fine.

In any case I will also try to improve the grep version to work faster. Thanks again for bringing light over this.

iclukas commented 4 years ago

Hmm, I think I messed things up a bit in the Lua code. If a cache key doesn’t start with a /, the prefix will still have a slash. I think it would be safe just to strip the first char from cmp_uri and uri. On the other hand, this whole uri/key automatism seems to add a lot of complexity if you want anything different than standard behavior. Maybe there should be a second method where you can just set the key.