alekitto / class-finder

Utility classes to help discover other classes/namespaces
https://alekitto.github.io/class-finder/
MIT License
27 stars 5 forks source link

Separate `FileFinder` from the rest #20

Open oprypkhantc opened 3 months ago

oprypkhantc commented 3 months ago

Hey! Sorry for the long read: I wanted to provide some context for you to better understand the problem.

I have a package that depends on class-finder. It has multiple places where it iterates over FinderInterface. The namespaces are always the same, but it has to be iterated separately in multiple places.

For smaller projects, this is not a problem, as globbing and autoloading isn't that expensive for a couple hundred classes. In our case, the namespace we're scanning contains 10k+ classes, so the complete scan, along with autoloading and reflection, and done multiple times from different places, takes about 20sec.

Of course, the immediate solution is to simply cache the result of the FinderInterface. That works for a production environment, where files never change. For a development environment it's not that simple: usually only a few files change at a time, and doing a full scan every time is just not feasible. Which is why I'm instead implementing filemtime based caching: it uses class-finder's pathFilter feature to avoid autoloading/reflecting files that are known to not have changed.

The pathFilter solution works well if the FinderInteface is only iterated once. In our case, it's may be iterated over 6 times from different places and with different pathFilters. So a good optimization here would be to separate FileFinder from the rest of the code and allow specifying a custom instance of it when needed:

$cachedFileFinder = new CachedFileFinder(new GlobFileFinder(), new ArrayCache());

$finder1 = (new ComposerFinder())->useFileFinder($cachedFileFinder)->pathFilter(...);
$finder2 = (new ComposerFinder())->useFileFinder($cachedFileFinder)->pathFilter(...);

This way I would be able to cache the glob() that's happening inside the ComposerFinder.

In practise, this would mean adding a FileFinder interface and refactoring RecursiveIteratorTrait to use it. What do you think?

alekitto commented 3 months ago

Hi @oprypkhantc! Interesting use case, and I can see the problem in a development environment, but I don't think that separating the glob logic into an external finder is the best solution.

I see two problems here:

Probably what is missing here is a way to override the iterator(s) object (and maybe I should remove the final modifier from the iterator classes to allow subclassing). WDYT?

oprypkhantc commented 3 months ago

it applies only to a subset of finders and will not be usable on classmap-registered files and in production environments where an authoritative classmap should be generated by composer

Yes, so I'd expect it to only be available for "glob based" finders. An interface to denote "glob based" finders perhaps?

Probably what is missing here is a way to override the iterator(s) object (and maybe I should remove the final modifier from the iterator classes to allow subclassing). WDYT?

Well, I generally prefer composition over inheritance. In this case, that'd be easier for me - to implement just the CachingFinder, rather then extend both PSR0 and PSR4 finders. But that's just my preference; any solution will work for me :)

alekitto commented 3 months ago

Yes, so I'd expect it to only be available for "glob based" finders. An interface to denote "glob based" finders perhaps?

Not applicable, unfortunately: composer finder is not always glob-based, just under some conditions.

Well, I generally prefer composition over inheritance. In this case, that'd be easier for me - to implement just the CachingFinder, rather then extend both PSR0 and PSR4 finders. But that's just my preference; any solution will work for me :)

Yeah, not really a great solution... I need to think about it a little bit more

oprypkhantc commented 3 months ago

A little story of how I was debugging the performance:

Our app is almost 14k files. Just the $this->search() part (globing for all files) takes 2.5-3 seconds the first time. Even with additional caching, this initial scan will have to be done at least once per "boot". I'm thinking that's not good enough.

So I dig for a solution, a quicker glob(). There's none; the solution implemented by you is already the fastest possible solution. Then I run time (find modules -type f | wc -l) from the container to see if it's PHP being that inefficient - and it takes about 2 seconds to run.

I get suspicious and remember that we're running VirtioFS with Docker. I remove the mount, add a COPY instruction to the Dockerfile to make Docker use Linux-native filesystem and voila - the complete $this->search() for all of the files now takes 90-130 milliseconds :)

I'd still want to cache the result of $this->search(), as it may accumulate to almost a whole second if run multiple times. But it's still incredible how much performance is lost here.

oprypkhantc commented 3 months ago

Switching from VirtioFS is not easy, as there's no direct replacements that's free. So instead I was thinking of using a separate running binary to watch for file changes (thankfully these are fast) and writing a glob.json with the list of files that can then be loaded by PHP, to avoid an incredibly slow glob().

But that would also require overriding the search function of both finders. So I'm back to the custom FileFinder interface :)

oprypkhantc commented 3 months ago

So I did a couple of things in an attempt to further improve performance of the "second" iteration over the finder:

All this on 10k files with opcache enabled. The first iteration takes 6-7seconds (that is the glob() and autoloading/reflection), while all the consecutive calls only take ~20ms.

alekitto commented 3 months ago

Wow! Thanks for the huge research work!

I presume you're using macos, linux should not have these performance issues on mounted volumes.

Anyway, based on your last comment:

  • removed $info->isReadable() from both iterators, which cuts ~1 sec

We can probably get rid of this call, I don't think it is really useful anymore.

  • added a static cache for the result of $this->search(), which cuts ~2.5 sec
  • added a global cache for the result of class_exists(), which cuts another ~50-70ms

I was already thinking about a glob static cache, but I've discarded it as it wasn't a really good idea. Instead, I'm writing a cache interface to possibly serve all the caching purposes in the finders and the iterators. Ideally the implementation could accept a PSR-cache object and store/load the information from that. This way, you could be able to persist some information between requests.

I'll push a PR with this as soon as possible.

oprypkhantc commented 3 months ago

I presume you're using macos, linux should not have these performance issues on mounted volumes.

Yes :) And it's a constant battle against filesystem performance in Docker 🙃

We can probably get rid of this call, I don't think it is really useful anymore.

I'm guessing that call was useful for include_once(), so it can be moved inside the closure that calls include_once if you wish.

I was already thinking about a glob static cache, but I've discarded it as it wasn't a really good idea. Instead, I'm writing a cache interface to possibly serve all the caching purposes in the finders and the iterators. Ideally the implementation could accept a PSR-cache object and store/load the information from that. This way, you could be able to persist some information between requests.

That'd be awesome! Thank you :)

alekitto commented 3 months ago

Ok, I was working on the cache and found myself writing a class to abstract the glob logic, which is (I think) almost identical to your FileFinder class. So I began to think this is a pre-requisite to implement a viable cache solution.

Sorry that I understood this so late, after all this discussion. Do you mind to open a PR with your modifications?

oprypkhantc commented 3 months ago

Haha, no worries, it's your project in the end :)

I don't have modifications as a separate branch of class-finder, rather I have a simplified version of ComposerFinder, where I also combined PSR0/PSR4 iterators and some traits. It was fine for me to debug, as I didn't need most of the features that ComposerFinder offers, but obviously this is no good for a PR :(

I can still work on a PR though if you wish, just thought it'd make sense to notify you first.

alekitto commented 1 month ago

Hi @oprypkhantc, Sorry for the delay. I've just merged a commit that introduced a FileFinderInterface (with two implementations) that can be passed to the withFileFinder method on the finders.

When you have some time, could you please test it for your use-case and give me some feedback on it? Thanks

oprypkhantc commented 1 month ago

Hi @oprypkhantc,

Sorry for the delay. I've just merged a commit that introduced a FileFinderInterface (with two implementations) that can be passed to the withFileFinder method on the finders.

When you have some time, could you please test it for your use-case and give me some feedback on it?

Thanks

Thank you 🙏 Sorry for not participating in the making of the PR - I'm still busy with other tasks. I'll test it out.