Closed bdarcus closed 2 years ago
Unfortunately even though this is written in C, it still has O(n^2) behaviour because it implements the same algorithm as before. Even if it is 10x faster, that won't make up for the ~1000x slowdown from scanning every file in each directory repeatedly, once for each key.
The way around this would be to have a function that scans all the directories and constructs a hash table from keys to lists of associated filenames. Then, when formatting the candidates you would just look up each key in this hash table. This just has a ~2x slowdown, since you first scan all the files and then format all the candidates, but don't scan the files again for each candidate. I'll see if I can code that up this weekend.
The way around this would be to have a function that scans all the directories and constructs a hash table from keys to lists of associated filenames.
Is #457 then complementary to that, or a complete alternative?
I do think the code is cleaner, and perhaps it can be used to construct the hash?
If yes, then I'll merge it, and you can iterate on that.
It might make sense to use file-name-all-completions
when you're opening the file for a particular key. But I don't think it can be used to construct the hash table --- you don't have a key, but instead need to loop over all the files in a directory and use a regexp to extract the key from the filename.
Perhaps you could leave #457 unmerged, but I'll use the code from it? I'm not sure though, because performance is not that important then (you're only finding the files for a few keys at most) and it might be better to not have two different implementations of the file-matching logic for the two cases.
Perhaps you could leave #457 unmerged, but I'll use the code from it?
Sure. The main thing is splitting out the code into separate, simple, functions, that are easier to understand.
There's nothing special about it though.
Except it might be worth considering that one might want to use such functions outside of the context of formatting the candidates?
But I don't think it can be used to construct the hash table --- you don't have a key, but instead need to loop over all the files in a directory and use a regexp to extract the key from the filename.
Maybe I'm being dense, but why?
Look at the pseudocode for scanning the directory for associated files while generating the candidate strings:
for key in candidates: # ~2000 keys
for file in directory: # ~2000 files
does file start with key? # runs 4,000,000 times
This is a problem no matter how the inner loop is implemented. Even file-name-all-completions
internally has to loop over all files in the directory and check if they start with a particular prefix. The only way to avoid that inner loop is to just check for the existence of a small number of files, those named like "key.ext".
But if you do want to search for additional files, you have to avoid the nested loop by doing this:
for file in directory: # ~2000 files
key = key_from_filename(file)
# add (key, filename) to hash table
for key in candidates: # ~2000 keys
# check if key is present in hash table
Checking whether a key is in the hash table doesn't loop over all the entries, so this algorithm just has ~4000 operations, rather than the 4,000,000 we had before.
In this second implementation, you don't have a particular key for which you want to find all matching files (which would require that same inner loop as before). Instead, you have a filename and you need to figure out which key it belongs to.
Hope that makes sense.
So then the only way to be able to extract the key in that case is with a regex that assumes some separator (space, etc.).
Broader issue addressed via #460.
@localauthor @roshanshariff - I just stumbled on
file-name-all-completions
. I wonder if this would be at least theoretically faster for checking for file variants, since it's a built-in function written in C?Example from the docs:
Maybe something like this?
Any reason not to do this?