emacs-citar / citar

Emacs package to quickly find and act on bibliographic references, and edit org, markdown, and latex academic documents.
GNU General Public License v3.0
512 stars 55 forks source link

Use file-name-all-completions for find-additional? #456

Closed bdarcus closed 2 years ago

bdarcus commented 2 years ago

@localauthor @roshanshariff - I just stumbled on file-name-all-completions. I wonder if this would be at least theoretically faster for checking for file variants, since it's a built-in function written in C?

Example from the docs:

(file-name-all-completions "vim" "/usr/bin/")

("vimtutor" "vimdiff" "vim")

Maybe something like this?

(defun citar-file--find-additional-multi-dirs (key dirs extensions)
  "Return a list of files and variants for KEY from list of DIRS for EXTENSIONS."
  (seq-mapcat (lambda (dir) (citar-file--find-additional key dir extensions)) dirs))

(defun citar-file--find-additional (key dir extensions)
  "Return a list of files and variants from DIR for KEY for EXTENSIONS."
  (seq-filter
    (lambda (fn)
      (member (file-name-extension fn) extensions)) (file-name-all-completions key dir)))

Any reason not to do this?

roshanshariff commented 2 years ago

Unfortunately even though this is written in C, it still has O(n^2) behaviour because it implements the same algorithm as before. Even if it is 10x faster, that won't make up for the ~1000x slowdown from scanning every file in each directory repeatedly, once for each key.

The way around this would be to have a function that scans all the directories and constructs a hash table from keys to lists of associated filenames. Then, when formatting the candidates you would just look up each key in this hash table. This just has a ~2x slowdown, since you first scan all the files and then format all the candidates, but don't scan the files again for each candidate. I'll see if I can code that up this weekend.

bdarcus commented 2 years ago

The way around this would be to have a function that scans all the directories and constructs a hash table from keys to lists of associated filenames.

Is #457 then complementary to that, or a complete alternative?

I do think the code is cleaner, and perhaps it can be used to construct the hash?

If yes, then I'll merge it, and you can iterate on that.

roshanshariff commented 2 years ago

It might make sense to use file-name-all-completions when you're opening the file for a particular key. But I don't think it can be used to construct the hash table --- you don't have a key, but instead need to loop over all the files in a directory and use a regexp to extract the key from the filename.

Perhaps you could leave #457 unmerged, but I'll use the code from it? I'm not sure though, because performance is not that important then (you're only finding the files for a few keys at most) and it might be better to not have two different implementations of the file-matching logic for the two cases.

bdarcus commented 2 years ago

Perhaps you could leave #457 unmerged, but I'll use the code from it?

Sure. The main thing is splitting out the code into separate, simple, functions, that are easier to understand.

There's nothing special about it though.

Except it might be worth considering that one might want to use such functions outside of the context of formatting the candidates?

bdarcus commented 2 years ago

But I don't think it can be used to construct the hash table --- you don't have a key, but instead need to loop over all the files in a directory and use a regexp to extract the key from the filename.

Maybe I'm being dense, but why?

roshanshariff commented 2 years ago

Look at the pseudocode for scanning the directory for associated files while generating the candidate strings:

for key in candidates: # ~2000 keys
   for file in directory: # ~2000 files    
      does file start with key? # runs 4,000,000 times

This is a problem no matter how the inner loop is implemented. Even file-name-all-completions internally has to loop over all files in the directory and check if they start with a particular prefix. The only way to avoid that inner loop is to just check for the existence of a small number of files, those named like "key.ext".

But if you do want to search for additional files, you have to avoid the nested loop by doing this:

for file in directory: # ~2000 files
   key = key_from_filename(file)
   # add (key, filename) to hash table

for key in candidates: # ~2000 keys
   # check if key is present in hash table   

Checking whether a key is in the hash table doesn't loop over all the entries, so this algorithm just has ~4000 operations, rather than the 4,000,000 we had before.

In this second implementation, you don't have a particular key for which you want to find all matching files (which would require that same inner loop as before). Instead, you have a filename and you need to figure out which key it belongs to.

Hope that makes sense.

bdarcus commented 2 years ago

So then the only way to be able to extract the key in that case is with a regex that assumes some separator (space, etc.).

bdarcus commented 2 years ago

Broader issue addressed via #460.