danielfalk / smart-open.nvim

Neovim plugin for fast file-finding
MIT License

Not all files are searchable? #13

Closed: NullVoxPopuli closed this issue 1 year ago

NullVoxPopuli commented 1 year ago

I'm in a rather large monorepo, and it seems the max number of files indexed is 6505.

My monorepo has multiples of tens of thousands of files 🙃

But anywho, I can type full paths into the smart-open box, and the file won't show up.


Update:

Aha, I found this in the README:

max_unindexed (default: 6500)

Stop scanning the current directory when max_unindexed files have been found. This limit is in place to prevent performance problems when run from a directory with an excessive number of files under it.

So I need to make this something like 100_000 🙃
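
In config terms, that would look something like this (a sketch; the option name and extension setup come from the README and my config further down, and 100000 is just the value I'd try):

  require("telescope").setup({
    extensions = {
      smart_open = {
        -- raise the scan cutoff from the default of 6500
        max_unindexed = 100000,
      },
    },
  })
  require("telescope").load_extension("smart_open")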

NullVoxPopuli commented 1 year ago

Though, it says max un-indexed. Once files are indexed, shouldn't it scan the next batch of 6500?

danielfalk commented 1 year ago

Files aren't indexed until they've been put into the database, and that doesn't happen unless they've been opened. I will make a note to make this clearer in the docs. Thanks for your report.
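
To illustrate, the "indexed only when opened" idea is roughly equivalent to something like this (a sketch, not the plugin's actual code; record_in_database is a hypothetical helper):

  -- Sketch only: a file enters the history database once a buffer opens it.
  vim.api.nvim_create_autocmd("BufReadPost", {
    callback = function(args)
      local path = vim.api.nvim_buf_get_name(args.buf)
      if path ~= "" then
        record_in_database(path) -- hypothetical helper writing to the sqlite history
      end
    end,
  })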

The 6500 number was (in my tests) a good balance between performance and thoroughness. The reason for having a limit is that if someone opens the plugin in, say, the root directory, it will scan a huge number of files, tanking performance.

We're sourcing our list of files mainly from two places: the history database of previously-opened files, and a live scan of the current directory.

There is no background process that indexes your directories. While possible, it's challenging for all the usual reasons, and it would likewise be challenging to update such an index during a scan while keeping everything performant and up to date. If anyone can come up with a way, I'd take that PR of course 😁
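
For clarity, the two-source model amounts to something like this (a minimal sketch, not the plugin's actual code):

  local function merge_sources(db_files, scanned_files)
    local seen, results = {}, {}
    for _, f in ipairs(db_files) do
      seen[f.path] = true          -- indexed: has history data from the database
      table.insert(results, f)
    end
    for _, path in ipairs(scanned_files) do
      if not seen[path] then       -- unindexed: found only by the directory scan
        table.insert(results, { path = path })
      end
    end
    return results
  end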

NullVoxPopuli commented 1 year ago

Could a background indexer run while the file search window is open, then cancel/persist when the window closes, doing a final update based on the selection?

I guess I don't know if Lua offers this sort of behavior -- JavaScript doesn't even have cancellable promises :upside_down_face: (though there are cancellable abstractions, like https://ember-concurrency.com/docs/introduction/)

like, in JS, I imagine something like:

// a cancellable task, ember-concurrency style
const indexTask = task(function* () {
  try {
    yield index(); // cancellable while the picker is open
  } finally {
    updateDb(); // runs even on cancellation, persisting progress so far
  }
});
danielfalk commented 1 year ago

The goal of this plugin is primarily speed. The picker should show up quickly, and it should give you the files you're most likely to want with the least amount of typing and moving between items, because those things are slow. Even the idea behind using a single keybinding is so that you can reflexively pop it up and jump to something immediately, without having to think about whether it's in your recent files, some list of favorites, your buffer list, or your project.

After a couple of experiments, I see two issues:

  1. Allowing an unlimited number of files makes for slow matching. Neither native_fzy nor fzf is very responsive once you allow file counts over 10 or 20k, and it only gets worse as more are added. This is a particular problem for someone who opens Neovim in their home directory, because it's bound to scan a huge number of files and bog everything down, making the "quickly switch between buffers" use case impossible.
  2. With large directories, it can take several seconds to build a file list. This will vary widely with disk speed, but I'm seeing roughly 1 second per 100k files, and we want the popup to show results in well under a second, so we'd need to figure something out here. Loading a large set of files all at once also creates enough I/O that telescope's previewer often times out when previewing the initial file.

Does this mean it's impossible to support larger repositories? No, but as things stand, the default max_unindexed value keeps the primary goal working in the more common cases. If you want to open Neovim from the base directory of a large monorepo, you can simply increase this option and accept that it will bog down the experience in such large repos (as well as in directories higher up, such as ~ and /).

It should even be possible to remove the max_unindexed option altogether. We'd just have to figure a few things out:

  1. We'd want to throttle (but not limit) the directory scan a little, so telescope has enough time to display some preliminary results along with a preview (see the sketch after this list).
  2. As more files get scanned, they should be updated "live" in the picker, in proper ranking order.
  3. fzy/fzf are doing too much work matching strings. There needs to be some way of cutting down the number of files the match algorithm has to consider on each new keystroke.

If someone can solve these three problems, we'd have a great solution, but I don't think I'll be up for doing that any time soon. Looking for a contribution here.
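
Here's a rough sketch of what point 1 could look like: a chunked scan that defers back to the event loop between batches so telescope can render preliminary results. Names like scan_dir_async, on_batch, and batch_size are made up for illustration:

  local uv = vim.loop

  local function scan_dir_async(root, on_batch, batch_size)
    batch_size = batch_size or 500
    local pending, found = { root }, {}

    local function step()
      local count = 0
      while #pending > 0 and count < batch_size do
        local dir = table.remove(pending)
        local handle = uv.fs_scandir(dir)
        while handle do
          local name, type = uv.fs_scandir_next(handle)
          if not name then break end
          if type == "directory" then
            table.insert(pending, dir .. "/" .. name)
          else
            table.insert(found, dir .. "/" .. name)
            count = count + 1
          end
        end
      end
      on_batch(found)            -- hand partial results to the picker
      if #pending > 0 then
        vim.defer_fn(step, 10)   -- yield so the UI can redraw between batches
      end
    end

    step()
  end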

danielfalk commented 1 year ago

An update on this: I've started working on the problem, and I think it's possible to use a combination of performance optimizations and the new multithreading support to scale up the number of files we can search. This will probably take some time to get right, but it now appears possible to handle at least 100k file entries.
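
To give a sense of what that multithreading support enables, here's a small illustration using luv's thread pool (uv.new_work), which Neovim exposes via vim.loop. This is not smart-open's actual implementation; the plain substring check stands in for real fuzzy scoring, and the file list travels as a newline-joined string because only primitives cross the thread boundary:

  local uv = vim.loop

  local matcher = uv.new_work(
    -- runs on a worker thread, off the main loop
    function(prompt, entries_blob)
      local hits = {}
      for path in entries_blob:gmatch("[^\n]+") do
        if path:find(prompt, 1, true) then
          hits[#hits + 1] = path
        end
      end
      return table.concat(hits, "\n")
    end,
    -- runs back on the main loop, where it's safe to touch the UI
    function(hits_blob)
      for path in hits_blob:gmatch("[^\n]+") do
        print("match: " .. path)
      end
    end
  )

  uv.queue_work(matcher, "init", table.concat({ "lua/init.lua", "README.md" }, "\n"))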

NullVoxPopuli commented 1 year ago

Excitement!

danielfalk commented 1 year ago

I've just pushed a 0.2.x branch that removes max_unindexed and uses multithreading to support larger directories. Searching can of course still be somewhat slower, but it no longer feels unresponsive, since keystrokes are still registered while scanning.

Anyway, there are a lot of changes in this branch, so I'm hoping to get some more usage testing on it before releasing it. So if you could try it out, I'd appreciate it.

(It has a couple other changes as well that are likewise experimental, such as a different way of formatting entries.)

NullVoxPopuli commented 1 year ago

Thanks for doing that!

So, I finally got around to testing that branch, and one odd thing I noticed is that the list of results always seems capped:

https://user-images.githubusercontent.com/199018/226700850-122c350a-3293-45b4-b545-0a34138dd90f.mp4

danielfalk commented 1 year ago

Thanks for getting back. I think even in the previous version I was capping the result count, because telescope appeared to be slower with larger result sets. Would it work if we just increased the cap? (The way I figure it, if you have to scroll to the 100th item, the search isn't really doing you any favors in the first place...)

Incidentally, one of the biggest (if not the biggest) slowdowns when searching larger numbers of entries was, surprisingly, the time it takes to sort all items after ranking them. Largely due to Lua's multithreading limitations, sorting has to happen on the main thread, so limiting the number of results speeds it up.
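
One standard way around a full sort is a top-k selection pass: keep only the best k entries as you go, which is roughly O(N*k) instead of O(N log N) and cheap when k is small. A minimal sketch, not the plugin's code:

  -- scored is a list of { score = number, ... } entries
  local function top_k(scored, k)
    local best = {}  -- kept sorted, highest score first
    for _, item in ipairs(scored) do
      local pos = #best + 1
      for i, b in ipairs(best) do
        if item.score > b.score then pos = i; break end
      end
      if pos <= k then
        table.insert(best, pos, item)
        if #best > k then table.remove(best) end  -- drop the worst entry
      end
    end
    return best
  end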

NullVoxPopuli commented 1 year ago

It seems on the v2 branch I have duplicate results? :upside_down_face:

[screenshot: duplicated entries in the picker results]

danielfalk commented 1 year ago

This is weird behavior. I'm using the v2 branch myself and haven't experienced it, so I'm not sure. Any ideas what might be the cause, such as anything out of the ordinary about the directory you're working in? It's also strange that you have a blank entry there.

NullVoxPopuli commented 1 year ago

Something else I noticed -- I'm getting results from beyond my CWD/PWD. Is there a config option to disallow that? I only want searching to be scoped to my CWD. :thinking:

It feels like the v2 branch is missing functionality from the regular branch.

Here is my config: https://github.com/NullVoxPopuli/dotfiles/blob/main/home/.config/nvim/lua/plugins/finding.lua#L18-L41

> Any ideas what might be the cause, such as anything out of the ordinary about the directory you're working in etc.?

Not really -- just a GitHub monorepo with some JavaScript projects that uses GitHub Actions.

NullVoxPopuli commented 1 year ago

New computer, new repos to test on -- I'll keep you updated on my v2 explorations (I only have the v2 branch specified locally).

Current config:

  use {
    "danielfalk/smart-open.nvim",
    branch = '0.2.x',
    config = function()
      local telescope = require('telescope')

      telescope.load_extension("smart_open")
      -- telescope.load_extension('fzy_native')
      telescope.setup({
        extensions = {
          smart_open = {
            cwd_only = true,
            max_unindexed = 50000,
            ignore_patterns = {
              "*.git/*", "*/tmp/",
              "*/vendor/",
              "*/dist/*",  "*/declarations/*", "*/node_modules/*"
            }
          },
          -- fzy_native = {
          --   override_generic_sorter = false,
          --   override_file_sorter = true,
          -- }
        }
      })
    end,
    requires = {
      -- "nvim-telescope/telescope-fzy-native.nvim",
      "nvim-telescope/telescope.nvim",
      -- sudo apt install sqlite3 libsqlite3-dev
      "kkharji/sqlite.lua",
      "kyazdani42/nvim-web-devicons"
    }
  }
danielfalk commented 1 year ago

It's on my TODO list to fix whatever is causing this, assuming we can find a cause. I think it must be some sort of race condition, but I'm not sure. Once this is fixed, I can start recommending 0.2.x for everyone.

danielfalk commented 1 year ago

I think I figured it out. The latest 0.2.x branch should have a fix for this, and @NullVoxPopuli you should also be able to use your config snippet above to configure cwd_only. If you want to check it out, let me know if that works!