bluesky / tiled

API to structured data
https://blueskyproject.io/tiled
BSD 3-Clause "New" or "Revised" License
59 stars 50 forks source link

Expose a good interface for ignoring files #381

Open padraic-shafer opened 1 year ago

padraic-shafer commented 1 year ago

When ignore_re_files is passed as an argument to DirectoryAdapter.from_directory(), files matching the RegEx pattern are the only files that get served.

From the name of the parameter and its documentation, I would have expected the opposite behavior--namely that these files would be ignored and not served.

https://github.com/bluesky/tiled/blob/52bede5ea1768b83807556ec7cae8bc147535dc4/tiled/adapters/files.py#L141-L146

danielballan commented 1 year ago

Huh. I wonder how that went backward. I agree that this is clearly a bug.

padraic-shafer commented 1 year ago

Here's my best guess so far.. (I haven't had a chance to test any of this. Documenting it here for whenever someone has time.)

https://github.com/bluesky/tiled/blob/7c6a8dd06d5eb23fac8402a68aedc467983831f2/tiled/adapters/files.py#L287-L318

On the other hand, watchgod is only keeping tabs on the files/directories that are matching the ignore parameters; so that aspect would need modifications as well.

https://github.com/bluesky/tiled/blob/7c6a8dd06d5eb23fac8402a68aedc467983831f2/tiled/adapters/files.py#L454

Here is the relevant code from watchgod.watcher.RegExpWatcher 0.8.2

class RegExpWatcher(AllWatcher):
    def __init__(self, root_path: Union[str, Path], re_files: Optional[str] = None, re_dirs: Optional[str] = None):
        self.re_files: Optional[Pattern[str]] = re.compile(re_files) if re_files is not None else re_files
        self.re_dirs: Optional[Pattern[str]] = re.compile(re_dirs) if re_dirs is not None else re_dirs
        super().__init__(root_path)

    def should_watch_file(self, entry: 'DirEntry') -> bool:
        if self.re_files is not None:
            return bool(self.re_files.match(entry.path))
        else:
            return super().should_watch_file(entry)

    def should_watch_dir(self, entry: 'DirEntry') -> bool:
        if self.re_dirs is not None:
            return bool(self.re_dirs.match(entry.path))
        else:
            return super().should_watch_dir(entry)
padraic-shafer commented 1 year ago

From my point of view, specifying which files/directories to match is just as valid as which ones to ignore. So the simple fix might be to just rename the parameter to include_re_*.

padraic-shafer commented 1 year ago

Here is the relevant code from watchgod.watcher.RegExpWatcher 0.8.2

Btw, at some point will probably need to migrate from watchgod to watchfiles.

https://github.com/bluesky/tiled/blob/23cde094b737987e5ec30d497c87563a5d6dd0d5/requirements-server.txt#L30

danielballan commented 1 year ago

Thanks for digging into this.

I've been vaguely aware (and feeling a bit guilty) that the watchgod-related parameters are under-tested. This is a relic of the early stage of Tiled development where we just wanted to know, "Is this idea even coherent?!" It needs to be revisited with more thought to design.

I think, along with #300, we will need a significant overhaul of the tiled.adapters.files. In the back of my mind, I was thinking we might migrate to wtachfiles as part of that overhaul, and rethink the user-facing parameters as well. Perhaps a simple callable that return a boolean for "include or not" would be best.

padraic-shafer commented 1 year ago

All good :)

A project this size is bound to have some corners to tidy up. On balance, tiled is a fantastic achievement that simplifies finding and accessing datasets.

danielballan commented 1 year ago

As promised #511 completely reworks the file-walker and renders the original issue here moot.

I provides a filter argument that gets a function filter(filepath) -> bool. It uses the same filter for:

  1. The initial crawl of the directory, which does not use watchfiles
  2. Watching the directory, using watchfiles

As a low-level, maximally flexible interface, I think this simple function is correct. But it probably makes sense to expose a glob or regex pattern interface, especially to the CLI, e.g.

tiled serve directory --ignore *.stuff

That is not done in #511. I will re-scope this issue to address that, as some of the considerations raised above may be helpful.

padraic-shafer commented 1 year ago

I started trying to implement this with --include and --exclude options, similar to how they are used in rsync. However, typer does not seem to support preserving the order of these optional parameters when they are interspersed on the command line. That is, it will provide an ordered list of "include" values and a separate ordered list of "exclude" values, but the order of interdigitated "include/exclude" values gets lost.

A workaround might be to:

Another workaround might use @typer.Typer().callback() to hijack the command line context and manually extract --include and --exclude options in the order they were specified. This seems pretty hacky though, and counterintuitive to using a nice CLI parser.


I'm looking for suggestions.

FWIW, I was thinking of using --include & --exclude for shell-style expansions (?, , , ) and --include-re & --exclude-re for RegEx patterns.

padraic-shafer commented 1 year ago

Another workaround might use @typer.Typer().callback() to hijack the command line context and manually extract --include and --exclude options in the order they were specified. This seems pretty hacky though, and counterintuitive to using a nice CLI parser.

After looking into this a bit more, this approach looks fairly straightforward. Here is a related solution that subclasses the underlying click.Command class to override the parsing behavior.

danielballan commented 1 year ago

I like supporting shell-style and regex-style options, and I like the names proposed.

I'm of two minds about subclassing click.Command. It's good that there is a supported pathway for this, but I would feel more confident that it is worth the added depth-of-integration if we had a couple use cases. I might be inclined to wait until we need it and start with "they're mutually exclusive". OTOH, it is nice to start with a fully complete and correct implementation.

I'm curious what others think here.

padraic-shafer commented 1 year ago

For simplicity we could start with implementing only an "exclude" option. I imagine that most use cases involve exposing all files / objects in the container while hiding only a few exceptions (such as dot files or metadata sidecars).

I'm inclined to start with the RegEx version because using wildcards on the CLI will require protecting them from unintended shell expansion. Wrapping the parameters with single-quotes seems like a tedious, error-prone step that might frequently be overlooked. I'm rethinking whether the shell-style option makes sense here; maybe it should be limited to entries in a configuration file provided to an --exclude-file option?