crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.39k stars 1.62k forks source link

[RFC] Automatic Dir rewind #8504

Open didactic-drunk opened 4 years ago

didactic-drunk commented 4 years ago

Dir#methods give unexpected results when called more than once. The first call gives correct results. Subsequent calls without using #rewind return empty Array's.

The low level opendir API has this behavior. Should the high level Crystal API retain it? Is there a use case for having an empty array returned when attempting to use Dir#Iterable?

dir = Dir.new("spec")

p dir.entries
p dir.entries

Output:

# First call.
[".", "..", "std", "compiler_spec.cr", "manual", "support", "spec_helper.cr", ...]
# Second call.
[]
asterite commented 4 years ago

I think it makes sense. In Ruby it works the way you want.

However, it's a bit confusing:

$ irb
irb(main):001:0> dir = Dir.new(".")
=> #<Dir:.>
irb(main):002:0> dir.read
=> "."
irb(main):003:0> dir.read
=> ".."
irb(main):004:0> dir.entries
=> [".", "..", ".nix-channels", ".stack", ...]
irb(main):005:0> dir.read
=> nil

So after calling dir.entries the point you had with read is lost (you go straight to the end). But it's probably fine (so we need to call rewind before doing entries and others).

didactic-drunk commented 4 years ago

I've never seen the two API's mixed.

rewind before, got it.

straight-shoota commented 4 years ago

What about the yielding method #each? It would be feasible to combine it with .read. The values for #entries and #each should be equivalent (and the implementation depends on it).

IMO the suggested behaviour is not necessarily more valid than the current one.

I'd rather ask why you would want to call Dir#entries twice in the first place. It should usually be better to reuse the first result.

And I'd argue that for most common use cases, you don't need a Dir instance at all and just use the class method API. If you're using Dir instances, there must be a reason to it, which is usually that you want to control how to read entries. That means you should be aware that it uses an internal pointer that can be advanced and rewound.

I think the main problem here is that Dir instances expose two different APIs that can't be used in combination. So maybe we should consider if there is even a valid use case for having Dir#entries over Dir.entries. Removing #entries would avoid any weird behaviour because then there's only the regular Enumerable behaviour.

didactic-drunk commented 4 years ago

What about the yielding method #each? It would be feasible to combine it with .read.

Who would do that and why?

The values for #entries and #each should be equivalent (and the implementation depends on it). I'd rather ask why you would want to call Dir#entries twice in the first place. It should usually be better to reuse the first result.

Two places so far:

  1. In specs validating a sequence of steps that write several files in each step.
  2. Polling a directory for changes.

And I'd argue that for most common use cases, you don't need a Dir instance at all and just use the class method API. If you're using Dir instances, there must be a reason to it, which is usually that you want to control how to read entries. That means you should be aware that it uses an internal pointer that can be advanced and rewound.

Nope. I'm using #children or #each and never #read.

Crystal doesn't have portable file system event API so I'm stuck with polling. Even if it did I'd probably cache the hot Dir objects for reuse.

The current behavior is surprising. Should I need to know implementation details when I'm using a high level API? Is there a benefit to the current behavior? Mixing and matching #read with #each seems unlikely.

I think the main problem here is that Dir instances expose two different APIs that can't be used in combination. So maybe we should consider if there is even a valid use case for having Dir#entries over Dir.entries. Removing #entries would avoid any weird behaviour because then there's only the regular Enumerable behaviour.

Enumerable needs rewind to function, otherwise iteration 2..n are empty.

Also, I'm using #children more than once.

straight-shoota commented 4 years ago

What about the yielding method #each? It would be feasible to combine it with .read.

Who would do that and why?

#each is just a loop over #read. If I want to read multiple entries using the low level API, can just use #each.

  1. In specs validating a sequence of steps that write several files in each step.
  2. Polling a directory for changes.

The question is, why do you call #entries on an instance of Dir instead of Dir.entries? The second example could be a performance-sensitive use case where keeping the dir handle is more efficient. Caching the dir handle however may cause problems if the directory is removed (and recreated) or some file system shenanigans. Then the results from the accurate Dir.entries and fast Dir#entries would diverge. So caching a dir handle for repeated use may not actually be a good idea at all.

Nope. I'm using #children or #each and never #read.

These also exist as class methods and you don't necessarily need an instance. If you don't combine the instance methods with #read, there won't even be any surprises.

The current behavior is surprising.

So would be the proposed behaviour. There's no way to avoid surprises when mixing two different behaviours in a single API with shared state.

Should I need to know implementation details when I'm using a high level API?

I'd consider the class methods of Dir a high level API, but not the instance methods because they allow to operate the directory handler directly.

Is there a benefit to the current behavior? Mixing and matching #read with #each seems unlikely.

Mixing #read and #entries is also unlikely. Why do we need a change, then? When operating #read, you need to be aware of the dir handle position. And in that use case it's likely to assume #each and #entries to respect the current position, not rewind. If they're supposed to start from the beginning, a call to #rewind is necessary.

Enumerable needs rewind to function, otherwise iteration 2..n are empty.

That's not really an issue because Enumerable does not guarantee repeated calls to return the same results. And changes to the file system have the same effect, by the way.

Sija commented 4 years ago

IIUC the main concern is mixing internal impl. detail (iterator) vs external API (#entries) - which should be free of impl. side-effects, like a dir handle position.

RX14 commented 4 years ago

I was originally thinking about throwing away Dir and turn dirfds into just an Iterator instance you get from somewhere (File or Path). But we removed #rewind now, so that's not a feasible API any more.

Storing a dirfd handle is neccesary to avoid race conditions. It should be possible to iterate the entries, and open a file relative to a handle (the latter isnt possible currently). Conflating the directory handle and the iterator state is neccesary to accurately represent the API, unless we handle it "magically" internally by duping the fd before iterating with it.

didactic-drunk commented 4 years ago

The second example could be a performance-sensitive use case where keeping the dir handle is more efficient. Caching the dir handle however may cause problems if the directory is removed (and recreated) or some file system shenanigans. Then the results from the accurate Dir.entries and fast Dir#entries would diverge. So caching a dir handle for repeated use may not actually be a good idea at all.

Keeping the same dir handle if renamed is the desired behavior and necessary for state tracking. Not using the same handle would cause the problems you describe.

Nope. I'm using #children or #each and never #read.

These also exist as class methods and you don't necessarily need an instance. If you don't combine the instance methods with #read, there won't even be any surprises.

I need to track these across renames, which means keeping the dir handle.

The current behavior is surprising.

So would be the proposed behaviour. There's no way to avoid surprises when mixing two different behaviours in a single API with shared state.

Mine is a easy to hit I admit rarer case where the dir handle is reused, but the counter case you provide is so rare I can't see it being an issue ever.

To hit a bug with the case you provide requires:

  1. Mixing 2 API's. Why would someone do this?
  2. On the same dir handle.
  3. Using them in the exact wrong order. #entries in the middle of iterating using #each or #read. Who would do that?

Your counter point is unrealistic. No one does that.

The same bug would occur if someone tried mixing #each and #read, or two concurrent loops using #each or #read. There's no fix for it.

My use involves:

  1. Call dir.entries
  2. Do it again.

Or.

  1. Call dir.each
  2. Do it again.

That's a realistic use case. I've used it twice already and once is with old code from ruby.

Mixing #read and #entries is also unlikely. Why do we need a change, then? When operating #read, you need to be aware of the dir handle position. And in that use case it's likely to assume #each and #entries to respect the current position, not rewind. If they're supposed to start from the beginning, a call to #rewind is necessary.

I'm not. I'm using #entries or #each twice. Ruby doesn't have this issue. Maybe that's why my code broke when moving to crystal.

Ruby is proof the behavior I propose can work without common surprises.

Enumerable needs rewind to function, otherwise iteration 2..n are empty.

That's not really an issue because Enumerable does not guarantee repeated calls to return the same results. And changes to the file system have the same effect, by the way.

Ok. Is there a benefit to keeping it that way? Or should the API be easy to use? Can anyone name a benefit 2..n calls to #each having an empty set. What is the use case?

asterite commented 4 years ago

I think this is also a bit related to iterators.

$ irb
irb(main):001:0> i = [1, 2, 3].each
=> #<Enumerator: [1, 2, 3]:each>
irb(main):002:0> i.to_a
=> [1, 2, 3]
irb(main):003:0>
irb(main):003:0>
irb(main):003:0> i.to_a
=> [1, 2, 3]
irb(main):004:0> i.next
=> 1
irb(main):005:0> i.to_a
=> [1, 2, 3]
irb(main):006:0> i.next
=> 2
irb(main):007:0> i.next
=> 3
irb(main):008:0> i.to_a
=> [1, 2, 3]

In Ruby an "iterator" (Enumerator) keeps an internal state of the iterator, but methods like to_a, map, select, etc., always operate on the entire collection. Crystal doesn't behave like that. So I don't know if we should do it for Dir given that we don't do it for Iterator.

If you use entries twice and in the second time you get an empty list, you can call rewind. That's more explicit. And you avoid calling an extra rewind each time you iterate the dirs.

So in the end, I'm either fine with making this change and not making it. I don't think iterating a dir handle multiple times is such a common use case.

RX14 commented 4 years ago

Yes, this is how it is in crystal:

f = (1..5).each

f.to_a # => [1, 2, 3, 4, 5]
f.to_a # => []

However, iterators don't support any kind of rewind. Dir handles obviously do. So I'm not sure what to do. I think it's so subjective it must come down to a vote

:rocket: for rewind before iterating :eyes: to not rewind.

asterite commented 4 years ago

This is what I came up with to be able to implement this:

https://play.crystal-lang.org/#/r/8qg0

For each iterator we actually need two iterators, as explained in the comments there. And each time we return an iterator we need to wrap it in this Ruby iterator (for a lack of a better name).

We can definitely do it, but it’s quite a work, and we need to make sure someone implementing Iterator remembers to use this…

Any other ideas of how to implement this?

asterite commented 3 years ago

I think Dir#entries should rewind.

didactic-drunk commented 3 years ago

Where do we go from here?