Open bongohrtech opened 12 years ago
This requires too much of an API change to implement in 3.0.3. Unassigning from 3.0.3, and we can revisit it in a later release.
by ccurrens
We should probably look into doing this. There are several places in the code base where the dispose pattern is not used correctly in conjunction with iterators/enumerators.
I am not sure if it makes sense to make every type of iterator into an enumerator. For example BreakIterator is designed to navigate forward, backward, and to skip to specific points. Perhaps we could still implement IEnumerable
by nightowl888
How about something like...
public abstract class TermEnum : IEnumerable, IDisposable { public abstract bool Next(); public abstract Term Term(); public abstract int DocFreq(); public abstract void Close(); public abstract void Dispose(); public IEnumerator GetEnumerator() => new EnumEnumerator (Next, Dispose, () => new TermFreq { Term = Term(), DocFreq = DocFreq() }); IEnumerator IEnumerable.GetEnumerator() => GetEnumerator(); } public class TermFreq { public Term Term { get; set; } public int DocFreq { get; set; } } public class EnumEnumerator : IEnumerator { private readonly Func next; private readonly Action dispose; private readonly Func currentFactory; private bool started = false; public EnumEnumerator(Func next, Action dispose, Func currentFactory) { this.next = next; this.dispose = dispose; this.currentFactory = currentFactory; } public T Current => started ? currentFactory() : default(T); object IEnumerator.Current => Current; public bool MoveNext() => (started = next()); public void Reset() => throw new NotImplementedException(); public void Dispose() => dispose(); }
Excuse the c#7 lambda stuff, just used for brevity. Would expand if theory accepted.
The idea is to have a default generic Enumerator which can be used for any Iterator type. So, simply add the IEnumerable i/f and the two GetEnumerator methods returning the generic. It takes Next, Dispose and a factory to create the items.
The started member is to emulate the enumerator semantic of being null before the first call to MoveNext.
Enum's aren't resetable so just throw?
In this example I've kept the item type simple. But it would probably be implemented as an immutable.
Observation: There don't seem to be many Enum types (those that have bool Next()) I'm assuming other have already been converted?
Thoughts?
by andypook
The types I am aware of are the TermsEnum and DocIdSetIterator and their subclasses (which include DocsEnum and DocsAndPositionsEnum classes). There are probably a few others. Anything that is not public facing isn't worth the effort of refactoring, so there isn't much after these that haven't already been converted.
BTW - there is no longer a TermEnum class in Lucene 4+ - it has been replaced with TermsEnum.
The issue isn't so much with implementing IEnumerator
I looked at TermsEnum in some depth and I think the best solution there would be to divide it into "basic" and "advanced" APIs. Simply rename Terms.GetIterator(BytesRef) to Terms.GetEnumerator(BytesRef) and add an overload that accepts no parameter and satisfies the IEnumerable
Since a common use case is to loop forward, this would simplify some code:
Terms vector =TermsEnum termsEnum = vector.GetIterator(null); BytesRef text; while ((text = termsEnum.Next()) != null) { // use text }
becomes
Terms vector =foreach (BytesRef text in vector) { // use text }
But you could also call GetEnumerable(BytesRef) in order to reuse a BytesRef or to get to all of TermsEnum s goodies that aren't part of the IEnumerator
I don't think this same approach will work for DocsEnum or DocsAndPositionsEnum - it would be nice to find a solution that can be applied consistently. I am not so sure those types fit the mold of IEnumerator
Then again, there must be some reason why the Lucene designers strayed away from using Iterable
by nightowl888
Oh, one more thing. I was planning on getting to this after the next beta release. There are several bug fixes, optimizations, and 5 new NuGet packages I would like to push out (and hopefully the API docs) before getting to these breaking API changes.
by nightowl888
So I have a little time to get into this then
Probably nothing new here to you. Mainly just me catching up...
To your first point about GetEnumerator being somewhat hidden/opaque, isn't that just something that is the usual guidance in c#. ie don't mess with the collection from inside a for cos you'll break the enumerator.
In the fairly simple case in the code above you could...
(NB just using TermEnum as it's a simple example)
TermEnum vector =foreach (var item in vector) { var x = vector.Term(); var y = item.Term; // same thing as 'x' vector.DoStuff(); // but you should probably steer clear of using 'Next()' }
There's no need to extend/cast or get access to the enumerator. The enumerators job is to simply move a cursor forwards and support a language construct.
Of course this would make it still look weird to a c#/vb dev.
Current is supposed to return the current value of the cursor. Maybe it could return Current as the collection ('vector' above). So it would look like...
TermEnum vector =foreach (TermEnum item in vector) { var x = item.Term(); }
Again, weird. Doing the method -> property change would help. But, still feels odd to have the item being the same thing as vector.
Here lies the basic problem. The Lucene "iterable" classes represent all of:
Also, the need to allow the Current item to be; created on each iteration; reused; fetch from object pool; ...
I'm fairly sure that the EnumEnumerator (or versions of) can handle all that. I'd need to get further into the use cases.
Lastly, there's the observation that the while vs foreach examples you gave are not equivalent as the while version does not call Dispose().
I suspect that in current usage Dispose is never called. Or if it is it's because it's associated with files rather than because we're leaving a loop. Does that match your experience?
Thinking about it, maybe the enumerator Dispose should be a noop? It's odd for a type to be disposing it's "parent" (though I know some StreamWriter/Reader classes do, it's still odd/annoying). Again there's a scoping concept here that's muddled by the type being many things.
All very Tricky...
I think there's some combination of
It's going to be very hard to get the semantics of each type in a sane shape such that it's hard for a dev to get tripped up.
Which is why this have been open for 5+ years
I'll see if I can put together a proof of concept PR
by andypook
No problem. I am just a bit surprised given the choice for some prime green field development vs this headache that you would choose this path. But nonetheless, my aching head thanks you .
By now you have probably worked out that IBytesRefIterator should inherit IEnumerator
The only real trick is ensuring the other methods on TermsEnum and its subclasses are exposed when implementing the IEnumerable
Lastly, there's the observation that the while vs foreach examples you gave are not equivalent as the while version does not call Dispose().
That is just because TermsEnum doesn't implement IDisposable yet. Of course, that all changes when implementing IEnumerator
And there are several places throughout the codebase where enumerators are utilized without a using block or calling Dispose(). The tests aren't such a big concern, but we really shouldn't be doing this in the production code. I am sure there are some places where you can put your own implementation with its own unmanaged resources and watch the resource leaks pile up because we aren't doing our job and calling Dispose(). The real trick is locating them. I really wish Microsoft would put a feature in Visual Studio that searches for code that should exist but doesn't .
Then again, I haven't been using Code Analysis. Maybe that (or Resharper?) can be used to quickly detect all of the Dispose() violations. If you can find them, maybe that is a PR that should go ahead of the refactoring.
It's odd for a type to be disposing it's "parent" (though I know some StreamWriter/Reader classes do, it's still odd/annoying).
Agreed - when using Dispose() in conjunction with the decorator pattern things get odd. But how does that apply to TermsEnum? Disposing a superclass is not quite the same thing (and is rather expected using the dispose pattern).
All very Tricky...
I think there's some combination of
- isolating the "item" type
- using the collection object behaviors from inside the loop
- implementing the enumerator such that it is ok about the cursor being changed from outside
It's going to be very hard to get the semantics of each type in a sane shape such that it's hard for a dev to get tripped up.
In Java the "item" type is usually the one that is returned from the Next() method. But the real issue is all of the "extras" that come along with these iterators and how to expose them without casting the result of GetEnumerator(). Keep in mind the "extras" (ala LINQ) are usually part of the IEnumerable
If you go that route, note that because we are a port we had to redefine what "maintainable" means. So, when refactoring a class into many, the typical approach I follow is to drop all of that code into the same code file that is being refactored (in the same order as much as possible). This will make it easy to figure out where that code is when upgrading to the next Lucene version. The standard approach of putting each type into a file on its own would be chaos for a port version upgrade.
But you seem to have a handle on things. If there is a solution to be found here, I am sure you will be able to find it. If you need a sounding board to run your ideas by, I am here.
Which is why this have been open for 5+ years
Yea, I bet the decision to wait until the "next release" was a decision that came with some regret. Unfortunately, Lucene.Net struggled to get over the hump of getting a 4.x release because the project grew by a factor of 10. We are almost there now, and the path to catch up to Lucene 6.x is now clear considering how much smaller the change set is between here and there than the monster change set between 3.x and 4.x.
by nightowl888
Yeah, well, I thought, that's looks like it might be isolated, small. Ha!
I'll poke at it some more, see what turns up. But I may just punt and look for something else
I'm not sure I'm on the same wave length just yet...
"how does that apply to TermsEnum? Disposing a superclass is not quite the same thing"
But the enumerator is not a super class. It's a "child" of the TermsEnum (or whatever) created for a purpose. Disposing the child shouldn't dispose the "parent". I'm fairly sure that IEnumerator has Dispose so that things like listing files (managed resource) can be cleaned up. But still, it's about cleaning up resources used by the enumerator not by the parent collection type.
If you want to Dispose the collection type then that's a separate operation. There may be other behaviors available, appropriate on the collection type after you've looped some aspect of it.
Right ??
Also a bit confused around the "extras" topic. I don't see the problem. We'd get all the linq bits for free. You're right, they are generally defined on IEnumerable
I was thinking by "extras" you meant "other things on the collection type". By my theory those are either: part of the item and exposed there; or an op you call in the collection object from inside the loop.
What am I missing?
by andypook
Ahh...I think I now know where we are getting our wires crossed.
You have become a victim of the Java developers' incessant need to abbreviate everything (which annoys the crap out of me, but unfortunately fixing it is a losing battle).
Essentially, TermsEnum == IEnumerator
So, rather than loading up the IEnermerable
So, when disposing TermsEnum, we are doing the standard IEnumerator
by nightowl888
Yes, agreed. I think I wasn't quite using the right words to disambiguate.
So... take a breath... lots of Esomething words coming....
Yes, the Enum types have the characteristics of a c# IEnumerator (Next() etc).
But also represent the current item (via Term etc).
But also has additional "collection" type operations (SeekExact() etc) (are these part of what you were referring to as "extras"?).
It's this Enum class I was calling the "collection" type 'cos it kind of holds the collection state.
What I was proposing was to make the Enum also look like an IEnumerable so that you can foreach over it.
Which means it needs a GetEnumerable() to return an IEnumerator. This new enumerator is the EnumEnumerator type from in previous messages.
So, I'm taking an Enum, adding an IEnumerable facade, to return a new IEnumerable type which pokes the Enum to move forward.
It was the Dispose() of this new IEnumerator that I was referring to. Pretty convinced that this one should always be a no-op. Exiting the loop should not Dispose the Enum (ie the "collection") and as the new IEnumerator is just poking methods on the Enum, there's nothing for it to clean up.
Pretty sure I've got all the right E's in the right place
NB: some thought would be needed to define what the "item" class (representing Current) should look like in each case.
NB: Calling a Current changing method (ie one of the Seek...() methods) from inside a foreach would result in undefined behavior.
Phew... Did that make sense; fit with your view of the universe?
by andypook
It's this Enum class I was calling the "collection" type 'cos it kind of holds the collection state.
What I was proposing was to make the Enum also look like an IEnumerable so that you can foreach over it.
Which means it needs a GetEnumerable() to return an IEnumerator. This new enumerator is the EnumEnumerator type from in previous messages.So, I'm taking an Enum, adding an IEnumerable facade, to return a new IEnumerable type which pokes the Enum to move forward.
That is an interesting idea.
I am not sure if the state is actually being "iterated over" so much as it is being calculated on the fly. But then, IEnumerable
> NB: some thought would be needed to define what the "item" class (representing Current) should look like in each case.
> NB: Calling a Current changing method (ie one of the Seek...() methods) from inside a foreach would result in undefined behavior.
So then we end up with some behavior that doesn't fit the mold.
I thought for a moment that it could be possible these SeekCeil and SeekExact methods are only defined here because there are no such thing as extension methods in Java, but since they are overridden in subclasses, that is probably not a valid assumption.
Speaking of which, I suggest you have a look at how they are used in the TermsEnum overloads in the codecs (for example [Lucene40TermVectorsReader](https://github.com/apache/lucenenet/blob/468199e3fa95c7e1f77b14e7f00aeaafd7c2f8b9/src/Lucene.Net/Codecs/Lucene40/Lucene40TermVectorsReader.cs#L446)) or [Lucene45DocValuesProducer.TermsEnumAnonymousInnerClassHelper](https://github.com/apache/lucenenet/blob/468199e3fa95c7e1f77b14e7f00aeaafd7c2f8b9/src/Lucene.Net/Codecs/Lucene45/Lucene45DocValuesProducer.cs#L1090).
As you can see, the complexity of these enums becomes quite high (lots of internal state), and it seems to me it would be simpler to port future changes from Lucene if the general structure of the "extras" (that is, the properties and Seek* methods) were left alone (or at least left intact and used by some other type of facade). The DocsAndPositionsEnum s are even more complex than this.
The only thing I was hoping to get out of this was the added ability to use a foreach loop on Terms (and possibly the DocIdSetIterator). Sure we could refactor the TermsEnum data structure to be more .NET-like, but then where would we be when we need to connect the dots on the next Lucene upgrade? I am not necessarily against it, but finding bugs in the codec code is no picnic, so we would need some clear way to ensure future codecs behavior can be mapped onto the new structure without causing too much pain. Keep in mind, the codecs change in every major (and nearly every minor) version.
Phew... Did that make sense; fit with your view of the universe?
Wow, do I sound so closed minded? I think I need to work on my delivery .
by nightowl888
Wow, do I sound so closed minded?
Ha. No not at all more worried I just not seeing the whole picture.
not sure if the state is actually being "iterated over" ...
Well, in so far as Next moves a set of values one step forward, it can certainly be thought of that way. I think of these like a cursor over a sql table. You can foreach over the rows. How the data is actually retrieved is abstracted. It could be one at a time, batches, all of them, cached... At the foreach level you just call Next to inspect the values of the next item.
IEnumerable
is just an interface - it doesn't actually define any behavior.
Right, this is one of those language syntax enabling things. You actually don't need to specify the interface. As long as the class implements GetEnumerator() then foreach will work.
So then we end up with some behavior that doesn't fit the mold.
True. But that's completely normal.
Imagine an enumerator over a Stream that pulls a line of text at a time. What would happen if you called myStream.Seek(...) inside the loop? Or a foreach over a Dictionary and call myDictionary.Add(...)? The behavior is to either throw (because the Dictionary.Enumerator protects against this) or some undefined behavior depending on which Stream/Reader or underlying storage is being used. Having overridden implementations with different behaviors is also to be expected. Many hours of fun to be had following Stream/TextReader etc. There's no guarantee that descendant classes will have similar behavior at he edges.
My theory was to enable foreach that would produce items representing the values of the Enum that change when its Next is called. This would be entirely an overlay/facade. All the existing members would remain presenting a dual api.
This way copy/pasting code from Java is still possible (modulo c# style properties). But the while style can be refactored to foreach. Or green field dev can be done in whichever style the dev likes.
All linq style stuff will work. I don't think the "extra" (Seek etc) are a problem. Most of them aren't composable as they return a single item or perfrom some op over the "collection". I mean for example you probably wouldn't call SeekExact from inside a foreach, in the same way you would use a dictionary indexer if you were foreach'ing over it. Or call Sort inside a foreach over a List.
Those that are composable (ie the various Docs methods) will work (assuming the DocsEnum is converted. So you could image terms.Docs(...).Where(d=>...).Select(...) working. But I would expect them to be at the beginning of the chain.
I guess you could image a few extra linq style extensions for special cases. But once you're into enumerable land it does cause some restrictions. Unless you want to get into rewriting expressions trees (runs away screaming).
You are quite right, trying to fit this in such that it's consistent across the class hierarchies is going to be a challenge for sure
I'm away this weekend but I'll try to put together a PR with a few examples/tests new week to see if it'll work out.
by andypook
GitHub user AndyPook opened a pull request:
https://github.com/apache/lucenenet/pull/212
GH-469 - enumerables
See https://issues.apache.org/jira/projects/LUCENENET/issues/GH-469 for background.
This is an early stage proposal for implementing IEnumerable
This prototype provides `EnumEnumerator
I have modified `IBytesRefIterator` and `TermsEnum` to add `IEnumerable
The test (such as it is) demonstrates retrieving a `TermsEnum` from a reader. Then using `foreach` and a simple `.Count()` linq operator.
If this approach is acceptable it ought to be adaptable to other Enum types
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/AndyPook/lucenenet GH-469
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/lucenenet/pull/212.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #212
commit 008df498bcba7e9940ca97efc83d656ddb03b2c3
Author: Andy Pook
Date: 2017-08-14T21:32:45Z
simple editorconfig
ensures spaces for indent
commit 59646be0cf8fbc7e8ad64f395f451a11aca57fc6
Author: Andy Pook
Date: 2017-08-14T21:34:48Z
add EnumEnumerator helper
commit 526ddaf8c837f82cd141324e922372155914701b
Author: Andy Pook
Date: 2017-08-14T21:35:32Z
add IEnumerable to TermsEnum/IBytesRefArray
commit 456df8e8347ef67055b712722b50cc3b1bbf26cc
Author: Andy Pook
Date: 2017-08-14T21:36:54Z
add header, fix usings
commit 1a1c25da7cbcd112d12d6c418c669786cdf4088c
Author: Andy Pook
Date: 2017-08-14T22:25:15Z
add example test
by githubbot
added PR https://github.com/apache/lucenenet/pull/212 which I am sure will need a lot of work. The intent at this point is to just demonstrate the approach
by andypook
Removing fix version of 4.8. This is a general lucene.net task and we should not delay 4.8 release.
by laimis
The Iterator pattern in Java is equivalent to IEnumerable in .NET. Classes that were directly ported in Java using the Iterator pattern, cannot be used with Linq or foreach blocks in .NET.
Next() would be equivalent to .NET's MoveNext(), and in the below case, Term() would be as .NET's Current property. In cases as below, it will require TermEnum to become an abstract class with Term and DocFreq properties, which would be returned from another class or method that implemented IEnumerable .
would instead look something like:
Keep in mind that it is important that if the class being converted implements IDisposable, the class that is enumerating the terms (in this case TermEnum) should inherit from both IEnumerable and IDisposable. This won't be any change to the user, as the compiler automatically calls IDisposable when used in a foreach loop.
JIRA link - [LUCENENET-469] created by ccurrens