bongohrtech / lucenenet

Mirror of Apache Lucene.Net
Apache License 2.0
0 stars 0 forks source link

Convert Java Iterator classes to implement IEnumerable<T> #137

Open bongohrtech opened 12 years ago

bongohrtech commented 12 years ago

The Iterator pattern in Java is equivalent to IEnumerable in .NET. Classes that were directly ported in Java using the Iterator pattern, cannot be used with Linq or foreach blocks in .NET.

Next() would be equivalent to .NET's MoveNext(), and in the below case, Term() would be as .NET's Current property. In cases as below, it will require TermEnum to become an abstract class with Term and DocFreq properties, which would be returned from another class or method that implemented IEnumerable.

 
    public abstract class TermEnum : IDisposable
    {
        public abstract bool Next();
        public abstract Term Term();
        public abstract int DocFreq();
        public abstract void  Close();
    public abstract void Dispose();
    }

would instead look something like:

 
    public class TermFreq
    {
        public abstract Term { get; }
        public abstract int { get; }
    }

public abstract class TermEnum : IEnumerable, IDisposable
{
// ...
}

Keep in mind that it is important that if the class being converted implements IDisposable, the class that is enumerating the terms (in this case TermEnum) should inherit from both IEnumerable and IDisposable. This won't be any change to the user, as the compiler automatically calls IDisposable when used in a foreach loop.

JIRA link - [LUCENENET-469] created by ccurrens

bongohrtech commented 12 years ago

This requires too much of an API change to implement in 3.0.3. Unassigning from 3.0.3, and we can revisit it in a later release.

by ccurrens

bongohrtech commented 7 years ago

We should probably look into doing this. There are several places in the code base where the dispose pattern is not used correctly in conjunction with iterators/enumerators.

I am not sure if it makes sense to make every type of iterator into an enumerator. For example BreakIterator is designed to navigate forward, backward, and to skip to specific points. Perhaps we could still implement IEnumerable but also create other interfaces with extra functionality. In any case, this is a major API change that should be looked at before the 4.8.0 release.

by nightowl888

bongohrtech commented 7 years ago

How about something like...

 public abstract class TermEnum : IEnumerable, IDisposable
    {
        public abstract bool Next();
        public abstract Term Term();
        public abstract int DocFreq();
        public abstract void Close();
        public abstract void Dispose();

        public IEnumerator GetEnumerator() => new EnumEnumerator(Next, Dispose, () => new TermFreq { Term = Term(), DocFreq = DocFreq() });

        IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
    }

    public class TermFreq
    {
        public Term Term { get; set; }
        public int DocFreq { get; set; }
    }

    public class EnumEnumerator : IEnumerator
    {
        private readonly Func next;
        private readonly Action dispose;
        private readonly Func currentFactory;

        private bool started = false;

        public EnumEnumerator(Func next, Action dispose, Func currentFactory)
        {
            this.next = next;
            this.dispose = dispose;
            this.currentFactory = currentFactory;
        }

        public T Current => started ? currentFactory() : default(T);

        object IEnumerator.Current => Current;

        public bool MoveNext() => (started = next());

        public void Reset() => throw new NotImplementedException();

        public void Dispose() => dispose();
    }

Excuse the c#7 lambda stuff, just used for brevity. Would expand if theory accepted.

The idea is to have a default generic Enumerator which can be used for any Iterator type. So, simply add the IEnumerable i/f and the two GetEnumerator methods returning the generic. It takes Next, Dispose and a factory to create the items.
The started member is to emulate the enumerator semantic of being null before the first call to MoveNext.
Enum's aren't resetable so just throw?
In this example I've kept the item type simple. But it would probably be implemented as an immutable.

Observation: There don't seem to be many Enum types (those that have bool Next()) I'm assuming other have already been converted?

Thoughts?

by andypook

bongohrtech commented 7 years ago

The types I am aware of are the TermsEnum and DocIdSetIterator and their subclasses (which include DocsEnum and DocsAndPositionsEnum classes). There are probably a few others. Anything that is not public facing isn't worth the effort of refactoring, so there isn't much after these that haven't already been converted.

BTW - there is no longer a TermEnum class in Lucene 4+ - it has been replaced with TermsEnum.

The issue isn't so much with implementing IEnumerator as the fact that once you use a foreach loop and it calls GetEnumerator(), there would need to be a cast to get to the functionality other than IEnumerator. And inside of a foreach loop you don't have access to the IEnumerator instance, so there is no way to cast to get the rest of the functionality.

I looked at TermsEnum in some depth and I think the best solution there would be to divide it into "basic" and "advanced" APIs. Simply rename Terms.GetIterator(BytesRef) to Terms.GetEnumerator(BytesRef) and add an overload that accepts no parameter and satisfies the IEnumerable contract that internally calls Terms.GetEnumerator(null). The overload that accepts a parameter can then return a TermsEnum type, which will expose the additional members that are hidden in the .NETified overload.

Since a common use case is to loop forward, this would simplify some code:

Terms vector = 
TermsEnum termsEnum = vector.GetIterator(null);
BytesRef text;
while ((text = termsEnum.Next()) != null)
{
    // use text
}

becomes

Terms vector = 
foreach (BytesRef text in vector)
{
    // use text
}

But you could also call GetEnumerable(BytesRef) in order to reuse a BytesRef or to get to all of TermsEnum s goodies that aren't part of the IEnumerator contract.

I don't think this same approach will work for DocsEnum or DocsAndPositionsEnum - it would be nice to find a solution that can be applied consistently. I am not so sure those types fit the mold of IEnumerator anyway, but perhaps there is some way to refactor them to fit.

Then again, there must be some reason why the Lucene designers strayed away from using Iterable in this part of the design - they may have intentionally done it this way for some odd reason.

by nightowl888

bongohrtech commented 7 years ago

Oh, one more thing. I was planning on getting to this after the next beta release. There are several bug fixes, optimizations, and 5 new NuGet packages I would like to push out (and hopefully the API docs) before getting to these breaking API changes.

by nightowl888

bongohrtech commented 7 years ago

So I have a little time to get into this then

Probably nothing new here to you. Mainly just me catching up...

To your first point about GetEnumerator being somewhat hidden/opaque, isn't that just something that is the usual guidance in c#. ie don't mess with the collection from inside a for cos you'll break the enumerator.

In the fairly simple case in the code above you could...
(NB just using TermEnum as it's a simple example)

TermEnum vector = 
foreach (var item in vector)
{
    var x = vector.Term();
    var y = item.Term;  // same thing as 'x'
 vector.DoStuff();
    // but you should probably steer clear of using 'Next()'
}

There's no need to extend/cast or get access to the enumerator. The enumerators job is to simply move a cursor forwards and support a language construct.
Of course this would make it still look weird to a c#/vb dev.
Current is supposed to return the current value of the cursor. Maybe it could return Current as the collection ('vector' above). So it would look like...

TermEnum vector = 
foreach (TermEnum item in vector)
{
    var x = item.Term();
}

Again, weird. Doing the method -> property change would help. But, still feels odd to have the item being the same thing as vector.

Here lies the basic problem. The Lucene "iterable" classes represent all of:

Also, the need to allow the Current item to be; created on each iteration; reused; fetch from object pool; ...
I'm fairly sure that the EnumEnumerator (or versions of) can handle all that. I'd need to get further into the use cases.

Lastly, there's the observation that the while vs foreach examples you gave are not equivalent as the while version does not call Dispose().
I suspect that in current usage Dispose is never called. Or if it is it's because it's associated with files rather than because we're leaving a loop. Does that match your experience?
Thinking about it, maybe the enumerator Dispose should be a noop? It's odd for a type to be disposing it's "parent" (though I know some StreamWriter/Reader classes do, it's still odd/annoying). Again there's a scoping concept here that's muddled by the type being many things.

All very Tricky...

I think there's some combination of

It's going to be very hard to get the semantics of each type in a sane shape such that it's hard for a dev to get tripped up.
Which is why this have been open for 5+ years

I'll see if I can put together a proof of concept PR

by andypook

bongohrtech commented 7 years ago

No problem. I am just a bit surprised given the choice for some prime green field development vs this headache that you would choose this path. But nonetheless, my aching head thanks you .

By now you have probably worked out that IBytesRefIterator should inherit IEnumerator and that implementing that interface isn't much more complicated than dividing the 1 BytesRef Next(); method into bool MoveNext(); and BytesRef Current { get;}. The other fun stuff such as SeekCeil() and SeekExact() most likely won't be affected. And yes, this means the Dispose() method should be a no-op in our case, but who knows, maybe it won't be in a subclass somewhere - just make sure to put in the dispose pattern so we are covered.

The only real trick is ensuring the other methods on TermsEnum and its subclasses are exposed when implementing the IEnumerable on Terms (and anybody else using TermsEnum). As I mentioned before, I think overloading GetEnumerator() with a method that returns the TermsEnum type is the solution there. But, that probably doesn't take into account any subclasses of Terms that have a different type of iterator returned from their GetIterator() method than the abstract class TermsEnum.

Lastly, there's the observation that the while vs foreach examples you gave are not equivalent as the while version does not call Dispose().

That is just because TermsEnum doesn't implement IDisposable yet. Of course, that all changes when implementing IEnumerator...

And there are several places throughout the codebase where enumerators are utilized without a using block or calling Dispose(). The tests aren't such a big concern, but we really shouldn't be doing this in the production code. I am sure there are some places where you can put your own implementation with its own unmanaged resources and watch the resource leaks pile up because we aren't doing our job and calling Dispose(). The real trick is locating them. I really wish Microsoft would put a feature in Visual Studio that searches for code that should exist but doesn't .

Then again, I haven't been using Code Analysis. Maybe that (or Resharper?) can be used to quickly detect all of the Dispose() violations. If you can find them, maybe that is a PR that should go ahead of the refactoring.

It's odd for a type to be disposing it's "parent" (though I know some StreamWriter/Reader classes do, it's still odd/annoying).

Agreed - when using Dispose() in conjunction with the decorator pattern things get odd. But how does that apply to TermsEnum? Disposing a superclass is not quite the same thing (and is rather expected using the dispose pattern).

All very Tricky...

I think there's some combination of

  1. isolating the "item" type
  2. using the collection object behaviors from inside the loop
  3. implementing the enumerator such that it is ok about the cursor being changed from outside

It's going to be very hard to get the semantics of each type in a sane shape such that it's hard for a dev to get tripped up.

In Java the "item" type is usually the one that is returned from the Next() method. But the real issue is all of the "extras" that come along with these iterators and how to expose them without casting the result of GetEnumerator(). Keep in mind the "extras" (ala LINQ) are usually part of the IEnumerable not part of the IEnumerator. Maybe trying to expose the "extras" is the wrong approach altogether. Maybe those "extras" are crying out to be refactored into services of their own that are separate from the enumerator.

If you go that route, note that because we are a port we had to redefine what "maintainable" means. So, when refactoring a class into many, the typical approach I follow is to drop all of that code into the same code file that is being refactored (in the same order as much as possible). This will make it easy to figure out where that code is when upgrading to the next Lucene version. The standard approach of putting each type into a file on its own would be chaos for a port version upgrade.

But you seem to have a handle on things. If there is a solution to be found here, I am sure you will be able to find it. If you need a sounding board to run your ideas by, I am here.

Which is why this have been open for 5+ years

Yea, I bet the decision to wait until the "next release" was a decision that came with some regret. Unfortunately, Lucene.Net struggled to get over the hump of getting a 4.x release because the project grew by a factor of 10. We are almost there now, and the path to catch up to Lucene 6.x is now clear considering how much smaller the change set is between here and there than the monster change set between 3.x and 4.x.

by nightowl888

bongohrtech commented 7 years ago

Yeah, well, I thought, that's looks like it might be isolated, small. Ha!
I'll poke at it some more, see what turns up. But I may just punt and look for something else

I'm not sure I'm on the same wave length just yet...

"how does that apply to TermsEnum? Disposing a superclass is not quite the same thing"
But the enumerator is not a super class. It's a "child" of the TermsEnum (or whatever) created for a purpose. Disposing the child shouldn't dispose the "parent". I'm fairly sure that IEnumerator has Dispose so that things like listing files (managed resource) can be cleaned up. But still, it's about cleaning up resources used by the enumerator not by the parent collection type.
If you want to Dispose the collection type then that's a separate operation. There may be other behaviors available, appropriate on the collection type after you've looped some aspect of it.
Right ??

Also a bit confused around the "extras" topic. I don't see the problem. We'd get all the linq bits for free. You're right, they are generally defined on IEnumerable But that's fine. In your example of TermsEnum being IEnumerable then regardless of how the enumerator is factored you can still do foreach(var item in termsEnum.Where(b=>b.Length>100))

I was thinking by "extras" you meant "other things on the collection type". By my theory those are either: part of the item and exposed there; or an op you call in the collection object from inside the loop.

What am I missing?

by andypook

bongohrtech commented 7 years ago

Ahh...I think I now know where we are getting our wires crossed.

You have become a victim of the Java developers' incessant need to abbreviate everything (which annoys the crap out of me, but unfortunately fixing it is a losing battle).

Essentially, TermsEnum == IEnumerator, TermsEnum != IEnumerable. "Enum" could be interpreted as either one, hence the misunderstanding (hell, come to think of it, it could even be misinterpreted as an enumeration).

So, rather than loading up the IEnermerable with all kinds of useful stuff like LINQ does, our Lucene friends have loaded up a IEnumerator (or not - they didn't actually implement Java's analogue to this) with stuff that most IEnumerator s don't do. Which of course is leaving us scratching our heads.

So, when disposing TermsEnum, we are doing the standard IEnumerator.Dispose() stuff that the foreach loop does not some funky other relationship. The Terms class is essentially the IEnumerable> that we are enumerating over.

by nightowl888

bongohrtech commented 7 years ago

Yes, agreed. I think I wasn't quite using the right words to disambiguate.

So... take a breath... lots of Esomething words coming....

Yes, the Enum types have the characteristics of a c# IEnumerator (Next() etc).
But also represent the current item (via Term etc).
But also has additional "collection" type operations (SeekExact() etc) (are these part of what you were referring to as "extras"?).
It's this Enum class I was calling the "collection" type 'cos it kind of holds the collection state.

What I was proposing was to make the Enum also look like an IEnumerable so that you can foreach over it.
Which means it needs a GetEnumerable() to return an IEnumerator. This new enumerator is the EnumEnumerator type from in previous messages.

So, I'm taking an Enum, adding an IEnumerable facade, to return a new IEnumerable type which pokes the Enum to move forward.

It was the Dispose() of this new IEnumerator that I was referring to. Pretty convinced that this one should always be a no-op. Exiting the loop should not Dispose the Enum (ie the "collection") and as the new IEnumerator is just poking methods on the Enum, there's nothing for it to clean up.

Pretty sure I've got all the right E's in the right place

NB: some thought would be needed to define what the "item" class (representing Current) should look like in each case.
NB: Calling a Current changing method (ie one of the Seek...() methods) from inside a foreach would result in undefined behavior.

Phew... Did that make sense; fit with your view of the universe?

by andypook

bongohrtech commented 7 years ago

It's this Enum class I was calling the "collection" type 'cos it kind of holds the collection state.

What I was proposing was to make the Enum also look like an IEnumerable so that you can foreach over it.
Which means it needs a GetEnumerable() to return an IEnumerator. This new enumerator is the EnumEnumerator type from in previous messages.

So, I'm taking an Enum, adding an IEnumerable facade, to return a new IEnumerable type which pokes the Enum to move forward.

That is an interesting idea.

I am not sure if the state is actually being "iterated over" so much as it is being calculated on the fly. But then, IEnumerable is just an interface - it doesn't actually define any behavior. In fact, I had to redefine what the key and value ICollection s do in CharArrayMap's KeyCollection and ValueCollection, so maybe this is also possible.

> NB: some thought would be needed to define what the "item" class (representing Current) should look like in each case.
> NB: Calling a Current changing method (ie one of the Seek...() methods) from inside a foreach would result in undefined behavior.

So then we end up with some behavior that doesn't fit the mold.

I thought for a moment that it could be possible these SeekCeil and SeekExact methods are only defined here because there are no such thing as extension methods in Java, but since they are overridden in subclasses, that is probably not a valid assumption.

Speaking of which, I suggest you have a look at how they are used in the TermsEnum overloads in the codecs (for example [Lucene40TermVectorsReader](https://github.com/apache/lucenenet/blob/468199e3fa95c7e1f77b14e7f00aeaafd7c2f8b9/src/Lucene.Net/Codecs/Lucene40/Lucene40TermVectorsReader.cs#L446)) or [Lucene45DocValuesProducer.TermsEnumAnonymousInnerClassHelper](https://github.com/apache/lucenenet/blob/468199e3fa95c7e1f77b14e7f00aeaafd7c2f8b9/src/Lucene.Net/Codecs/Lucene45/Lucene45DocValuesProducer.cs#L1090).

As you can see, the complexity of these enums becomes quite high (lots of internal state), and it seems to me it would be simpler to port future changes from Lucene if the general structure of the "extras" (that is, the properties and Seek* methods) were left alone (or at least left intact and used by some other type of facade). The DocsAndPositionsEnum s are even more complex than this.

The only thing I was hoping to get out of this was the added ability to use a foreach loop on Terms (and possibly the DocIdSetIterator). Sure we could refactor the TermsEnum data structure to be more .NET-like, but then where would we be when we need to connect the dots on the next Lucene upgrade? I am not necessarily against it, but finding bugs in the codec code is no picnic, so we would need some clear way to ensure future codecs behavior can be mapped onto the new structure without causing too much pain. Keep in mind, the codecs change in every major (and nearly every minor) version.

Phew... Did that make sense; fit with your view of the universe?

Wow, do I sound so closed minded? I think I need to work on my delivery .

by nightowl888

bongohrtech commented 7 years ago

Wow, do I sound so closed minded?

Ha. No not at all more worried I just not seeing the whole picture.

not sure if the state is actually being "iterated over" ...

Well, in so far as Next moves a set of values one step forward, it can certainly be thought of that way. I think of these like a cursor over a sql table. You can foreach over the rows. How the data is actually retrieved is abstracted. It could be one at a time, batches, all of them, cached... At the foreach level you just call Next to inspect the values of the next item.

IEnumerable is just an interface - it doesn't actually define any behavior.

Right, this is one of those language syntax enabling things. You actually don't need to specify the interface. As long as the class implements GetEnumerator() then foreach will work.

So then we end up with some behavior that doesn't fit the mold.

True. But that's completely normal.
Imagine an enumerator over a Stream that pulls a line of text at a time. What would happen if you called myStream.Seek(...) inside the loop? Or a foreach over a Dictionary and call myDictionary.Add(...)? The behavior is to either throw (because the Dictionary.Enumerator protects against this) or some undefined behavior depending on which Stream/Reader or underlying storage is being used. Having overridden implementations with different behaviors is also to be expected. Many hours of fun to be had following Stream/TextReader etc. There's no guarantee that descendant classes will have similar behavior at he edges.

My theory was to enable foreach that would produce items representing the values of the Enum that change when its Next is called. This would be entirely an overlay/facade. All the existing members would remain presenting a dual api.
This way copy/pasting code from Java is still possible (modulo c# style properties). But the while style can be refactored to foreach. Or green field dev can be done in whichever style the dev likes.

All linq style stuff will work. I don't think the "extra" (Seek etc) are a problem. Most of them aren't composable as they return a single item or perfrom some op over the "collection". I mean for example you probably wouldn't call SeekExact from inside a foreach, in the same way you would use a dictionary indexer if you were foreach'ing over it. Or call Sort inside a foreach over a List.
Those that are composable (ie the various Docs methods) will work (assuming the DocsEnum is converted. So you could image terms.Docs(...).Where(d=>...).Select(...) working. But I would expect them to be at the beginning of the chain.
I guess you could image a few extra linq style extensions for special cases. But once you're into enumerable land it does cause some restrictions. Unless you want to get into rewriting expressions trees (runs away screaming).

You are quite right, trying to fit this in such that it's consistent across the class hierarchies is going to be a challenge for sure

I'm away this weekend but I'll try to put together a PR with a few examples/tests new week to see if it'll work out.

by andypook

bongohrtech commented 7 years ago

GitHub user AndyPook opened a pull request:

https://github.com/apache/lucenenet/pull/212

GH-469 - enumerables

See https://issues.apache.org/jira/projects/LUCENENET/issues/GH-469 for background.

This is an early stage proposal for implementing IEnumerable on various "Enum" types. To enable `foreach` and linq style operators.

This prototype provides `EnumEnumerator` as a helper. It can be used in various ways to make the implementation of `IEnumerable` more straightforward.

I have modified `IBytesRefIterator` and `TermsEnum` to add `IEnumerable` and therefore to all of it's descendants.

The test (such as it is) demonstrates retrieving a `TermsEnum` from a reader. Then using `foreach` and a simple `.Count()` linq operator.

If this approach is acceptable it ought to be adaptable to other Enum types

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AndyPook/lucenenet GH-469

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/lucenenet/pull/212.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #212


commit 008df498bcba7e9940ca97efc83d656ddb03b2c3
Author: Andy Pook
Date: 2017-08-14T21:32:45Z

simple editorconfig

ensures spaces for indent

commit 59646be0cf8fbc7e8ad64f395f451a11aca57fc6
Author: Andy Pook
Date: 2017-08-14T21:34:48Z

add EnumEnumerator helper

commit 526ddaf8c837f82cd141324e922372155914701b
Author: Andy Pook
Date: 2017-08-14T21:35:32Z

add IEnumerable to TermsEnum/IBytesRefArray

commit 456df8e8347ef67055b712722b50cc3b1bbf26cc
Author: Andy Pook
Date: 2017-08-14T21:36:54Z

add header, fix usings

commit 1a1c25da7cbcd112d12d6c418c669786cdf4088c
Author: Andy Pook
Date: 2017-08-14T22:25:15Z

add example test


by githubbot

bongohrtech commented 7 years ago

added PR https://github.com/apache/lucenenet/pull/212 which I am sure will need a lot of work. The intent at this point is to just demonstrate the approach

by andypook

bongohrtech commented 5 years ago

Removing fix version of 4.8. This is a general lucene.net task and we should not delay 4.8 release.

by laimis