chriseldredge / Lucene.Net.Linq

LINQ provider to run native queries on a Lucene.Net index

Computed Field #79

Closed ChristopherHaws closed 9 years ago

ChristopherHaws commented 9 years ago

I am looking for a way to define a computed field against an index. I have looked through the classes currently available and there doesn't appear to be any way to do this. I have written two samples to detail the problem and my proposed solution. If there is a better way to handle this, please let me know! :)

Problem: This throws a KeyNotFoundException because the Status field was not added to the queryable fields by the ReflectionDocumentMapper.

void Main()
{
    using(var provider = new LuceneDataProvider(new RAMDirectory(), Version.LUCENE_43))
    {
        using(var session = provider.OpenSession<User>())
        {
            session.Add(
                new User() { Username = "ActiveUser1",      ActiveFrom = DateTime.MinValue,             ActiveUntil = DateTime.MaxValue },
                new User() { Username = "ActiveUser2",      ActiveFrom = DateTime.UtcNow.AddDays(-1),   ActiveUntil = DateTime.MaxValue },
                new User() { Username = "ActiveUser3",      ActiveFrom = DateTime.MinValue,             ActiveUntil = DateTime.UtcNow.AddDays(1) },
                new User() { Username = "InactiveUser1",    ActiveFrom = DateTime.UtcNow.AddDays(1),    ActiveUntil = DateTime.MaxValue },
                new User() { Username = "InactiveUser2",    ActiveFrom = DateTime.MinValue,             ActiveUntil = DateTime.UtcNow.AddDays(-1) }
            );
        }

        var users = provider.AsQueryable<User>();

        var activeUsers = users
            .Where (x => x.Status == "Active")
            .Dump();
    }
}

public class User
{
    [Field("Username", Key = true, Store = StoreMode.Yes)]
    public string Username { get; set; }

    [IgnoreField]
    public string Status
    {
        get
        {
            return this.ActiveFrom <= DateTime.UtcNow && this.ActiveUntil >= DateTime.UtcNow
                ? "Active"
                : "Inactive";
        }
    }

    [NumericField("ActiveFrom", Converter = typeof(DateTimeToTicksConverter))]
    public DateTime ActiveFrom { get; set; }

    [NumericField("ActiveUntil", Converter = typeof(DateTimeToTicksConverter))]
    public DateTime ActiveUntil { get; set; }
}

Proposal: I think a new type of field should be added to the library so that users can query and display a value as if it were an actual field, without that value being indexed. This would require an attribute called ComputedField that references a class implementing the behavior of that field.

void Main()
{
    using(var provider = new LuceneDataProvider(new RAMDirectory(), Version.LUCENE_43))
    {
        using(var session = provider.OpenSession<User>())
        {
            session.Add(
                new User() { Username = "ActiveUser1",      ActiveFrom = DateTime.MinValue,             ActiveUntil = DateTime.MaxValue },
                new User() { Username = "ActiveUser2",      ActiveFrom = DateTime.UtcNow.AddDays(-1),   ActiveUntil = DateTime.MaxValue },
                new User() { Username = "ActiveUser3",      ActiveFrom = DateTime.MinValue,             ActiveUntil = DateTime.UtcNow.AddDays(1) },
                new User() { Username = "InactiveUser1",    ActiveFrom = DateTime.UtcNow.AddDays(1),    ActiveUntil = DateTime.MaxValue },
                new User() { Username = "InactiveUser2",    ActiveFrom = DateTime.MinValue,             ActiveUntil = DateTime.UtcNow.AddDays(-1) }
            );
        }

        var users = provider.AsQueryable<User>();

        var activeUsers = users
            //.Where (x => x.Status == "Active")
            .Dump();
    }
}

public class User
{
    [Field("Username", Key = true, Store = StoreMode.Yes)]
    public string Username { get; set; }

    [ComputedField(typeof(StatusField))]
    public string Status { get; set; }

    [NumericField("ActiveFrom", Converter = typeof(DateTimeToTicksConverter))]
    public DateTime ActiveFrom { get; set; }

    [NumericField("ActiveUntil", Converter = typeof(DateTimeToTicksConverter))]
    public DateTime ActiveUntil { get; set; }
}

[AttributeUsage(AttributeTargets.Property, AllowMultiple = false)]
public class ComputedFieldAttribute : System.Attribute
{
    public ComputedFieldAttribute(Type field)
    {
    }
}

public class StatusField : IComputedField<string, User>
{
    public string GetValue(User record)
    {
        return record.ActiveFrom <= DateTime.UtcNow && record.ActiveUntil >= DateTime.UtcNow
            ? "Active"
            : "Inactive";
    }

    public IEnumerable<User> FilterResults(string value, IEnumerable<User> records)
    {
        // A user is inactive when either bound fails, so the second branch uses ||
        // (the negation of the "Active" condition), not &&.
        return value == "Active"
            ? records.Where(x => x.ActiveFrom <= DateTime.UtcNow && x.ActiveUntil >= DateTime.UtcNow)
            : records.Where(x => x.ActiveFrom > DateTime.UtcNow || x.ActiveUntil < DateTime.UtcNow);
    }
}

public interface IComputedField<TField, TRecord>
{
    TField GetValue(TRecord record);

    IEnumerable<TRecord> FilterResults(TField value, IEnumerable<TRecord> records);
}

chriseldredge commented 9 years ago

The problem I see with this approach is that there is no way to translate the computed field into a query that Lucene.Net can execute natively. To apply this filter, each document would have to be mapped onto an object, the computed field would then be evaluated, and only then could the predicate be applied.

In effect, what this would be doing is:

provider.AsQueryable<User>().ToList().Where(u => u.Status == "Active")

Note the ToList() which makes this type of query possible (but not efficient) already.

That is, every document is retrieved and mapped regardless of whether it is ultimately excluded by the Where expression. This will not perform well for an arbitrarily large number of documents.
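To make that cost concrete, here is a minimal sketch that does not use Lucene.Net at all. `ReadAll` is a hypothetical stand-in for an index that maps one `User` per document and counts how many were materialized; the data distribution (one active user in every hundred) is likewise an assumption chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class User
{
    public string Username { get; set; }
    public DateTime ActiveFrom { get; set; }
    public DateTime ActiveUntil { get; set; }

    // Computed in memory only; nothing like this exists in the index.
    public string Status =>
        ActiveFrom <= DateTime.UtcNow && ActiveUntil >= DateTime.UtcNow ? "Active" : "Inactive";
}

public static class RetrievalCostDemo
{
    // Hypothetical stand-in for an index: yields one mapped User per "document"
    // and reports each mapping through the callback.
    public static IEnumerable<User> ReadAll(int total, Action onMapped)
    {
        for (var i = 0; i < total; i++)
        {
            onMapped();
            yield return new User
            {
                Username = "User" + i,
                ActiveFrom = DateTime.MinValue,
                // Only every 100th user is still active.
                ActiveUntil = i % 100 == 0 ? DateTime.MaxValue : DateTime.MinValue
            };
        }
    }

    public static (int Mapped, int Matched) Run(int total)
    {
        var mapped = 0;
        var matched = ReadAll(total, () => mapped++)
            .ToList()                          // forces every document to be mapped
            .Where(u => u.Status == "Active")  // then filters in memory
            .Count();
        return (mapped, matched);
    }
}
```

With `RetrievalCostDemo.Run(10000)`, all 10,000 users are mapped in order to keep 100 of them. A filter that the index can evaluate itself avoids materializing the other 9,900.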

For this particular example, a more efficient approach would be to build a query expression based on the business logic instead of trying to embed the business logic inside the query.

ChristopherHaws commented 9 years ago

That was the purpose of the FilterResults method. Essentially this field would be an ignored field that can be filtered and viewed using the methods in the ComputedField class. In the FilterResults method, Status is never used, but the class knows how to filter based on actual fields in the Lucene index.
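Whether such a filter can reach the index depends on the type it is written against, which the following sketch tries to illustrate. `FilterShapes` and both method names are hypothetical; only standard `System.Linq` behavior is assumed, and no Lucene.Net.Linq API is involved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class User
{
    public string Username { get; set; }
    public DateTime ActiveFrom { get; set; }
    public DateTime ActiveUntil { get; set; }
}

public static class FilterShapes
{
    // IEnumerable version: binds to Enumerable.Where, so the predicate is a compiled
    // delegate that runs in memory over whatever the source yields.
    public static IEnumerable<User> ActiveInMemory(IEnumerable<User> records, DateTime now)
    {
        return records.Where(x => x.ActiveFrom <= now && x.ActiveUntil >= now);
    }

    // IQueryable version: binds to Queryable.Where, so the predicate is kept as an
    // expression tree that a query provider can inspect and translate.
    public static IQueryable<User> ActiveAsQuery(IQueryable<User> records, DateTime now)
    {
        return records.Where(x => x.ActiveFrom <= now && x.ActiveUntil >= now);
    }
}
```

The two method bodies are textually identical, yet only the second leaves an expression mentioning `ActiveFrom` and `ActiveUntil` on the resulting query's `Expression` property. A `FilterResults` method declared over `IEnumerable<User>`, as in the proposal, would therefore behave like the first form even though it only touches indexed fields.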

chriseldredge commented 9 years ago

Consider this alternative approach:

void Main()
{
    var users = provider.AsQueryable<User>();

    var activeUsers = users
        .ToList()
        .Where (x => x.IsActive)
        .Dump();
}

public class User
{
    [Field("Username", Key = true, Store = StoreMode.Yes)]
    public string Username { get; set; }

    [IgnoreField]
    public bool IsActive
    {
        get
        {
            return this.ActiveFrom <= DateTime.UtcNow && this.ActiveUntil >= DateTime.UtcNow;
        }
    }

    [NumericField("ActiveFrom", Converter = typeof(DateTimeToTicksConverter))]
    public DateTime ActiveFrom { get; set; }

    [NumericField("ActiveUntil", Converter = typeof(DateTimeToTicksConverter))]
    public DateTime ActiveUntil { get; set; }
}

Since the underlying Lucene index will not store any field named Status or IsActive, it doesn't make sense to me to put this type of filtering capability into Lucene.Net.Linq.

You could write your own extension method along the lines of:

internal static class EnumerableExtensions
{
    public static IEnumerable<T> FilterAfterRetrieval<T>(this IEnumerable<T> items, Func<T, bool> filter)
    {
        foreach (var i in items)
        {
            if (filter(i)) yield return i;
        }
    }
}

I would reiterate that this will not perform well when you have an arbitrarily large number of items to filter, and you are better off designing your program in such a way that the query can be executed natively by Lucene.Net.

chriseldredge commented 9 years ago

Consider this alternative, which builds a query that can be executed natively by Lucene.Net and would perform better than the proposed solution:

public class User
{
    [Field("Username", Key = true, Store = StoreMode.Yes)]
    public string Username { get; set; }

    [NumericField("ActiveFrom", Converter = typeof(DateTimeToTicksConverter))]
    public DateTime ActiveFrom { get; set; }

    [NumericField("ActiveUntil", Converter = typeof(DateTimeToTicksConverter))]
    public DateTime ActiveUntil { get; set; }
}

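internal static class QueryableUserExtensions
{
    // Takes IQueryable<User> so the where clause binds to Queryable.Where and stays
    // an expression tree that the LINQ provider can translate.
    public static IQueryable<User> WhereUserIsActive(this IQueryable<User> users)
    {
        var now = DateTime.UtcNow;

        return from u in users
            where u.ActiveFrom <= now && u.ActiveUntil >= now
            select u;
    }
}

public class Program
{
    public void Main()
    {
        // Placeholder source; in practice this would be provider.AsQueryable<User>().
        var users = Enumerable.Empty<User>().AsQueryable();

        var active = users.WhereUserIsActive().ToList();
    }
}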

ChristopherHaws commented 9 years ago

This could work when I need to get these values in code. However, I am using OData to return these entities as queryables to a web frontend, so I don't have much control over how the queries get executed. Would it make sense to add some kind of interface, similar to ReflectionFieldMapper, where we could manually create the Lucene queries for specific fields?

chriseldredge commented 9 years ago

Are you using the older WCF Data Services or the newer WebApi OData? You will probably have better luck with the latter in terms of controlling and customizing how queries are translated and filtered.
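One way to picture that kind of customization, independent of any particular OData stack, is to rewrite the expression tree before it reaches the LINQ provider. The sketch below uses the standard `System.Linq.Expressions.ExpressionVisitor`; `StatusRewriter` is a hypothetical helper and not part of Lucene.Net.Linq. It assumes the comparison is written as `Status == <string constant>` and that fixing "now" once per query is acceptable.

```csharp
using System;
using System.Linq.Expressions;

public class User
{
    public string Username { get; set; }
    public DateTime ActiveFrom { get; set; }
    public DateTime ActiveUntil { get; set; }

    // Not indexed; computed from the two date fields.
    public string Status =>
        ActiveFrom <= DateTime.UtcNow && ActiveUntil >= DateTime.UtcNow ? "Active" : "Inactive";
}

// Hypothetical helper: replaces u.Status == "Active" / "Inactive" with an
// equivalent range check on the ActiveFrom and ActiveUntil properties.
public class StatusRewriter : ExpressionVisitor
{
    private readonly DateTime now;

    public StatusRewriter(DateTime now)
    {
        this.now = now;
    }

    protected override Expression VisitBinary(BinaryExpression node)
    {
        var member = node.Left as MemberExpression;
        var constant = node.Right as ConstantExpression;

        var isStatusComparison =
            node.NodeType == ExpressionType.Equal
            && member != null
            && member.Member.DeclaringType == typeof(User)
            && member.Member.Name == nameof(User.Status)
            && constant != null
            && constant.Value is string;

        if (!isStatusComparison)
        {
            return base.VisitBinary(node);
        }

        var instance = member.Expression;
        var active = Expression.AndAlso(
            Expression.LessThanOrEqual(
                Expression.Property(instance, nameof(User.ActiveFrom)),
                Expression.Constant(now)),
            Expression.GreaterThanOrEqual(
                Expression.Property(instance, nameof(User.ActiveUntil)),
                Expression.Constant(now)));

        var value = (string)constant.Value;
        if (value == "Active") return active;
        if (value == "Inactive") return Expression.Not(active);

        // Status only ever takes the two values above, so nothing else can match.
        return Expression.Constant(false);
    }
}
```

Passing `u => u.Status == "Active"` through `new StatusRewriter(DateTime.UtcNow).Visit(...)` yields a lambda that references only `ActiveFrom` and `ActiveUntil`. Whether a given provider or OData layer exposes a hook for running such a visitor is a separate question this sketch does not answer, and values captured from variables rather than written as literals would need extra handling.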

ChristopherHaws commented 9 years ago

Please take a look at the above pull request. I included a sample in the test project called ComputedFieldSample.

chriseldredge commented 9 years ago

For the reasons cited in my previous comment, I'm not going to merge this PR.

This library is intended to translate LINQ queries to Lucene so that they can be executed natively.