OData / AspNetCoreOData

ASP.NET Core OData: A server library built upon ODataLib and ASP.NET Core
Other
447 stars 158 forks source link

Inefficient Pagination Handling in Expanded Entities ($expand ) in Microsoft.AspNetCore.OData #1274

Open andykelvinbrooks opened 3 weeks ago

andykelvinbrooks commented 3 weeks ago

Summary: When using the $expand query option in Microsoft.AspNetCore.OData, all data for the expanded entities is retrieved first, and then pagination is applied in memory. This leads to performance inefficiencies, especially with large datasets.

Steps to Reproduce:

  1. Create an OData controller with a navigation property that can be expanded.
  2. Query the endpoint with $expand and $top or $skip options.
  3. Observe that all data for the expanded entities is retrieved before pagination is applied.

Expected Behavior: Pagination should be applied at the database level when retrieving expanded entities to minimize data retrieval and processing overhead. Actual Behavior: All data for the expanded entities is retrieved first, and then pagination is applied in memory, leading to potential performance issues and high memory usage.

Additional Context: This issue becomes more significant with larger datasets where retrieving all data for expanded entities before applying pagination results in slow response times and high memory consumption. Optimizing this behavior to apply pagination at the database level would greatly enhance performance and efficiency.

Suggested Improvement: Investigate and implement a way to apply pagination ($top and $skip) to expanded entities directly in the database query, ensuring that only the required subset of data is retrieved.

public class MyDbContext : DbContext
{
    public DbSet<Order> Orders { get; set; }
    public DbSet<OrderItem> OrderItems { get; set; }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        modelBuilder.Entity<Order>()
            .HasMany(o => o.OrderItems)
            .WithOne(i => i.Order)
            .HasForeignKey(i => i.OrderId);
    }
}

public class Order
{
    public int OrderId { get; set; }
    public string CustomerName { get; set; }
    public ICollection<OrderItem> OrderItems { get; set; }
}

public class OrderItem
{
    public int OrderItemId { get; set; }
    public string ProductName { get; set; }
    public int Quantity { get; set; }
    public int OrderId { get; set; }
    public Order Order { get; set; }
}

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddDbContext<MyDbContext>(options =>
            options.UseInMemoryDatabase("MyDatabase"));

        services.AddControllers();

        services.AddOData(opt =>
            opt.AddModel("odata", GetEdmModel())
               .Select()
               .Expand()
               .Filter()
               .OrderBy()
               .Count()
               .SetMaxTop(100)
               .SkipToken());
    }

    public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
    {
        app.UseRouting();

        app.UseEndpoints(endpoints =>
        {
            endpoints.MapControllers();
            endpoints.Select().Expand().Filter().OrderBy().Count().MaxTop(100);
            endpoints.MapODataRoute("odata", "odata", GetEdmModel());
        });
    }

    private static IEdmModel GetEdmModel()
    {
        var builder = new ODataConventionModelBuilder();
        builder.EntitySet<Order>("Orders");
        builder.EntitySet<OrderItem>("OrderItems");
        return builder.GetEdmModel();
    }
}

public class OrdersController : ODataController
{
    private readonly MyDbContext _context;

    public OrdersController(MyDbContext context)
    {
        _context = context;
    }

    [EnableQuery]
    public IActionResult Get()
    {
        return Ok(_context.Orders);
    }

    [EnableQuery]
    public IActionResult Get(int key)
    {
        var order = _context.Orders
            .Include(o => o.OrderItems)
            .FirstOrDefault(o => o.OrderId == key);

        if (order == null)
        {
            return NotFound();
        }

        return Ok(order);
    }
}

GET /odata/Orders?$expand=OrderItems&$top=10&$skip=5
ElizabethOkerio commented 3 weeks ago

@andykelvinbrooks I tried this out and this is the IQueryable that OData generates after applying the $top and $skip:

DbSet<Order>()
    .OrderBy($it => $it.OrderId)
    .Select($it => new SelectAllAndExpand<Order>{ 
        Model = TypedLinqParameterContainer<IEdmModel>.TypedProperty, 
        Instance = $it, 
        UseInstanceForProperties = True, 
        Container = new NamedProperty<IEnumerable<SelectAll<OrderItem>>>{ 
            Name = "OrderItems", 
            Value = $it.OrderItems
                .Select($it => new SelectAll<OrderItem>{ 
                    Model = TypedLinqParameterContainer<IEdmModel>.TypedProperty, 
                    Instance = $it, 
                    UseInstanceForProperties = True 
                }
                ) 
        }

    }
    )
    .Skip(TypedLinqParameterContainer<int>.TypedProperty)
    .Take(TypedLinqParameterContainer<int>.TypedProperty)

And this is the query that EfCore generates:

DECLARE @__TypedProperty_2 int = 1;
DECLARE @__TypedProperty_3 int = 2;

SELECT [t].[OrderId], [t].[CustomerName], [o0].[OrderItemId], [o0].[OrderId], [o0].[ProductName], [o0].[Quantity]
FROM (
    SELECT [o].[OrderId], [o].[CustomerName]
    FROM [Orders] AS [o]
    ORDER BY [o].[OrderId]
    OFFSET @__TypedProperty_2 ROWS FETCH NEXT @__TypedProperty_3 ROWS ONLY
) AS [t]
LEFT JOIN [OrderItems] AS [o0] ON [t].[OrderId] = [o0].[OrderId]
ORDER BY [t].[OrderId]

This shows that pagination is applied at the database level. I'm not sure whether the reason you're getting this is because you're using an in-memory database.

Also url GET /odata/Orders?$expand=OrderItems&$top=10&$skip=5 this should apply $top and $skip to the Orders entity.

You could try server side paging using EnableQuery(PageSize=2) or something like that on your controller action methods.

Dhananjaya-Reddy commented 3 weeks ago

Attached sample project which contains local DB with limited sample records.

Sample_OData.zip

When we get the data from http://localhost:5209/invoices?$expand=invoicelines

Hardcoded pagesize as 50 in example

Generated query is :-

Microsoft.EntityFrameworkCore.Database.Command: Information: Executed DbCommand (1ms) [Parameters=[@TypedProperty_3='51', @__TypedProperty_1='51'], CommandType='Text', CommandTimeout='600'] SELECT [t].[InvoiceId] ,[t].[CustomerName] ,[t].[InvoiceDate] ,[t].[TotalAmount] ,[t0].[InvoiceLineId] ,[t0].[InvoiceId] ,[t0].[ProductName] ,[t0].[Quantity] ,[t0].[UnitPrice] FROM ( SELECT TOP (@TypedProperty_3) [v].[InvoiceId] ,[v].[CustomerName] ,[v].[InvoiceDate] ,[v].[TotalAmount] FROM [VIEW_Core_Invoices_V4_MS] AS [v] ORDER BY [v].[InvoiceId] ) AS [t] LEFT JOIN ( SELECT [t1].[InvoiceLineId] ,[t1].[InvoiceId] ,[t1].[ProductName] ,[t1].[Quantity] ,[t1].[UnitPrice] FROM ( SELECT [v0].[InvoiceLineId] ,[v0].[InvoiceId] ,[v0].[ProductName] ,[v0].[Quantity] ,[v0].[UnitPrice] ,ROW_NUMBER() OVER ( PARTITION BY [v0].[InvoiceId] ORDER BY [v0].[InvoiceLineId] ) AS [row] FROM [VIEW_InvoiceLineItem_2_MS] AS [v0] ) AS [t1] WHERE [t1].[row] <= @__TypedProperty_1 ) AS [t0] ON [t].[InvoiceId] = [t0].[InvoiceId] ORDER BY [t].[InvoiceId] ,[t0].[InvoiceId] ,[t0].[InvoiceLineId]

Generated query which I marked as bold(Inner Query) is fetching all the records from child(InvoiceLines) . We have a huge performance in real time as there are millions of records InvoiceLines in real time.

@ElizabethOkerio Kindly let me know your thoughts

ElizabethOkerio commented 2 weeks ago

@Dhananjaya-Reddy Thanks for the repro. I'll take a look and get back to you.

Dhananjaya-Reddy commented 1 week ago

Hi @ElizabethOkerio , Any updates on this?

ElizabethOkerio commented 6 days ago

@Dhananjaya-Reddy I looked at your repro and the shared query above. From the shared query above, it's not true that the inner query fetches all records from child (InvoiceLines). If you look at the query, this part WHERE [t1].[row] <= @__TypedProperty_1 ensures that only 51 rows of the InvoiceLines are retrieved. It is important to note that the PageSize that you set on the EnableQuery - [EnableQuery(PageSize = 50)]is applied to top-level elements and all their nested elements. This means that the response is expected to return a page of 51 invoices and each invoice will have 51 InvoiceLines. To test if this is how this works, you can try with a smaller Pagesize and see how this works. If the Pagesize is smaller than the elements in the data store then you'll get a NextLink that you can use to retrieve the other elements. It is also clear that pagination happens at the server side.

The only issue I'm seeing is with the calculation of row numbers. This part to be exact:


ROW_NUMBER() OVER (
PARTITION BY [v0].[InvoiceId] ORDER BY [v0].[InvoiceLineId]
) AS [row]
FROM [VIEW_InvoiceLineItem_2_MS] AS [v0]
) AS [t1]

If this is being done for millions of rows even if not fetching all of them can be computationally intensive and might affect performance. I think proper indexing on the columns used in the PARTITION BY and ORDER BY clauses can significantly improve the performance of the ROW_NUMBER function.

Dhananjaya-Reddy commented 5 days ago

Hi @ElizabethOkerio , Thank you for your thorough analysis of the query and the repro I shared. Your insights are valuable, and I appreciate the time you've taken to examine the issue in detail. I'd like to address a few points and share some additional thoughts:

Your observations about the query limitation and PageSize behavior are spot-on. This helps clarify how the pagination is intended to work.

This is the core issue I'm worried about. As you pointed out, even though we're not fetching all records, the ROW_NUMBER() function is still processing all rows in the view before any filtering occurs. For a view with millions of rows, this could be a significant performance bottleneck.

The most critical optimization we need to implement is applying filters before the ROW_NUMBER() calculation. This would significantly reduce the number of rows processed, especially for large datasets. Is there a way to restructure the query or modify the OData implementation that don't require major changes to the overall OData implementation to achieve this?

Thank you again for your insightful analysis, and I look forward to your thoughts on this critical optimization.