Proposal: Introduce DataLoader for Ballerina GraphQL

MohamedSabthar commented 1 year ago

Summary

DataLoader is a versatile tool used for accessing various remote data sources in GraphQL. Within the realm of GraphQL, DataLoader is extensively employed to address the N+1 problem. The aim of this proposal is to incorporate a DataLoader functionality into the Ballerina GraphQL package.

Goals

Implement DataLoader as a sub-module of the Ballerina GraphQL package.

Motivation

The N+1 problem

The N+1 problem can be illustrated with a scenario involving authors and their books. Imagine a book catalog application that displays a list of authors and their respective books. With the N+1 problem, retrieving this data requires one initial query to fetch the N authors, followed by a separate query for each author to retrieve their books (N queries, one per author).

This results in N+1 queries being executed, where N represents the number of authors, leading to increased overhead and potential performance issues. The following GraphQL book catalog application, written in Ballerina, is susceptible to the N+1 problem:

import ballerina/graphql;
import ballerina/sql;
import ballerina/io;
import ballerinax/java.jdbc;
import ballerinax/mysql.driver as _;

service on new graphql:Listener(9090) {
    resource function get authors() returns Author[]|error {
        var query = sql:queryConcat(`SELECT * FROM authors`);
        io:println(query);
        stream<AuthorRow, sql:Error?> authorStream = dbClient->query(query);
        return from AuthorRow authorRow in authorStream
            select new (authorRow);
    }
}

isolated distinct service class Author {
    private final readonly & AuthorRow author;

    isolated function init(AuthorRow author) {
        self.author = author.cloneReadOnly();
    }

    isolated resource function get name() returns string {
        return self.author.name;
    }

    isolated resource function get books() returns Book[]|error {
        int authorId = self.author.id;
        var query = sql:queryConcat(`SELECT * FROM books WHERE author = ${authorId}`);
        io:println(query);
        stream<BookRow, sql:Error?> bookStream = dbClient->query(query);
        return from BookRow bookRow in bookStream
            select new Book(bookRow);
    }
}

isolated distinct service class Book {
    private final readonly & BookRow book;

    isolated function init(BookRow book) {
        self.book = book.cloneReadOnly();
    }

    isolated resource function get id() returns int {
        return self.book.id;
    }

    isolated resource function get title() returns string {
        return self.book.title;
    }
}

final jdbc:Client dbClient = check new ("jdbc:mysql://localhost:3306/mydatabase", "root", "password");

public type AuthorRow record {
    int id;
    string name;
};

public type BookRow record {
    int id;
    string title;
    // ID of the owning author (the `author` column used in the queries)
    int author;
};

Executing the query

{
  authors {
    name
    books {
      title
    }
  }
}

on the above service will print the following SQL queries in the terminal

SELECT * FROM authors
SELECT * FROM books WHERE author = 10
SELECT * FROM books WHERE author = 9
SELECT * FROM books WHERE author = 8
SELECT * FROM books WHERE author = 7
SELECT * FROM books WHERE author = 6
SELECT * FROM books WHERE author = 5
SELECT * FROM books WHERE author = 4
SELECT * FROM books WHERE author = 3
SELECT * FROM books WHERE author = 2
SELECT * FROM books WHERE author = 1

Here, the first query returns 10 authors, and then a separate query is executed for each author to obtain the book details. This results in a total of 11 queries, which is an inefficient way to query the database. The DataLoader allows us to overcome this problem.

DataLoader

The DataLoader is the solution found by the original developers of the GraphQL spec. The primary purpose of DataLoader is to optimize data fetching and mitigate performance issues, especially the N+1 problem commonly encountered in GraphQL APIs. It achieves this by batching and caching data requests, reducing the number of queries sent to the underlying data sources. DataLoader helps minimize unnecessary overhead and improves the overall efficiency and response time of data retrieval operations.

Success Metrics

In almost all GraphQL implementations, the DataLoader is a major requirement. Since the Ballerina GraphQL package is now spec-compliant, we are looking for ways to improve the user experience in the Ballerina GraphQL package. Implementing a DataLoader in Ballerina will improve the user experience drastically.

Description

The DataLoader batches and caches operations for data fetchers from different data sources. The DataLoader requires users to provide a batch function that accepts an array of keys as input and retrieves the corresponding array of values for those keys.

API

DataLoader object

This object defines the public APIs accessible to users.

public type DataLoader isolated object {
   # Collects a key to perform a batch operation at a later time.
   public isolated function load(anydata key);

   # Retrieves the result for a particular key.
   public isolated function get(anydata key, typedesc<anydata> t = <>) returns t|error;

   # Executes the user-defined batch function.
   public isolated function dispatch();
};

DefaultDataLoader class

This class provides a default implementation for the DataLoader.

isolated class DefaultDataLoader {
    *DataLoader;

    private final table<Key> key(key) keys = table [];
    private table<Result> key(key) resultTable = table [];
    private final (isolated function (readonly & anydata[] keys) returns anydata[]|error) batchLoadFunction;

    public isolated function init(isolated function (readonly & anydata[] keys) returns anydata[]|error batchLoadFunction) {
        self.batchLoadFunction = batchLoadFunction;
    }

    // … implementations of load, get and dispatch methods
}

type Result record {|
    readonly anydata key;
    anydata|error value; 
|};

The DefaultDataLoader class is an implementation of the DataLoader with the following characteristics:

- Inherits from DataLoader.
- Maintains a key table to collect keys for batch execution.
- Stores/caches results in a resultTable.
- Requires an isolated function batchLoadFunction to be provided during initialization.

`init` method

The init method instantiates the DefaultDataLoader and accepts a batchLoadFunction function pointer as a parameter. The batchLoadFunction function pointer has the following type:

isolated function (readonly & anydata[] keys) returns anydata[]|error

Users are expected to define the logic for the batchLoadFunction, which handles the batching of operations. The batchLoadFunction should return an array of anydata where each element corresponds to a key in the input keys array upon successful execution.
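
As a minimal sketch of that contract, the hypothetical batch function below resolves keys against an in-memory map instead of a database. The names `batchTitles` and `titlesById` are illustrative assumptions and not part of the proposed API; the only point is that the returned array holds one element per input key, in the same order as the keys.

```ballerina
import ballerina/io;

// Hypothetical in-memory data standing in for a remote data source.
final readonly & map<string> titlesById = {"1": "Dune", "2": "Emma"};

// Matches the proposed batchLoadFunction type: one result per key, in key order.
isolated function batchTitles(readonly & anydata[] keys) returns anydata[]|error {
    string[] titles = [];
    foreach anydata id in keys {
        // Keys without a match still get a slot so positions stay aligned.
        titles.push(titlesById[id.toString()] ?: "");
    }
    return titles;
}

public function main() returns error? {
    anydata[] titles = check batchTitles([2, 1, 3]);
    io:println(titles);
}
```

Returning a placeholder (here an empty string) for unknown keys is one possible choice; what matters is that the result array lines up position by position with the input keys.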

`load` method

The load method takes an anydata key parameter and adds it to the key table for batch execution. If a result is already cached for the given key in the result table, the key will not be added to the key table again.

`get` method

The get method takes an anydata key as a parameter and retrieves the associated value by looking up the result in the result table. If a result is found for the given key, this method attempts to perform data binding and returns the result. If a result cannot be found or data binding fails, an error is returned.

`dispatch` method

The dispatch method invokes the user-defined batchLoadFunction. It passes the collected keys as an input array to the batchLoadFunction, retrieves the result array, and stores the key-to-value mapping in the resultTable.
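
The interplay of load, dispatch and get described above can be sketched roughly as follows. This is a simplified illustration, not the proposed DefaultDataLoader: the class `SimpleLoader`, the type `BatchFunction` and the function `batchLengths` are hypothetical names, the class is not isolated, and plain arrays and a map (keyed by the string form of each key) stand in for the key and result tables.

```ballerina
import ballerina/io;

type BatchFunction function (anydata[] keys) returns anydata[]|error;

// Simplified, non-isolated loader (assumption: keys are distinguishable by toString()).
class SimpleLoader {
    private anydata[] pendingKeys = [];
    private map<anydata> results = {};
    private final BatchFunction batchFn;

    function init(BatchFunction batchFn) {
        self.batchFn = batchFn;
    }

    // Collects a key unless it is already cached or already pending.
    function load(anydata key) {
        if self.results.hasKey(key.toString()) || self.pendingKeys.indexOf(key) is int {
            return;
        }
        self.pendingKeys.push(key);
    }

    // Calls the batch function once and caches each value under its key.
    function dispatch() returns error? {
        if self.pendingKeys.length() == 0 {
            return;
        }
        BatchFunction batchFn = self.batchFn;
        anydata[] values = check batchFn(self.pendingKeys);
        if values.length() != self.pendingKeys.length() {
            return error("batch function must return one value per key");
        }
        foreach int i in 0 ..< values.length() {
            self.results[self.pendingKeys[i].toString()] = values[i];
        }
        self.pendingKeys = [];
    }

    // Looks up the cached value for a key.
    function get(anydata key) returns anydata|error {
        string cacheKey = key.toString();
        if !self.results.hasKey(cacheKey) {
            return error("no result for key " + cacheKey);
        }
        return self.results.get(cacheKey);
    }
}

// Hypothetical batch function: maps each key to the length of its string form.
function batchLengths(anydata[] keys) returns anydata[]|error {
    io:println("batch call for ", keys.length(), " keys");
    int[] lengths = [];
    foreach anydata k in keys {
        lengths.push(k.toString().length());
    }
    return lengths;
}

public function main() returns error? {
    SimpleLoader loader = new (batchLengths);
    loader.load("ab");
    loader.load("abc");
    loader.load("ab"); // duplicate key, not collected again
    check loader.dispatch();
    io:println(check loader.get("abc"));
}
```

The sketch omits the data binding step that the proposed get method performs via its typedesc parameter.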

Requirements for Engaging DataLoader in the GraphQL Module

To integrate the DataLoader with the GraphQL module, users need to follow these three steps:

1. Identify the resource method (GraphQL field) that requires the use of the DataLoader. Then, add a new parameter map<dataloader:DataLoader> to its parameter list.
2. Define a matching remote/resource method called loadXXX, where XXX represents the Pascal-cased name of the GraphQL field identified in the previous step. This method may include all or some of the required parameters from the GraphQL field, along with the map<dataloader:DataLoader> parameter. This function is executed as a prefetch step before executing the corresponding resource method of the GraphQL field. (Note that the loadXXX method and the XXX method should either have the same resource accessor or both be remote methods.)
3. Annotate the loadXXX method written in step two with the @dataloader:Loader annotation and pass the required configuration. This annotation helps avoid adding loadXXX as a field in the GraphQL schema and also provides the DataLoader configuration.

`Loader` annotation

# Provides a set of configurations for the load resource method.
public type LoaderConfig record {|
      # Facilitates a connection between a data loader key and a batch function. 
      # The data loader key enables the reuse of the same data loader across resolvers
      map<isolated function (readonly & anydata[] keys) returns anydata[]|error> batchFunctions;
|};

# The annotation to configure the load resource method with a DataLoader
public annotation LoaderConfig Loader on object function;

The following section demonstrates the usage of DataLoader in Ballerina GraphQL.

Modifying the Book Catalog Application to Use DataLoader

In the previous Book Catalog Application example, the query SELECT * FROM books WHERE author = ${authorId} was executed once for each of the N = 10 authors. To batch these database calls into a single request, we need to use a DataLoader at the books field. The following code block demonstrates the changes made to the books field and the Author service class.

import ballerina/graphql.dataloader;

isolated distinct service class Author {
    private final readonly & AuthorRow author;

    isolated function init(AuthorRow author) {
        self.author = author.cloneReadOnly();
    }

    isolated resource function get name() returns string {
        return self.author.name;
    }

    // 1. Add a map<dataloader:DataLoader> parameter to its parameter list
    isolated resource function get books(map<dataloader:DataLoader> loaders) returns Book[]|error {
        dataloader:DataLoader bookLoader = loaders.get("bookLoader");
        BookRow[] bookrows = check bookLoader.get(self.author.id); // get the value from DataLoader for the key
        return from BookRow bookRow in bookrows
            select new Book(bookRow);
    }

    // 3. add dataloader:Loader annotation to the loadXXX method.
    @dataloader:Loader {
        batchFunctions: {"bookLoader": bookLoaderFunction}
    }
    // 2. create a loadXXX method
    isolated resource function get loadBooks(map<dataloader:DataLoader> loaders) {
        dataloader:DataLoader bookLoader = loaders.get("bookLoader");
        bookLoader.load(self.author.id); // pass the key so it can be collected and batched later
    }

}

// User written code to batch the books
isolated function bookLoaderFunction(readonly & anydata[] ids) returns BookRow[][]|error {
    readonly & int[] keys = <readonly & int[]>ids;
    var query = sql:queryConcat(`SELECT * FROM books WHERE author IN (`, sql:arrayFlattenQuery(keys), `)`);
    io:println(query);
    stream<BookRow, sql:Error?> bookStream = dbClient->query(query);
    map<BookRow[]> authorsBooks = {};
    // Propagate stream errors to the caller instead of panicking
    check from BookRow bookRow in bookStream
        do {
            string key = bookRow.author.toString();
            if !authorsBooks.hasKey(key) {
                authorsBooks[key] = [];
            }
            authorsBooks.get(key).push(bookRow);
        };
    final readonly & map<BookRow[]> clonedMap = authorsBooks.cloneReadOnly();
    return keys.'map(key => clonedMap[key.toString()] ?: []);
}

Executing the following query

{
  authors {
    name
    books {
      title
    }
  }
}

after incorporating DataLoader now issues only two database queries.

SELECT * FROM authors
SELECT * FROM books WHERE author IN (1,2,3,4,5,6,7,8,9,10)

Engaging DataLoader with GraphQL Engine

At a high level, the GraphQL engine breaks the query into subproblems and then constructs the value for the query by solving the subproblems, as shown in the diagram below. The following algorithm demonstrates how the GraphQL engine engages the DataLoader.

1. The GraphQL engine searches for the associated resource/remote function for each field in the query.
2. If a matching resource/remote function with the pattern loadXXX (where XXX is the field name) is found, the engine:
   - Creates a map of DataLoader instances using the provided batch loader functions in the @dataloader:Loader annotation.
   - Makes this map of DataLoader instances available for both the XXX and loadXXX functions.
   - Executes the loadXXX resource method and generates a placeholder value for that field.
3. If no matching loadXXX function is found, the engine executes the corresponding XXX resource function for that field.
4. After completing the above steps, the engine generates a partial value tree with placeholders.
5. The engine then executes the dispatch() function of all the created DataLoaders.
6. For each non-resolved field (placeholder) in the partial value tree, the engine:
   - Executes the corresponding resource function (XXX).
   - Obtains the resolved value and replaces the placeholder with the resolved value.
   - If the resolved value is still a non-resolved field (placeholder), the process repeats steps 1-7.
7. Finally, the fully constructed value tree is returned.
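
The prefetch, dispatch and resolve phases of the algorithm above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the value tree is a flat map with nil as the placeholder, a single hypothetical batch function `batchBookCounts` returns made-up data, and `resolveAll` is an invented name rather than anything in the GraphQL engine.

```ballerina
import ballerina/io;

// Hypothetical batch lookup: one call answers all collected keys.
function batchBookCounts(int[] authorIds) returns map<int> {
    io:println("batch call for ", authorIds.length(), " keys");
    map<int> counts = {};
    foreach int id in authorIds {
        counts[id.toString()] = id * 2; // made-up data
    }
    return counts;
}

function resolveAll(int[] authorIds) returns map<int?> {
    // Prefetch phase: collect each key and leave a placeholder (nil) in the tree.
    map<int?> valueTree = {};
    int[] collectedKeys = [];
    foreach int id in authorIds {
        collectedKeys.push(id);
        valueTree[id.toString()] = ();
    }

    // Dispatch phase: a single batch call for all collected keys.
    map<int> cache = batchBookCounts(collectedKeys);

    // Resolve phase: replace every placeholder with its resolved value.
    foreach string fieldKey in valueTree.keys() {
        valueTree[fieldKey] = cache[fieldKey];
    }
    return valueTree;
}

public function main() {
    io:println(resolveAll([1, 2, 3]));
}
```

However many keys are collected in the prefetch phase, the batch function runs once per dispatch, which is what reduces the N+1 queries to two.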

Future Plans

The DataLoader object will be enhanced with the following public methods:

- loadMany: This method allows adding multiple keys to the key table for future data loading.
- getMany: Given an array of keys, this method retrieves the corresponding values from the DataLoader's result table, returning them as an array of anydata values or an error, if applicable. The return type is (anydata | error)[].
- clear: This method takes a key as a parameter and removes the corresponding result from the result table, effectively clearing the cached data for that key.
- clearAll: This method removes all cached results from the result table, providing a way to clear the entire cache in one operation.
- prime: This method takes a key and an anydata value as arguments and replaces or stores the key-value mapping in the result table. This allows preloading or priming specific data into the cache for efficient retrieval later.

These methods enhance the functionality of the DataLoader, providing more flexibility and control over data loading, caching, and result management.
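
To make the intended cache semantics of prime, clear and clearAll concrete, here is a rough sketch over a plain map. The class `ResultCache` and its `isCached` helper are hypothetical and exist only for this illustration; the actual methods would live on the DataLoader object and operate on its result table.

```ballerina
import ballerina/io;

// Hypothetical result cache; keys are stored by their string form.
class ResultCache {
    private map<anydata> results = {};

    // prime: store or replace the value cached for a key.
    function prime(anydata key, anydata value) {
        self.results[key.toString()] = value;
    }

    // clear: drop the cached value for one key, if present.
    function clear(anydata key) {
        _ = self.results.removeIfHasKey(key.toString());
    }

    // clearAll: drop every cached value.
    function clearAll() {
        self.results.removeAll();
    }

    // Helper for the demo: reports whether a key currently has a cached value.
    function isCached(anydata key) returns boolean {
        return self.results.hasKey(key.toString());
    }
}

public function main() {
    ResultCache cache = new;
    cache.prime(1, "first");
    cache.prime(2, "second");
    cache.clear(1);
    io:println(cache.isCached(1), " ", cache.isCached(2));
    cache.clearAll();
    io:println(cache.isCached(2));
}
```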

hasithaa commented 1 year ago

Not sure loadXXX is an actual resources method. It is an internal method for pre-processing and not part of the graphql schema.

ThisaruGuruge commented 1 year ago

> Not sure loadXXX is an actual resources method. It is an internal method for pre-processing and not part of the graphql schema.

We need to map the exact remote/resource method and the corresponding loader function. Two resources can have different accessors and the same path inside a service. There isn't a way to provide two loader functions in that scenario.

hasithaa commented 1 year ago

I agree. But my point is that these functions are internal functions and the mapping can be done internally. Also, these functions are not part of the original GraphQL schema. Doesn't defining these functions as resource functions break tools such as schema generation?

ThisaruGuruge commented 1 year ago

Yes, this is a valid point. Shall we have a meeting to check out the alternatives?

In the meantime, we will go ahead with this approach. We are hoping to release this as an experimental feature first. Will that be okay?

MohamedSabthar commented 1 year ago

Had a discussion with @sameerajayasoma @shafreenAnfar @ThisaruGuruge regarding this issue. The following points were discussed in the meeting:

- Since batch functions are global, is there a way to associate them with the graphql:Service instead of the loadXXX function? Perhaps with the graphql:ServiceConfig?
- Instead of passing the batch function pointer, can't we directly pass the DataLoader instance?
- Instead of defining loadXXX as a resource/remote method, can't we have regular methods?
  - The current approach of using resource/remote methods (for loadXXX) was employed to uniquely identify the corresponding loadXXX method for a given field. This was necessary because two methods can share the same name while differing in resource accessor or in the remote keyword, resulting in ambiguity.
  - To avoid this ambiguity, we can consider annotating the field and passing the corresponding loadXXX method's function pointer via the annotation.

MohamedSabthar commented 1 year ago

The following changes will be made to the API according to the meeting with the team (@sameerajayasoma @shafreenAnfar @ThisaruGuruge):

The (prefetch) loadXXX method will be renamed to preXXX.
The preXXX method signature will be changed to a regular method instead of resource/remote methods.
As an advanced case, the user can override the default preXXX method using the resource config annotation. See the example below:

isolated distinct service class Author {
    //...

    isolated function prefetchBooks(graphql:Context ctx) {
        // ...
    }

    @graphql:ResourceConfig {
        // ... other fields
        prefetch: self.prefetchBooks
    }
    isolated resource function get books(graphql:Context ctx) returns Book[]|error {
        // ...
    }

    remote function books(BookInput[] input) returns Book[]|error {
        // ...
    }
}

The @dataloader:Loader annotation will be removed from the API, and the user will be able to register the dataloader in the context object and access the dataloader from the context object. With this API change, the map<dataloader:DataLoader> parameter is removed from both the preXXX and the resolver methods. The following are the two new methods that will be added to the context:
```
public isolated class Context {
    // ... omitted for brevity

    public isolated function registerDataLoader(string key, dataloader:DataLoader dataLoader) {
        // ...
    }

    public isolated function getDataLoader(string key) returns dataloader:DataLoader {
        // ...
        // panic if no key found
    }
}
```
The load method in the dataloader will be renamed to add:

public isolated function add(anydata key);

Putting it all together, the following example demonstrates the usage of the new API:

Example

@graphql:ServiceConfig {
    contextInit: isolated function (http:RequestContext requestContext, http:Request request) returns graphql:Context {
        graphql:Context context = new;
        context.registerDataLoader("bookLoader", new DefaultDataLoader(batchBooks));
        return context;
    }
}
service on new graphql:Listener(9090) {
    // ... omitted for brevity
}

isolated distinct service class Author {
    //...

    isolated function preBooks(graphql:Context ctx) {
        dataloader:DataLoader bookLoader = ctx.getDataLoader("bookLoader");
        bookLoader.add(self.author.id);
    }

    isolated resource function get books(graphql:Context ctx) returns Book[]|error {
        dataloader:DataLoader bookLoader = ctx.getDataLoader("bookLoader");
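        // Bind the cached value to BookRow[] first, then wrap each row in a Book
        BookRow[] bookRows = check bookLoader.get(self.author.id);
        return from BookRow bookRow in bookRows
            select new Book(bookRow);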
    }
}

MohamedSabthar commented 1 year ago

As for @MaryamZi's comment, it is currently not possible to pass an instance method reference to the annotation. As an alternative approach, we have considered passing the prefetch method name to the @graphql:ResourceConfig annotation.

Example:

isolated distinct service class Author {
    //...

    isolated function prefetchBooks(graphql:Context ctx) {
        // ...
    }

    @graphql:ResourceConfig {
        // ... other fields
        prefetchMethodName: "prefetchBooks"
    }
    isolated resource function get books(graphql:Context ctx) returns Book[]|error {
        // ...
    }

    remote function books(BookInput[] input) returns Book[]|error {
        // ...
    }
}

We could validate the existence and signature of the "prefetchBooks" at compile time using a compiler plugin. What do you think, @sameerajayasoma?
