Proposal: Introduce DataLoader for Ballerina

ThisaruGuruge commented 2 years ago

Summary

DataLoader is a generic utility used as part of the data fetching layer to provide a simplified and consistent API over various remote data sources such as databases and web services. In GraphQL, the DataLoader is widely used to overcome the n+1 problem. This proposal is to implement a DataLoader in Ballerina.

Goals

Implement DataLoader as a sub-module of the Ballerina GraphQL package.

Motivation

The `n+1 problem`

The n+1 problem occurs in almost all the GraphQL service implementations. Consider the following GraphQL schema:

type Book {
    title: String!
    author: Author!
}

type Author {
    name: String!
    books: [Book!]!
}

type Query {
    """
    Returns the list of authors
    """
    authors: [Author!]!
}

Then consider the following query on the above schema:

query {
    authors {
        name
        books {
            title
        }
    }
}

When the above document is executed, the GraphQL service first fetches the author list from the data source. After fetching all the authors, the next resolver has to resolve the books for each author. The issue is that the books resolver has no information about the other authors' books; it only fetches the books for the given author. As a result, a separate backend call is needed to fetch the books of each author. So, if our data source has 5 authors, there will be a total of 6 calls to the backend: 1 for fetching all the authors, plus 5 for fetching the books of each author. This is what is called the n+1 problem in GraphQL.
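The call counts can be reproduced with a small sketch. This is only an illustration against an in-memory stand-in for the data source; the names `fetchAuthorIds`, `fetchBooksByAuthor`, `fetchBooksByAuthors`, and `backendCalls` are hypothetical and not part of any Ballerina package.

```ballerina
import ballerina/io;

type Book record {|
    string title;
    int authorId;
|};

// Hypothetical in-memory data standing in for a remote data source.
final Book[] books = [
    {title: "Book A", authorId: 1},
    {title: "Book B", authorId: 2},
    {title: "Book C", authorId: 3},
    {title: "Book D", authorId: 4},
    {title: "Book E", authorId: 5}
];

// Counts simulated round trips to the backend.
int backendCalls = 0;

function fetchAuthorIds() returns int[] {
    backendCalls += 1;
    return [1, 2, 3, 4, 5];
}

// One backend call per author: this is what a naive books resolver does.
function fetchBooksByAuthor(int authorId) returns Book[] {
    backendCalls += 1;
    return from Book book in books where book.authorId == authorId select book;
}

// One backend call for all authors: this is what a batch function would do.
function fetchBooksByAuthors(int[] ids) returns Book[] {
    backendCalls += 1;
    return from Book book in books where ids.indexOf(book.authorId) is int select book;
}

function countNaiveCalls() returns int {
    backendCalls = 0;
    foreach int id in fetchAuthorIds() {
        _ = fetchBooksByAuthor(id);
    }
    return backendCalls;
}

function countBatchedCalls() returns int {
    backendCalls = 0;
    _ = fetchBooksByAuthors(fetchAuthorIds());
    return backendCalls;
}

public function main() {
    io:println(countNaiveCalls()); // 1 call for the authors + 5 calls for the books
    io:println(countBatchedCalls()); // 1 call for the authors + 1 batched call for the books
}
```

With batching, the number of backend calls stays at 2 regardless of how many authors are returned.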

DataLoader

The DataLoader is the solution found by the original developers of the GraphQL spec, who improved an existing method used in the Facebook backend to batch and cache data fetching. They later open-sourced it, and it is now used as the reference implementation for DataLoader. Since then, DataLoader implementations have been written in various languages to solve the n+1 problem. This proposal sometimes uses wording taken from the reference implementation documentation.

DataLoader in Ballerina

In almost all GraphQL implementations, the DataLoader is a major requirement. Since the Ballerina GraphQL package is now spec-compliant, we are looking for ways to improve the user experience in the Ballerina GraphQL package. Implementing a DataLoader in Ballerina will improve the user experience drastically. Therefore, this proposal is to implement a DataLoader in Ballerina to provide a better user experience for the users of the Ballerina GraphQL package.

Description

The DataLoader is a batching and caching mechanism for data fetchers from data sources. The users have to provide a batch load function that has an array of keys as the input and returns a future (promise in some other terminologies) of an array of values (and/or errors).

APIs

The DataLoader provides the following APIs for the users.

The `init` Function

This function will initialize a DataLoader instance.

Parameters

batchLoadFunction: This function is called when the DataLoader needs to fetch the data. It should accept an array of keys and return a future of an array of values. Following is the function signature:
```
public type Key string|int;
function batchLoadFunction(Key[] keys) returns future<(any|error)[]> {
}
```
There are a few constraints this function must uphold:
- The Array of values must be the same length as the Array of keys
- Each index in the Array of values must correspond to the same index in the Array of keys
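A batch load function that upholds both constraints could look like the following sketch. It assumes an in-memory lookup table in place of a real database; `authorNames` and `fetchAuthorNames` are hypothetical names used only for illustration. A missing key produces an `error` in the matching position instead of being skipped, so the values stay aligned with the keys.

```ballerina
import ballerina/io;

public type Key string|int;

// Hypothetical lookup table standing in for a database; key 2 is deliberately missing.
final readonly & map<string> authorNames = {"1": "Ann", "3": "Cy"};

// Produces exactly one entry per key, in the same order as the keys.
function fetchAuthorNames(Key[] keys) returns (any|error)[] {
    (any|error)[] values = [];
    foreach Key key in keys {
        string? name = authorNames[key.toString()];
        if name is string {
            values.push(name);
        } else {
            // Keep the slot so that index i of the values matches index i of the keys.
            values.push(error("No author for key " + key.toString()));
        }
    }
    return values;
}

// Matches the proposed signature by running the lookup on a new strand.
function batchLoadFunction(Key[] keys) returns future<(any|error)[]> {
    return start fetchAuthorNames(keys);
}

public function main() returns error? {
    (any|error)[] values = check wait batchLoadFunction([1, 2, 3]);
    io:println(values.length()); // same length as the keys
    io:println(values[1] is error); // the missing key maps to an error
}
```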
config: This is used to provide additional configurations for the DataLoader. Following is the proposed configuration record type.
```
public type Config record {|
    boolean batchingEnabled = true;
    boolean cachingEnabled = true;
|};
```
- batchingEnabled: Used to enable/disable batching in the DataLoader. Setting this to false means the batch load function is called for each load.
- cachingEnabled: Used to enable/disable caching in the DataLoader. Setting this to false means that, when the same key is loaded multiple times, each load returns a new future and the key is passed to the batch load function again.

Note: The reference implementation has some more options/configurations for the DataLoader such as maxBatchSize, cacheKeyFunction, cacheMap, etc. But those are beyond this proposal and will be added in future iterations.

The `load` Function

This function is used to load data from the DataLoader using a key. It will return a future for the value that corresponds to the provided key or an error.

public isolated function load(Key key, typedesc T = <>) returns future<T|error>

The `loadMany` function

This is used to load multiple values using multiple keys. This will return a future of an array of values.

public isolated function loadMany(Key[] keys, typedesc T = <>) returns future<(T|error)[]>

The `clear` Function

This function is used to clear the cache for a given key. This is useful when mutations are going on and the cache needs to be cleared to avoid outdated values returned from the cache. This function will return the DataLoader instance itself for method chaining.

public isolated function clear(Key key) returns DataLoader

The `clearAll` Function

This function is used to clear the cache completely. This can be useful when an operation invalidates the whole cache. This function will return the DataLoader instance itself for method chaining.

public isolated function clearAll() returns DataLoader

The `prime` Function

This function is used to prime the cache with the provided key and value. This can be useful when a value retrieved by some other means has to be cached in the DataLoader. This function will return the DataLoader instance itself for method chaining.

public isolated function prime(Key key, any value) returns DataLoader

Note: If the provided key already exists, this will do nothing. To forcefully prime the cache with a key, use the clear function first. See the below example:

DataLoader dataLoader = new(batchLoadFunction);
//...
dataLoader = dataLoader.clear("A").prime("A", {foo: "Foo", bar: "Bar"});
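The interaction between `prime` and `clear` can be sketched with a cache-only stand-in. The `CacheSketch` class and its `lookup` method below are hypothetical and cover only the cache semantics described above, not batching or futures. Note that this sketch keys the cache by `key.toString()`, so the keys `1` and `"1"` would collide; the proposal does not specify how keys are compared.

```ballerina
import ballerina/io;

public type Key string|int;

// Hypothetical cache-only stand-in for the proposed DataLoader.
class CacheSketch {
    private final map<any> cache = {};

    // Adds the value only if the key is not cached yet.
    function prime(Key key, any value) returns CacheSketch {
        string cacheKey = key.toString();
        if !self.cache.hasKey(cacheKey) {
            self.cache[cacheKey] = value;
        }
        return self;
    }

    // Removes the cached value for the key, if there is one.
    function clear(Key key) returns CacheSketch {
        _ = self.cache.removeIfHasKey(key.toString());
        return self;
    }

    // Returns the cached value, or nil when the key is not cached.
    function lookup(Key key) returns any {
        return self.cache[key.toString()];
    }
}

public function main() {
    CacheSketch cache = new;
    _ = cache.prime("A", "first").prime("A", "second");
    io:println(cache.lookup("A")); // still "first": the second prime is ignored

    _ = cache.clear("A").prime("A", "second");
    io:println(cache.lookup("A")); // "second": clear followed by prime replaces the value
}
```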

The `dispatch()` Function

The reference implementation is written for Node.js. Node.js runs JavaScript on a single thread and achieves asynchrony by queuing callbacks on an event loop. Queued callbacks are run in so-called ticks, and the reference implementation uses the process.nextTick() function in Node.js to automatically dequeue load requests and send them to the batch execution function for processing.

In Ballerina, we do not have such a concept. Therefore, this proposal introduces a separate function, dispatch(), which is called manually to invoke the batch load function. This gives the developer full control over when the batch load function is dispatched. The developer can attach any logic that determines when a dispatch should take place.

However, this comes with a responsibility. If the developer forgets to call the dispatch function, the batch load function will never be called and the request will hang.

public isolated function dispatch()

Usage:

dataloader:DataLoader loader = new (batchLoadFunction);
// ...
loader.dispatch();
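The manual dispatch idea can be sketched as a queue that is flushed on demand. This is a simplified, synchronous illustration and not the proposed API: `ManualBatcher`, `enqueue`, `result`, `squareBatch`, and `batchCalls` are hypothetical names, and results are read back with `result` instead of being delivered through futures. It assumes the batch function upholds the length and ordering constraints described earlier.

```ballerina
import ballerina/io;

public type Key string|int;

// Hypothetical synchronous batch function type; the proposed one returns a future.
type BatchFunction function (Key[] keys) returns (any|error)[];

// Counts how many times the batch function was invoked.
int batchCalls = 0;

function squareBatch(Key[] keys) returns (any|error)[] {
    batchCalls += 1;
    (any|error)[] values = [];
    foreach Key key in keys {
        if key is int {
            values.push(key * key);
        } else {
            values.push(error("Non-integer key " + key));
        }
    }
    return values;
}

class ManualBatcher {
    private final BatchFunction batchFunction;
    private Key[] pendingKeys = [];
    private final map<any|error> results = {};

    function init(BatchFunction batchFunction) {
        self.batchFunction = batchFunction;
    }

    // Only records the key; no backend call happens here.
    function enqueue(Key key) {
        self.pendingKeys.push(key);
    }

    // Sends all pending keys to the batch function in a single call.
    function dispatch() {
        if self.pendingKeys.length() == 0 {
            return;
        }
        Key[] keys = self.pendingKeys;
        self.pendingKeys = [];
        BatchFunction batch = self.batchFunction;
        (any|error)[] values = batch(keys);
        foreach int i in 0 ..< keys.length() {
            self.results[keys[i].toString()] = values[i];
        }
    }

    // Returns the loaded value, or nil if the key has not been dispatched yet.
    function result(Key key) returns any|error {
        return self.results[key.toString()];
    }
}

public function main() {
    ManualBatcher batcher = new (squareBatch);
    batcher.enqueue(2);
    batcher.enqueue(3);
    io:println(batchCalls); // nothing has been fetched before dispatch
    batcher.dispatch();
    io:println(batchCalls); // both keys went out in one batch call
    io:println(batcher.result(3));
}
```

The sketch also shows the risk noted above: until `dispatch()` is called, no value is available for any enqueued key.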

Complete Example

The following sample shows DataLoader usage in a Ballerina GraphQL service.

import ballerina/graphql;
import ballerina/graphql.dataloader;

public type Address record {|
    int number;
    string street;
    string city;
|};

public type Person record {|
    string name;
    int age;
    Address? address;
|};

function profileLoadFunction(dataloader:Key[] keys) returns future<(Person|error)[]> {
    // Load from database
}

function addressLoadFunction(dataloader:Key[] keys) returns future<(Address|error)[]> {
    // Load from database
}

service on new graphql:Listener(4000) {
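    resource function get profiles() returns (Person|error)[]|error {
        dataloader:DataLoader profileLoader = new (profileLoadFunction);
        dataloader:DataLoader addressLoader = new (addressLoadFunction);

        // Queue the loads first; nothing is fetched until dispatch() is called.
        future<Person|error> p1Future = profileLoader.load(1);
        future<Person|error> p2Future = profileLoader.load(2);
        future<Address|error> a1Future = addressLoader.load(1);
        future<Address|error> a2Future = addressLoader.load(2);

        // Dispatch before waiting; waiting first would hang the request.
        profileLoader.dispatch();
        addressLoader.dispatch();

        Person p1 = check wait p1Future;
        Person p2 = check wait p2Future;
        p1.address = check wait a1Future;
        p2.address = check wait a2Future;
        return [p1, p2];
    }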
}

There is a small catch here. In the reference implementation, which is written in JavaScript, they use the JavaScript event loop to trigger the load function. They coalesce all individual loads which occur within a single frame of execution (a single tick of the event loop) and then call the batch function with all requested keys. In Ballerina, we have to find a way to handle this, as we do not have access to the Ballerina scheduler.

Note: Even though the DataLoader is mostly used with GraphQL services, it is not a part of the GraphQL specification, and it is used in other use cases as well. But in Ballerina, we do not see any other use cases as of yet. Therefore, this proposal suggests implementing it as a submodule of the Ballerina GraphQL package.

Risks and Assumptions

The DataLoader is intended to be used per request. The cache used in the DataLoader has no size limit, so it can grow and consume a large amount of memory. Therefore, it is assumed that a single DataLoader instance is used per request. In the future, we plan to improve this behavior by providing a way to plug custom caching mechanisms into the DataLoader.

Dependencies

We might need some runtime APIs and improvements to handle the scheduling and to determine the exact time to call the batch function.

ThisaruGuruge commented 1 year ago

Since we cannot dispatch the load function the way the reference implementation does, the easiest and most obvious way is to use manual dispatching. (This is how the Java DataLoader handles the case.)

Therefore, the proposal will be updated to add a new API in the DataLoader, the dispatch() function, which will allow the user to handle the dispatching of the batch load function. This has its own pros and cons.

With this change, the users can decide when to dispatch the batch load function, so the user has complete control of the logic and the execution. But this also opens the possibility of hanging the service if the user forgets to call the dispatch function. But IMO, it should be fine.

MohamedSabthar commented 1 year ago

New design proposal available here: https://github.com/ballerina-platform/ballerina-standard-library/issues/4569
