Closed ThisaruGuruge closed 1 year ago
Since we cannot handle the DataLoader the way the reference implementation handles dispatching the batch load function, the easiest and most obvious approach is manual dispatching. (This is how the Java DataLoader handles the case.)
Therefore, the proposal will be updated to add a new API to the DataLoader, the dispatch() function, which will allow the user to handle the dispatching of the batch load function. This has its own pros and cons.
With this change, users decide when to dispatch the batch load function, so they have complete control of the logic and the execution. But this also opens up the possibility of the service hanging if the user forgets to call the dispatch function. But IMO, it should be fine.
New design proposal available here: https://github.com/ballerina-platform/ballerina-standard-library/issues/4569
Summary
DataLoader is a generic utility to be used as a part of the data fetching layer to provide a simplified and consistent API over various remote data sources such as databases and web services. In GraphQL, the DataLoader is widely used to overcome the n+1 problem. This proposal is to implement a DataLoader in Ballerina.
Goals
Motivation
The n+1 problem
The n+1 problem occurs in almost all GraphQL service implementations. Consider the following GraphQL schema:
Then consider the following query on the above schema:
When the above document is executed, the GraphQL service first fetches the author list from the data source. But after fetching all the authors, the next resolver has to resolve the books for each author. The issue here is that the books resolver does not have any information about the other authors' books. It only fetches the books for the given author. This way, there needs to be a separate backend call to fetch the books of each author. So, if our data source has 5 authors, there will be a total of 6 calls to the backend (1 for fetching all the authors first + 5 for fetching the books of each author). This is what is called the n+1 problem in GraphQL.
DataLoader
The DataLoader is the solution found by the original developers of the GraphQL spec, where they improved an existing method they used in the Facebook backend to batch and cache data fetching. They later open-sourced it, and it is now used as the reference implementation for DataLoader. Since then, DataLoader implementations have been written in various languages to solve the n+1 problem. This proposal sometimes uses wording taken from the reference implementation documentation.
DataLoader in Ballerina
In almost all GraphQL implementations, the DataLoader is a major requirement. Since the Ballerina GraphQL package is now spec-compliant, we are looking for ways to improve the user experience in the Ballerina GraphQL package. Implementing a DataLoader in Ballerina will improve the user experience drastically. Therefore, this proposal is to implement a DataLoader in Ballerina to provide a better user experience for the users of the Ballerina GraphQL package.
Description
The DataLoader is a batching and caching mechanism for data fetchers from data sources. The users have to provide a batch load function that has an array of keys as the input and returns a future (promise in some other terminologies) of an array of values (and/or errors).
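To make the batch load function idea concrete, here is a minimal Python sketch (not Ballerina, and not the proposed API) that counts backend calls for the 5-author scenario from the Motivation section. The Backend class, its method names, and the sample data are all hypothetical. For simplicity, the batch function returns a plain list instead of a future, and it follows the reference implementation's convention that the returned values have the same length and order as the keys.

```python
class Backend:
    """Hypothetical data source that counts how many calls it receives."""

    def __init__(self, books_by_author):
        self.books_by_author = books_by_author
        self.calls = 0

    def fetch_authors(self):
        self.calls += 1
        return list(self.books_by_author)

    def fetch_books(self, author_id):
        # One backend call per author: this is the "n" in n+1.
        self.calls += 1
        return self.books_by_author[author_id]

    def fetch_books_batch(self, author_ids):
        # Batch load function: a list of keys in, a list of values out,
        # with the same length and the same order as the keys.
        self.calls += 1
        return [self.books_by_author.get(a, []) for a in author_ids]


def resolve_naive(backend):
    """Resolve books author by author (the n+1 pattern)."""
    return {a: backend.fetch_books(a) for a in backend.fetch_authors()}


def resolve_batched(backend):
    """Resolve books for all authors with a single batched call."""
    authors = backend.fetch_authors()
    return dict(zip(authors, backend.fetch_books_batch(authors)))


DATA = {i: [f"book-{i}"] for i in range(1, 6)}  # 5 authors, assumed data

naive = Backend(DATA)
batched = Backend(DATA)
assert resolve_naive(naive) == resolve_batched(batched)
print(naive.calls, batched.calls)  # prints "6 2"
```

Both strategies return the same data; only the number of backend round trips differs.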
APIs
The DataLoader provides the following APIs for the users.
The init Function
This function will initialize a DataLoader instance.
Parameters
batchLoadFunction
This is the function that is called when the DataLoader needs to fetch data. It should accept an array of keys and return a future of an array of values. Following is the function signature:
There are a few constraints this function must uphold:
config
This is used to provide additional configurations for the DataLoader. Following is the options record type proposed.
batchingEnabled: Used to enable/disable batching in the DataLoader. Setting this to false means the batch load function is called for each load.
cachingEnabled: Used to enable/disable caching in the DataLoader. Setting this to false means that, if the same key is loaded multiple times, the DataLoader returns a new future each time and the key appears again in the keys passed to the batch load function.
Note: The reference implementation has some more options/configurations for the DataLoader such as maxBatchSize, cacheKeyFunction, cacheMap, etc. But those are beyond the scope of this proposal and will be added in future iterations.
The load Function
This function is used to load data from the DataLoader using a key. It will return a future for the value that corresponds to the provided key or an error.
The loadMany Function
This is used to load multiple values using multiple keys. This will return a future of an array of values.
The clear Function
This function is used to clear the cache for a given key. This is useful when mutations are going on and the cache needs to be cleared to avoid outdated values being returned from the cache. This function will return the DataLoader instance itself for method chaining.
The clearAll Function
This function is used to clear the cache completely. This can be useful when an operation happens that invalidates the whole cache. This function will return the DataLoader instance itself for method chaining.
The prime Function
This function is used to prime the cache with the provided key and value. This can be useful when a value retrieved by some other means has to be cached in the DataLoader. This function will return the DataLoader instance itself for method chaining.
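The cache-related functions (clear, clearAll, prime) and their method chaining can be pictured with a small Python sketch. This is only an illustration under stated assumptions: LoaderCache, clear_all, and get are hypothetical names, and the sketch assumes, in line with the reference implementation, that priming an existing key is a no-op.

```python
class LoaderCache:
    """Hypothetical stand-in for the DataLoader cache; not the Ballerina API."""

    def __init__(self):
        self._values = {}

    def prime(self, key, value):
        # Assumed behaviour: does nothing if the key is already cached.
        self._values.setdefault(key, value)
        return self  # returning the instance allows method chaining

    def clear(self, key):
        # Removes a single key; a missing key is silently ignored.
        self._values.pop(key, None)
        return self

    def clear_all(self):
        # Drops every cached entry.
        self._values.clear()
        return self

    def get(self, key):
        return self._values.get(key)


cache = LoaderCache()
cache.prime(1, "old").prime(1, "new")  # second prime is ignored
first = cache.get(1)                   # "old"

cache.clear(1).prime(1, "new")         # clear first to force the new value
second = cache.get(1)                  # "new"

cache.clear_all()
third = cache.get(1)                   # None
```

Returning the instance from each call is what makes the clear-then-prime pattern a single chained expression.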
Note: If the provided key already exists, this will do nothing. To forcefully prime the cache with a key, use the clear function first. See the below example:
The dispatch() Function
The reference implementation is written for Node.js. Node.js is single-threaded in nature, and it simulates asynchronous logic by scheduling functions on an event loop. Node.js generates so-called ticks, in which the queued functions are dispatched for execution, and the reference implementation uses the nextTick() function in Node.js to automatically dequeue load requests and send them to the batch execution function for processing.
In Ballerina, we do not have such a concept. Therefore, this proposal introduces a separate function, dispatch(), which can be used to manually dispatch the batch load function. This gives the developer full control over when to dispatch the load function. The developer can attach any logic that determines when a dispatch should take place.
However, this comes with a responsibility. If the developer forgets to call the dispatch function, the batch load function will never be called and the request will hang.
Usage:
Complete Example
Following is sample code for DataLoader usage in a Ballerina GraphQL service.
There is a small catch here. In the reference implementation, which is written in JavaScript, they use the JavaScript event loop to trigger the load function. They coalesce all individual loads which occur within a single frame of execution (a single tick of the event loop) and then call the batch function with all requested keys. In Ballerina, we have to find a way to handle this, as we do not have access to the Ballerina scheduler.
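One way to picture manual dispatching as a replacement for the event-loop tick is the Python sketch below. It is not the proposed Ballerina API: ManualDataLoader, load_books, and the sample keys are hypothetical, caching is always on, and the batch load function returns a plain list instead of a future to keep the example short.

```python
from concurrent.futures import Future


class ManualDataLoader:
    """Illustrative loader with manual dispatch; not the Ballerina API."""

    def __init__(self, batch_load_function):
        self._batch_load = batch_load_function
        self._cache = {}   # key -> Future, shared by repeated loads of a key
        self._queue = []   # keys waiting for the next dispatch

    def load(self, key):
        # Queue the key once and hand back a future for its value.
        if key not in self._cache:
            self._cache[key] = Future()
            self._queue.append(key)
        return self._cache[key]

    def dispatch(self):
        # Send all queued keys to the batch load function in a single call.
        keys, self._queue = self._queue, []
        if not keys:
            return
        values = self._batch_load(keys)
        if len(values) != len(keys):
            raise ValueError("batch load function must return one value per key")
        for key, value in zip(keys, values):
            self._cache[key].set_result(value)


batches = []  # records the keys of every batch call, for illustration


def load_books(author_ids):
    batches.append(list(author_ids))
    return [f"books of author {a}" for a in author_ids]


loader = ManualDataLoader(load_books)
futures = [loader.load(a) for a in (1, 2, 1, 3)]
pending = [f.done() for f in futures]  # all False: nothing is resolved yet
loader.dispatch()                      # one batch call with keys [1, 2, 3]
results = [f.result() for f in futures]
```

Without the dispatch() call, result() would block indefinitely, which mirrors the hanging-request risk of forgetting to dispatch.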
Note: Even though the DataLoader is mostly used with GraphQL services, it is not a part of the GraphQL specification, and it is used in other use cases as well. But in Ballerina, we do not see any other use cases as of yet. Therefore, this proposal suggests implementing it as a submodule of the Ballerina GraphQL package.
Risks and Assumptions
Dependencies