Open MohamedSabthar opened 1 year ago
Any clues @hasithaa @sameerajayasoma @jclark ?
Similar to #697
I think this comment and following comments outline a way (adding a prefetch method) to handle this.
Hi @jclark
AFAIU, the mentioned issue is not exactly the same as the issue we discuss here.
The #697 is an optimization on how to retrieve data from a data store. Originally, we need two DB queries to retrieve data, one to retrieve Customer
data, and one to retrieve Shipper
data. This even applies to REST APIs where we need two calls to retrieve both entities. In GraphQL there’s room for improvement since a single GraphQL request contains all the fields that the client requested, so we might reduce the number of DB calls.
The issue we mentioned here is different, and it is an inherent issue with GraphQL schemas. The issue occurs when there are two or more levels in the GraphQL schema where we need to retrieve the information on the top-level in order to retrieve the second-level entities. For example, in the example mentioned in the issue, we need to retrieve the Author
information first, and based on the retrieved Author
s, we need to retrieve the Book
s. In this case, we have no prior knowledge of what we are to resolve until we hit the top-level resource. Therefore, for each resolved Author
, we have to call DB again to retrieve Book
s. Therefore, if there are N
authors, we have to call the DB N+1
times, first time to retrieve all the Author
s, then one for each Author
.
Therefore, if I understood this correctly, we do not have any prior knowledge of what to resolve at the second-level, there’s no way to reduce these calls without having something like [dataloader](https://github.com/graphql/dataloader)
, which is intended to solve the N+1
problem.
I might’ve not correctly understood the prefetch resources discussed in #697. Can we have a call to discuss this further and come to a resolution?
@ThisaruGuruge I agree the case you mentioned is different and cannot be solved just by the prefetch
idea.
I guess you would need to do something similar to the JavaScript: the resource functions would return objects (which don't actually do any database access); then you call a method on the top-level object to go create a query and access the database.
I am happy to have a call to discuss if necessary.
I am happy to have a call to discuss if necessary.
Hi @jclark We have scheduled a call on Tuesday, April 25 · 12:30 – 1:30pm (Time zone: Asia/Colombo) to discuss this issue
@MohamedSabthar Please send invites via email not issues (I don't think I'm available then).
We (@ThisaruGuruge, @MohamedSabthar and @shafreenAnfar) had a meeting and came up with a possible solution for the problem.
Below is the aforementioned program written in Ballerina without the DataLoader
solution.
import ballerina/graphql;
import ballerina/sql;
service on new graphql:Listener(9090) {
resource function get authors(int[] ids) returns Author[]|error {
var query = sql:queryConcat(`SELECT * FROM authors WHERE id IN (`, sql:arrayFlattenQuery(ids), `)`);
stream<AuthorRow, sql:Error?> authorStream = dbClient->query(query);
return from AuthorRow authorRow in authorStream
select new (authorRow);
}
}
isolated distinct service class Author {
private final readonly & AuthorRow author;
isolated function init(AuthorRow author) {
self.author = author.cloneReadOnly();
}
isolated resource function get name() returns string {
return self.author.name;
}
isolated resource function get books() returns Book[]|error {
int authorId = self.author.id;
var query = sql:queryConcat(`SELECT * FROM books WHERE author = ${authorId}`);
stream<BookRow, sql:Error?> bookStream = dbClient->query(query);
return from BookRow bookRow in bookStream
select new Book(bookRow);
}
}
isolated distinct service class Book {
private final readonly & BookRow book;
isolated function init(BookRow book) {
self.book = book.cloneReadOnly();
}
isolated resource function get id() returns int {
return self.book.id;
}
isolated resource function get title() returns string {
return self.book.title;
}
}
Here the n+1 problem is related to the below function.
isolated resource function get books() returns Book[]|error
Below is the same program written in Ballerina with the possible DataLoader
solution.
import ballerina/graphql;
import ballerina/sql;
service on new graphql:Listener(9090) {
resource function get authors(int[] ids) returns Author[]|error {
var query = sql:queryConcat(`SELECT * FROM authors WHERE id IN (`, sql:arrayFlattenQuery(ids), `)`);
stream<AuthorRow, sql:Error?> authorStream = dbClient->query(query);
return from AuthorRow authorRow in authorStream
select new (authorRow);
}
}
isolated distinct service class Author {
private final readonly & AuthorRow author;
isolated function init(AuthorRow author) {
self.author = author.cloneReadOnly();
}
isolated resource function get name() returns string {
return self.author.name;
}
isolated resource function get books(DataLoader bookloader) returns Book[]|error {
BookRow[] bookrows = check bookloader.get(self.author.id);
return from BookRow bookRow in bookrows select new Book(bookRow);
}
@Loader {
batchFuntion: bookLoaderFunction
}
isolated resource function get loadBooks(DataLoader bookLoader) {
bookLoader.load(self.author.id);
}
}
function (anydata[] ids) returns anydata[]|error bookLoaderFunction = function (anydata[] ids) returns BookRow[][]|error {
var query = sql:queryConcat(`SELECT * FROM books WHERE id IN (`, sql:arrayFlattenQuery(<int[]>ids), `)`);
stream<BookRow, sql:Error?> bookStream = dbClient->query(query);
map<BookRow[]> authorsBooks = {};
checkpanic from BookRow bookRow in bookStream
do {
string key = bookRow.author.toString();
if !authorsBooks.hasKey(key) {
authorsBooks[key] = [];
}
authorsBooks.get(key).push(bookRow);
};
return ids.'map(key => authorsBooks.hasKey(key.toString()) ? authorsBooks.get(key.toString()) : []);
};
How it works is as follows.
loadXXX
.DataLoader
using the provided batch loader function in @Loader
annotation.DataLoader
instance is made available for book
and loadBook
functions.books
function. Instead, in this case, the engine executes the loadBooks
function.dispatch()
function of all the DataLoaders
. Then
in the second phase the engine executes the books
resource and completes the current level. Then it moves onto executing the following level.Following is the type definition for the @Loader
public type LoaderRecord record {
string id?; // if not provided, Listener will generate one
function (anydata[] ids) returns anydata[]|error batchFuntion;
};
public annotation LoaderRecord Loader on function;
Following is the type definition and the class for DataLoader
.
type DataLoader object {
isolated function load(anydata id);
isolated function get(anydata id, typedesc<any> t = <>) returns t|error;
isolated function dispatch() returns error?;
};
isolated class DefaultDataLoader {
*DataLoader;
private final anydata[] ids = [];
private anydata[] loaderResult = [];
private final (isolated function (anydata[] ids) returns anydata[]|error) loaderFunction;
isolated function init(isolated function (anydata[] ids) returns anydata[]|error loadFunction) {
self.loaderFunction = loadFunction;
}
isolated function load(anydata id) {
lock {
self.ids.push(id.clone());
}
}
// using the loadedResult array of arrays, return the result for the given id
isolated function get(anydata id, typedesc<any> t = <>) returns t|error = external;
isolated function dispatch() returns error? {
lock {
self.loaderResult = check self.loaderFunction(self.ids.clone());
// implement rest of the logic
}
}
}
The same code can be found in Github repo gql-loader-samples. We are considering of doing a PoC to see if it actually works.
We implemented the above suggestion as a proof of concept (POC) with some minor modifications, and it works! We were able to solve the problem using the below mentioned algorithm, where we break the query into subproblems and then construct the value for the query by solving the subproblems.
Here is a tree diagram that demonstrates how we solve the problem at a high level:
Here is the high-level algorithm:
@jclark We can have a call if you have any suggestions for improvements or alternate approaches ?
The problem with GraphQL and N+1 queries
The N+1 problem is a performance issue that can occur when querying data from a GraphQL API. It arises when a single GraphQL query triggers multiple database queries, resulting in significant overhead and reduced performance. The problem is due to the flexible data model of GraphQL, which allows clients to specify the exact fields they want to retrieve in a query. To avoid the N+1 problem, the DataLoader library provides mechanisms like batching and caching which allows multiple database queries to be executed in a single round trip.
Example javascript graphql code to demonstrate N+1 queries
The full code for this example can be found here
Querying the following graphql query operation
for the above javascript service will result in following following N+1 db queries
How does the DataLoader library work?
The javascript DataLoader library works by allowing a GraphQL resolver function to defer a database query and return a
promise
object to the GraphQL server. The DataLoader library then collects all deferred queries from the resolver functions and batches them into a single round trip to the database.When the batched queries are executed, DataLoader caches the results and returns the resolved promise objects to the original resolver functions. This ensures that each query is executed only once, and subsequent requests for the same data are served from the cache.
Example javascript graphql code to demonstrate batching with DataLoader
The full code for this example can be found here
Load method
The
load
method of the DataLoader takes a single key as input and returns a Promise that resolves to the corresponding value for that key. If the value for the key is not already in the cache, DataLoader will queue a new batched query for that key, which will be dispatched at the end of the current event loop.BatchLoad function
In the example, the
batchBooks
function is a function defined by the user, which serves as theBatchLoad
function that is supplied as the first argument to the DataLoader constructor. Its purpose is to batch and load data for multiple keys simultaneously. Whenever a load method is invoked with a particular key, the DataLoader instance queues up that key for deferred database query execution. Then, all of the queued keys are sent to the batchLoad function upon DataLoader batch dispatch.querying the same graphql query operation as above will result in following db queries
When does the DataLoader dispatch a batched (database) query?
DataLoader dispatches a batch of deferred queries to the underlying data source when the current execution context is about to be completed. This ensures that DataLoader has collected all deferred queries from resolver functions and can batch them into a single round trip (using the user written batch function) to the database. The batch dispatching mechanism is triggered automatically by DataLoader and dispatch logic is hidden from the resolvers.
Is there a way to achieve this functionality in Ballerina ?