ballerina-platform / ballerina-spec

Ballerina Language and Platform Specifications
Other
168 stars 53 forks source link

How to write a DataLoader in Ballerina to solve GraphQL N+1 problem #1228

Open MohamedSabthar opened 1 year ago

MohamedSabthar commented 1 year ago

The problem with GraphQL and N+1 queries

The N+1 problem is a performance issue that can occur when querying data from a GraphQL API. It arises when a single GraphQL query triggers multiple database queries, resulting in significant overhead and reduced performance. The problem is due to the flexible data model of GraphQL, which allows clients to specify the exact fields they want to retrieve in a query. To avoid the N+1 problem, the DataLoader library provides mechanisms like batching and caching which allows multiple database queries to be executed in a single round trip.

Example javascript graphql code to demonstrate N+1 queries
import { ApolloServer } from "@apollo/server";
import { startStandaloneServer } from "@apollo/server/standalone";
import query from "./query.js";
import mysql from "mysql2";

const typeDefs = `#graphql
  type Query {
    authors: [Author!]!
  }

  type Book {
    id: Int!
    title: String!
  }

  type Author {
    name: String!
    books: [Book!]!
  }
`;

const resolvers = {
  Query: {
    authors: () => {
      const sql = `SELECT * FROM authors`;
      console.log(mysql.format(sql));
      return query(sql);
    },
  },
  Author: {
    books: (author) => {
      const sql = `SELECT * FROM books WHERE author = ?`;
      const params = [author.id];
      console.log(mysql.format(sql, params));
      return query(sql, params);
    },
  },
};

const server = new ApolloServer({
  typeDefs,
  resolvers,
});

const { url } = await startStandaloneServer(server, {
  listen: { port: 9090 },
});

console.log(`🚀  Server ready at: ${url}`);

The full code for this example can be found here

Querying the following graphql query operation

query Query {
  authors {
    books {
      title
    }
  }
}

for the above javascript service will result in following following N+1 db queries

SELECT * FROM authors              # returns 10 authors
SELECT * FROM books WHERE author = 1
SELECT * FROM books WHERE author = 2
SELECT * FROM books WHERE author = 3
SELECT * FROM books WHERE author = 4
SELECT * FROM books WHERE author = 5
SELECT * FROM books WHERE author = 6
SELECT * FROM books WHERE author = 7
SELECT * FROM books WHERE author = 8
SELECT * FROM books WHERE author = 9
SELECT * FROM books WHERE author = 10 
# a separate query is executed foreach author therefore N+1 queries

How does the DataLoader library work?

The javascript DataLoader library works by allowing a GraphQL resolver function to defer a database query and return a promise object to the GraphQL server. The DataLoader library then collects all deferred queries from the resolver functions and batches them into a single round trip to the database.

When the batched queries are executed, DataLoader caches the results and returns the resolved promise objects to the original resolver functions. This ensures that each query is executed only once, and subsequent requests for the same data are served from the cache.

Example javascript graphql code to demonstrate batching with DataLoader

// This is a user defined batch function provided to the DataLoader constructor
async function batchBooks(authorIds) {
  const sql = `SELECT * FROM books WHERE author IN (?)`;
  const params = [authorIds];
  const books = await query(sql, params);
  const grouped = ramda.groupBy((book) => book.author, books);
  return ramda.map((id) => grouped[id] || [], authorIds);
}

const resolvers = {
  Query: {
    authors: () => {
      const sql = `SELECT * FROM authors`;
      console.log(mysql.format(sql));
      return query(sql);
    },
  },
  Author: {
    books: ({ id }, _, { bookLoader }) => bookLoader.load(id).then((books)=> books),
    // Following commented code is an alternative to the above line.
    // books: async ({ id }, _, { bookLoader }) => {
    //   var books = await bookLoader.load(id);
    //   return books;
    // },
  },
};

const { url } = await startStandaloneServer(server, {
  listen: { port: 9090 },

  context: () => {
    return {
      bookLoader: new DataLoader((keys) => batchBooks(keys))
    };
  },
})

console.log(`🚀  Server ready at: ${url}`);

The full code for this example can be found here

Load method

The load method of the DataLoader takes a single key as input and returns a Promise that resolves to the corresponding value for that key. If the value for the key is not already in the cache, DataLoader will queue a new batched query for that key, which will be dispatched at the end of the current event loop.

BatchLoad function

In the example, the batchBooks function is a function defined by the user, which serves as the BatchLoad function that is supplied as the first argument to the DataLoader constructor. Its purpose is to batch and load data for multiple keys simultaneously. Whenever a load method is invoked with a particular key, the DataLoader instance queues up that key for deferred database query execution. Then, all of the queued keys are sent to the batchLoad function upon DataLoader batch dispatch.

querying the same graphql query operation as above will result in following db queries

SELECT * FROM authors   # returns 10 authors
SELECT * FROM books WHERE author IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# all books are fetched in a single query using the DataLoader batching mechanism

When does the DataLoader dispatch a batched (database) query?

DataLoader dispatches a batch of deferred queries to the underlying data source when the current execution context is about to be completed. This ensures that DataLoader has collected all deferred queries from resolver functions and can batch them into a single round trip (using the user written batch function) to the database. The batch dispatching mechanism is triggered automatically by DataLoader and dispatch logic is hidden from the resolvers.

Is there a way to achieve this functionality in Ballerina ?

MohamedSabthar commented 1 year ago

Any clues @hasithaa @sameerajayasoma @jclark ?

jclark commented 1 year ago

Similar to #697

jclark commented 1 year ago

I think this comment and following comments outline a way (adding a prefetch method) to handle this.

ThisaruGuruge commented 1 year ago

Hi @jclark

AFAIU, the mentioned issue is not exactly the same as the issue we discuss here.

The #697 is an optimization on how to retrieve data from a data store. Originally, we need two DB queries to retrieve data, one to retrieve Customer data, and one to retrieve Shipper data. This even applies to REST APIs where we need two calls to retrieve both entities. In GraphQL there’s room for improvement since a single GraphQL request contains all the fields that the client requested, so we might reduce the number of DB calls.

The issue we mentioned here is different, and it is an inherent issue with GraphQL schemas. The issue occurs when there are two or more levels in the GraphQL schema where we need to retrieve the information on the top-level in order to retrieve the second-level entities. For example, in the example mentioned in the issue, we need to retrieve the Author information first, and based on the retrieved Authors, we need to retrieve the Books. In this case, we have no prior knowledge of what we are to resolve until we hit the top-level resource. Therefore, for each resolved Author, we have to call DB again to retrieve Books. Therefore, if there are N authors, we have to call the DB N+1 times, first time to retrieve all the Authors, then one for each Author.

Therefore, if I understood this correctly, we do not have any prior knowledge of what to resolve at the second-level, there’s no way to reduce these calls without having something like [dataloader](https://github.com/graphql/dataloader), which is intended to solve the N+1 problem.

I might’ve not correctly understood the prefetch resources discussed in #697. Can we have a call to discuss this further and come to a resolution?

jclark commented 1 year ago

@ThisaruGuruge I agree the case you mentioned is different and cannot be solved just by the prefetch idea.

I guess you would need to do something similar to the JavaScript: the resource functions would return objects (which don't actually do any database access); then you call a method on the top-level object to go create a query and access the database.

I am happy to have a call to discuss if necessary.

MohamedSabthar commented 1 year ago

I am happy to have a call to discuss if necessary.

Hi @jclark We have scheduled a call on Tuesday, April 25 · 12:30 – 1:30pm (Time zone: Asia/Colombo) to discuss this issue

jclark commented 1 year ago

@MohamedSabthar Please send invites via email not issues (I don't think I'm available then).

shafreenAnfar commented 1 year ago

We (@ThisaruGuruge, @MohamedSabthar and @shafreenAnfar) had a meeting and came up with a possible solution for the problem.

Below is the aforementioned program written in Ballerina without the DataLoader solution.

import ballerina/graphql;
import ballerina/sql;

service on new graphql:Listener(9090) {
   resource function get authors(int[] ids) returns Author[]|error {
       var query = sql:queryConcat(`SELECT * FROM authors WHERE id IN (`, sql:arrayFlattenQuery(ids), `)`);
       stream<AuthorRow, sql:Error?> authorStream = dbClient->query(query);
       return from AuthorRow authorRow in authorStream
           select new (authorRow);
   }
}

isolated distinct service class Author {
   private final readonly & AuthorRow author;

   isolated function init(AuthorRow author) {
       self.author = author.cloneReadOnly();
   }

   isolated resource function get name() returns string {
       return self.author.name;
   }

   isolated resource function get books() returns Book[]|error {
       int authorId = self.author.id;
       var query = sql:queryConcat(`SELECT * FROM books WHERE author = ${authorId}`);
       stream<BookRow, sql:Error?> bookStream = dbClient->query(query);
       return from BookRow bookRow in bookStream
           select new Book(bookRow);
   }
}

isolated distinct service class Book {
   private final readonly & BookRow book;

   isolated function init(BookRow book) {
       self.book = book.cloneReadOnly();
   }

   isolated resource function get id() returns int {
       return self.book.id;
   }

   isolated resource function get title() returns string {
       return self.book.title;
   }
}

Here the n+1 problem is related to the below function.

isolated resource function get books() returns Book[]|error 

Below is the same program written in Ballerina with the possible DataLoader solution.

import ballerina/graphql;
import ballerina/sql;

service on new graphql:Listener(9090) {
   resource function get authors(int[] ids) returns Author[]|error {
       var query = sql:queryConcat(`SELECT * FROM authors WHERE id IN (`, sql:arrayFlattenQuery(ids), `)`);
       stream<AuthorRow, sql:Error?> authorStream = dbClient->query(query);
       return from AuthorRow authorRow in authorStream
           select new (authorRow);
   }
}

isolated distinct service class Author {
   private final readonly & AuthorRow author;

   isolated function init(AuthorRow author) {
       self.author = author.cloneReadOnly();
   }

   isolated resource function get name() returns string {
       return self.author.name;
   }

   isolated resource function get books(DataLoader bookloader) returns Book[]|error {
       BookRow[] bookrows = check bookloader.get(self.author.id);
       return from BookRow bookRow in bookrows select new Book(bookRow);
   }

   @Loader {
       batchFuntion: bookLoaderFunction
   }
   isolated resource function get loadBooks(DataLoader bookLoader) {
       bookLoader.load(self.author.id);
   }
}

function (anydata[] ids) returns anydata[]|error bookLoaderFunction = function (anydata[] ids) returns BookRow[][]|error {
   var query = sql:queryConcat(`SELECT * FROM books WHERE id IN (`, sql:arrayFlattenQuery(<int[]>ids), `)`);
   stream<BookRow, sql:Error?> bookStream = dbClient->query(query);
   map<BookRow[]> authorsBooks = {};
   checkpanic from BookRow bookRow in bookStream
           do {
               string key = bookRow.author.toString();
               if !authorsBooks.hasKey(key) {
                   authorsBooks[key] = [];
               }
               authorsBooks.get(key).push(bookRow);
           };
   return ids.'map(key => authorsBooks.hasKey(key.toString()) ? authorsBooks.get(key.toString()) : []);
};

How it works is as follows.

  1. GraphQL listener searches for associated loader function for each resource. This is done by looking for the pattern loadXXX.
  2. Once it finds the association it creates a DataLoader using the provided batch loader function in @Loader annotation.
  3. The same DataLoader instance is made available for book and loadBook functions.
  4. During the first phase GraphQL engine as usual executes all the resource functions at the same level except for books function. Instead, in this case, the engine executes the loadBooks function.
  5. Once the first phase is completed it executes the dispatch() function of all the DataLoaders. Then in the second phase the engine executes the books resource and completes the current level. Then it moves onto executing the following level.
  6. As per @MohamedSabthar, the engine does breadth-first (level-order) traversal, so at each level, it should repeat 4 and 5.

Following is the type definition for the @Loader

public type LoaderRecord record {
   string id?; // if not provided, Listener will generate one
   function (anydata[] ids) returns anydata[]|error batchFuntion;
};
public annotation LoaderRecord Loader on function;

Following is the type definition and the class for DataLoader.

type DataLoader object {
   isolated function load(anydata id);
   isolated function get(anydata id, typedesc<any> t = <>) returns t|error;
   isolated function dispatch() returns error?;
};

isolated class DefaultDataLoader {
   *DataLoader;

   private final anydata[] ids = [];
   private anydata[] loaderResult = [];
   private final (isolated function (anydata[] ids) returns anydata[]|error) loaderFunction;

   isolated function init(isolated function (anydata[] ids) returns anydata[]|error loadFunction) {
       self.loaderFunction = loadFunction;
   }

   isolated function load(anydata id) {
       lock {
           self.ids.push(id.clone());
       }
   }

   // using the loadedResult array of arrays, return the result for the given id
   isolated function get(anydata id, typedesc<any> t = <>) returns t|error = external;

   isolated function dispatch() returns error? {
       lock {
           self.loaderResult = check self.loaderFunction(self.ids.clone());
           // implement rest of the logic
       }
   }
}

The same code can be found in Github repo gql-loader-samples. We are considering of doing a PoC to see if it actually works.

MohamedSabthar commented 1 year ago

We implemented the above suggestion as a proof of concept (POC) with some minor modifications, and it works! We were able to solve the problem using the below mentioned algorithm, where we break the query into subproblems and then construct the value for the query by solving the subproblems.

Here is a tree diagram that demonstrates how we solve the problem at a high level: 16334

Here is the high-level algorithm:

  1. For each field in the query, the GraphQL engine searches for the associated resource function.
  2. If there is a resource function with the pattern loadXXX (where XXX is the field name), then the engine:
    • Creates a DataLoader using the provided batch loader function in the @Loader annotation.
    • Makes the same DataLoader instance available for the XXX and loadXXX functions.
    • Executes the loadXXX resource method and creates a placeholder value for that field.
  3. Otherwise, the engine executes the corresponding XXX resource function for that field.
  4. Once the above steps are complete, the engine generates a partial value tree with placeholders.
  5. Next, the engine executes the dispatch() function of all created DataLoaders.
  6. Now, for each non-resolved field (placeholder) in the partial value tree:
    • Execute the corresponding resource function (XXX).
    • Obtain the resolved value and replace the placeholder with the resolved value.
    • The resolved value may again be a non-resolved field (placeholder), if so, repeat steps 1-7.
  7. Return the fully constructed value tree.

@jclark We can have a call if you have any suggestions for improvements or alternate approaches ?