WiseLibs / better-sqlite3

The fastest and simplest library for SQLite3 in Node.js.

read BLOBs in chunks #600

Open andykais opened 3 years ago

andykais commented 3 years ago

Hi, I am looking into building a file server that stores data in SQLite. Some files are going to be large videos which need to be streamed to the client in chunks.

SQLite provides a C-level API for reading a byte range of a BLOB column: http://www.sqlite.org/c3ref/blob_read.html. Is there a method for accomplishing this via better-sqlite3?

Prinzhorn commented 3 years ago

I think this is somewhat similar to #241; at least the response there is relevant as well.

Your architecture would probably benefit from storing the files in a file system and only using SQLite to store the metadata. But this depends entirely on your use case and expected usage patterns. Storing (large) files as BLOBs might work well in a single-user scenario (e.g. using SQLite as an application file format) but might work horribly in a multi-user scenario (e.g. on a server with more than one user).

andykais commented 3 years ago

It's true that these are similar feature requests. This one is essentially a smaller API footprint that can remain synchronous. My use case is actually single-user: this is for a portable local web app.

Prinzhorn commented 3 years ago

In that case I agree that it would be beneficial to have something like this. Joshua said:

Your app would essentially have to use a different database connection for each network request, which is possible, but would have a drastic effect on performance.

This assumes SQLite is used in a multi-user network context. But SQLite has many more use cases that do not have this problem, e.g. https://sqlite.org/appfileformat.html

andykais commented 3 years ago

Your app would essentially have to use a different database connection for each network request

I don't actually think that applies to this feature request. I believe that was specific to streaming and the sqlite3_blob_open interface, which requires keeping long-standing connections to the database open. If we just look at exposing sqlite3_blob_read, then reading chunks of a blob should be no different than reading rows from a database.

Just to solidify this idea, here is an example of how sqlite blob chunks could be used to build a video streaming server:

const express = require('express');
const Sqlite = require('better-sqlite3');

const app = express();
const db = new Sqlite('./file-database.db');

// .pluck() makes get() return the selected value itself rather than a row object
const file_size_stmt = db.prepare('SELECT length(cast(file_content as blob)) FROM file_table WHERE id = ?').pluck();
const file_content_stmt = db.prepare('SELECT file_content FROM file_table WHERE id = ?');

app.get("/video/:file_id", function (req, res) {
  const range = req.headers.range;
  const file_id = req.params.file_id;
  const file_size = file_size_stmt.get(file_id);

  // Parse Range
  // Example: "bytes=32324-"
  const CHUNK_SIZE = 10 ** 6; // 1MB
  const start = parseInt(range.replace(/\D/g, ""));
  const end = Math.min(start + CHUNK_SIZE, file_size - 1);

  const contentLength = end - start + 1;
  const headers = {
    "Content-Range": `bytes ${start}-${end}/${videoSize}`,
    "Accept-Ranges": "bytes",
    "Content-Length": contentLength,
    "Content-Type": "video/mp4",
  };
  // HTTP Status 206 for Partial Content
  res.writeHead(206, headers);

  // retrieve data in a synchronous manner, no open connections required
  const file_content = file_content_stmt.get_chunk([file_id], { start, end }); // mocking out an api function here

  // actually send the data to the client
  res.end(file_content);
});

Prinzhorn commented 3 years ago

I see, you want to do a single sqlite3_blob_read without consuming the whole thing in a single event loop step. I thought you wanted to read the whole blob chunk by chunk to avoid running out of memory. Which in theory you could do using the proposed API, but then you basically have streaming. This API still has the problem of potentially reading corrupt data, but that argument applies to fs as well. It's your own responsibility to make sure that doesn't happen, e.g. by having immutable BLOBs.

I kind of like this idea, because it essentially gives userland a low-level API to implement streaming (reading) on top of, and moves the responsibility outside of better-sqlite3.

Not sure how well random access via sqlite3_blob_read works but I assume the API is designed for that.

Prinzhorn commented 3 years ago

Wait a second, can't you just SELECT substr(my_blob, ...) to do the same thing and get a slice of the buffer? Is there any downside? Not sure if this would require SQLite to do something other than sqlite3_blob_read.
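
A minimal sketch of that idea with better-sqlite3 (the table and column names here are assumptions for illustration; note that substr() indices are 1-based and, for BLOBs, refer to bytes):

const Sqlite = require('better-sqlite3');
const db = new Sqlite('./file-database.db');

// .pluck() makes get() return the selected value (a Buffer) instead of a row object
const slice_stmt = db
  .prepare('SELECT substr(file_content, ?, ?) AS chunk FROM file_table WHERE id = ?')
  .pluck();

// read `length` bytes starting at zero-based byte offset `start`
function readSlice(file_id, start, length) {
  return slice_stmt.get(start + 1, length, file_id);
}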

JoshuaWise commented 3 years ago

I'm not opposed to supporting an incremental BLOB API, but I would want to expose it in a general-purpose way that covers all use cases, rather than just this one. And that is no easy task, so it might not happen for a while.

The reasons why a simple sqlite3_blob_read might not cover all use cases:

@Prinzhorn's SELECT substr(my_blob, ...) might work, but I'm not sure if SQLite3 is smart enough to not load the entire blob into memory. That's worth investigating.
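
One way to investigate, sketched under a few assumptions (an in-memory database, a made-up file_table, and timing as only a rough proxy for how much of the BLOB actually gets read), would be to compare a small slice near the start, a small slice near the end, and the whole BLOB:

const crypto = require('crypto');
const Sqlite = require('better-sqlite3');

const db = new Sqlite(':memory:');
db.exec('CREATE TABLE file_table (id INTEGER PRIMARY KEY, file_content BLOB NOT NULL)');
db.prepare('INSERT INTO file_table (file_content) VALUES (?)')
  .run(crypto.randomBytes(64 * 1024 * 1024)); // one 64MB blob

const slice_stmt = db.prepare('SELECT substr(file_content, ?, ?) FROM file_table WHERE id = 1').pluck();
const whole_stmt = db.prepare('SELECT file_content FROM file_table WHERE id = 1').pluck();

const time = (label, fn) => {
  const start = process.hrtime.bigint();
  fn();
  console.log(label, Number(process.hrtime.bigint() - start) / 1e6, 'ms');
};

time('1KB slice near the start:', () => slice_stmt.get(1, 1024));
time('1KB slice near the end:  ', () => slice_stmt.get(64 * 1024 * 1024 - 1023, 1024));
time('entire 64MB blob:        ', () => whole_stmt.get());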

If that doesn't work, you can implement your desired functionality using a workaround that stores the chunks of a file as distinct rows:

const CHUNK_SIZE = 10 ** 6; // 1MB

const insertChunk = db.prepare('INSERT INTO chunks (filename, start, end, data) VALUES (?, ?, ?, ?)');
const selectChunksInRange = db.prepare('SELECT * FROM chunks WHERE filename = @filename AND start < @end AND end > @start ORDER BY start');

const insertFile = db.transaction((filename, buffer) => {
  for (let offset = 0; offset < buffer.length; offset += CHUNK_SIZE) {
    const offsetEnd = offset + CHUNK_SIZE;
    insertChunk.run(filename, offset, offsetEnd, buffer.slice(offset, offsetEnd));
  }
});

const selectRange = (filename, start, end) => {
  const chunks = selectChunksInRange.all({ filename, start, end });
  const buffers = chunks.map(x => x.data);
  const totalBuffer = buffers.length > 1 ? Buffer.concat(buffers) : buffers[0];
  // trim the partial chunks at either end so only the bytes in [start, end) are returned
  const offset = start - chunks[0].start;
  const offsetEnd = totalBuffer.length - (chunks[chunks.length - 1].end - end);
  return totalBuffer.slice(offset, offsetEnd);
};
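
A hypothetical usage sketch of the workaround above (it assumes the db, CHUNK_SIZE, insertFile, and selectRange definitions are in scope, plus an Express app like in the earlier examples; the file path and table schema are illustrative, and error handling for missing or out-of-range Range headers is omitted):

const fs = require('fs');
const express = require('express');

const app = express();
db.exec('CREATE TABLE IF NOT EXISTS chunks (filename TEXT, start INTEGER, "end" INTEGER, data BLOB)');

// store a file once, split into 1MB rows
insertFile('video.mp4', fs.readFileSync('./video.mp4'));

// serve an arbitrary byte range; only the overlapping chunk rows are read
app.get('/video', (req, res) => {
  const start = parseInt(req.headers.range.replace(/\D/g, ''), 10);
  const body = selectRange('video.mp4', start, start + CHUNK_SIZE);
  res.writeHead(206, {
    'Content-Range': `bytes ${start}-${start + body.length - 1}/*`,
    'Accept-Ranges': 'bytes',
    'Content-Type': 'video/mp4',
  });
  res.end(body);
});

app.listen(8080);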

andykais commented 3 years ago

@Prinzhorn's SELECT substr(my_blob, ...) might work, but I'm not sure if SQLite3 is smart enough to not load the entire blob into memory. That's worth investigating.

It's possible that this solves my use case perfectly, but I would need to do some testing to be certain it handles this efficiently. The docs do not mention anything particularly useful about substr on BLOBs: https://sqlite.org/lang_corefunc.html#substr

If X is a BLOB then the indices refer to bytes

I was able to build a fully functioning video stream using just substr though!

const fs = require('fs')
const Sqlite = require('better-sqlite3')
const express = require('express');

const app = express();
const db = new Sqlite('../databases/media/sqlite.db')

db.exec(`
DROP TABLE IF EXISTS media_blobs;

CREATE TABLE media_blobs (
  id INTEGER PRIMARY KEY NOT NULL,
  file_content BLOB NOT NULL
);
`)

const insert_file_stmt = db.prepare('INSERT INTO media_blobs (file_content) VALUES (?)')
const file_content_stmt = db.prepare('SELECT file_content FROM media_blobs WHERE id = ?');
const file_chunk_stmt = db.prepare('SELECT substr(file_content, ?, ?) as chunk FROM media_blobs WHERE id = ?');

const static_video_content = fs.readFileSync('../static/CR226 - Felty - Movement animation.mp4')
const static_video_id  = insert_file_stmt.run(static_video_content).lastInsertRowid
const static_video_size = file_content_stmt.get(static_video_id).file_content.length;

const CHUNK_SIZE = 1024 * 1024; // 1MB

app.get('/index.html', function (req, res) {
  res.end(`
<html>
  <body>
    <h2>video:</h2>
    <video width="650" controls muted="muted" autoplay>
      <source src="/video" type="video/mp4" />
    </video>
  </body>
</html>
    `)
})

app.get("/video", function (req, res) {
  const range = req.headers.range;
  if (!range) {
    return res.status(400).send("Requires Range header");
  }

  // Parse Range
  // Example: "bytes=32324-"
  const start = parseInt(range.replace(/\D/g, ""));
  const end = Math.min(start + CHUNK_SIZE, static_video_size - 1);

  const contentLength = end - start + 1;
  const headers = {
    "Content-Range": `bytes ${start}-${end}/${static_video_size}`,
    "Accept-Ranges": "bytes",
    "Content-Length": contentLength,
    "Content-Type": "video/mp4",
  };
  // HTTP Status 206 for Partial Content
  res.writeHead(206, headers);

  const { chunk } = file_chunk_stmt.get(start + 1, contentLength, static_video_id)
  res.end(chunk)
});

app.listen(8080, () => console.log('started server at http://localhost:8080'))
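
A small sanity check for the Range handling above (assuming the example server is already listening on port 8080; run it from a separate Node process):

const http = require('http');

// request the first chunk of the video and print the partial-content headers
http.get(
  { host: 'localhost', port: 8080, path: '/video', headers: { Range: 'bytes=0-' } },
  (res) => {
    console.log(res.statusCode); // expect 206
    console.log(res.headers['content-range'], res.headers['content-length']);
    res.resume(); // discard the body
  }
);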

andykais commented 3 years ago

For those curious, I opened a thread in the SQLite forums about how to store large files, and a large number of files, in a SQLite database. The responses I got were that it is possible to emulate a file system in SQLite, but it's also not going to be very performant at scale: https://sqlite.org/forum/forumpost/c996c3d654

Some cliff notes on the important things I learned:

Millions of DB rows have an indexing overhead

you could chunk the files, storing 2 GiB or less per chunk. I'd actually suggest something more on the order of 1 MiB. That's enough to saturate any network buffers, so that you're sure to have the second megabyte of data up in RAM before you finish sending the first. Once you reach that point, there's no point making the chunks bigger, if the point is to serve them over the network

Although a BLOB value can be quite large, I don't think SQLite is capable (I'd be happy to be corrected!) of super-efficiently seeking to a random offset within it. That is, substr() or sqlite3_blob_read() with a random offset k is liable to take O(k) time (although it may be faster than that in some cases). This is because of how SQLite internally splits the blob into a linked-list of fixed-size pages in the database file, which it has to traverse in order to find the desired offset

internalfx commented 3 years ago

@andykais You should take a look at a project I built a few years ago.

It's using sequelize as the ORM, but you could easily adapt the code to better-sqlite3.

https://git.internalfx.com/internalfx/sqlgrid

I hope that is helpful...