funmaker / Hybooru

Hydrus-based booru-styled imageboard in React
https://booru.funmaker.moe/
MIT License
95 stars 15 forks source link

Rework the Database #41

Closed ReconVirus closed 3 months ago

ReconVirus commented 3 months ago

In part of the ongoing development process, I propose reworking the database import process to improve efficiency and maintain data integrity. Currently, users are required to delete the entire dataset before sending a rebuilding, request which can be time-consuming. Instead, Im sure alot of end users would recommend implementing an incremental update approach:

  1. Incremental Updates: Modify the import process to identify changes since the last import. Only new or updated data should be processed, minimizing the need for a full reimport.

  2. Selective Deletion: When necessary, selectively delete outdated records based on the changes detected during the import. This approach ensures that existing data remains intact while accommodating necessary updates.

Im sure this request has been brought to your attention , so i figure i would post this as a issue(request).

funmaker commented 3 months ago

Incremental database updates are not in the scope for me. It would greatly increase complexity of the import process. I know that it might be inconvenient for people with larger databases that get updated frequently, however for my own personal usecases this is not an issue.

However since many people asked for it, I will leave this issue open in case someone wants to try implementing it and/or discuss possible solutions.

ReconVirus commented 3 months ago

...discuss possible solutions.

My best guess would be to add something like

    // Incremental Updates
    const lastImportTime = await this.getLastImportTime(); // Retrieve the last import time from the database
    await this.updateLastImportTime(Date.now()); // Update the last import time after a successful import

    // Selective Deletion
    const outdatedRecords = await this.getOutdatedRecords(lastImportTime); // Retrieve outdated records based on the last import time
    await this.deleteOutdatedRecords(outdatedRecords); // Delete outdated records

which can be stored in the postgress datebase for future reference

CREATE TABLE import_status (
  id SERIAL PRIMARY KEY,
  last_import_time TIMESTAMP DEFAULT NOW()
);

an alternative to this might just be using hydrus version number seeing how you're already using that in your index.ts

const { version } = hydrus.prepare('SELECT version FROM version;').get();
    if(version < MIN_HYDRUS_VER) throw new Error(`Unsupported Hydrus version(min: v${MIN_HYDRUS_VER}, current: v${version}), Update Hydrus to the newest version.`);

which could be a safe approach to allowing the user to import changes between versions without the need of a full on rebuild.

As far as me actually implementing this... I honestly wouldnt know where to start .

i feel like i was pretty cloest with trying something like this on the import.ts

import { Writable } from "stream";
import { Database, Statement } from "better-sqlite3";
import { PoolClient } from "pg";
import copy from "pg-copy-streams";
import configs from "../configs";
import { printProgress } from "./pretty";

export abstract class Import {
  constructor(protected hydrus: Database, protected postgres: PoolClient, protected lastImportTime: Date) {}

  abstract display: string;
  batchSizeMul = 1;
  initialKey: any[] = [-1];
  useTemp = false;

  abstract outputTable: string;
  abstract totalQuery(): string;
  abstract inputQuery(): string;
  abstract outputQuery(table: string): string;

  private totalCount: number | null = null;

  total() {
    if(this.totalCount === null) this.totalCount = this.hydrus.prepare(this.totalQuery()).raw().get()[0];
    return this.totalCount!;
  }

  resetTotal(total: null | number = null) {
    this.totalCount = total;
  }

  async importBatch(lastKey: any[], limit: number, input: Statement, output: Writable, lastImportTime: Date): Promise<any[] | null> {
    const rows = input.all(...lastKey, limit);

    let buf = "";
    for(const row of rows) {
      buf += row.pop();
    }
    output.write(buf);

    if(rows.length > 0) return rows[rows.length - 1];
    else return null;
  }

  async start() {
    printProgress(false, this.display);

    const batchSize = Math.ceil(configs.importBatchSize * this.batchSizeMul);
    const total = this.total();
    let count = 0;

    if(total === 0) {
      printProgress(true, this.display);
      return;
    }

    printProgress([0, total], this.display);

    const outputTable = await this.beforeImport();

    const lastImportTime = await this.getLastImportTime();

    const input = this.hydrus.prepare(this.inputQuery()).raw(true);
    const output: Writable = await this.postgres.query(copy.from(this.outputQuery(outputTable)));

    let lastKey: any[] = this.initialKey;

    // eslint-disable-next-line no-constant-condition
    while(true) {
      const result = await this.importBatch(lastKey, batchSize, input, output, lastImportTime);
      if(!result) break;
      lastKey = result;
      count += batchSize;

      if(output.writableLength > output.writableHighWaterMark) {
        printProgress([count, total], this.display);
        await new Promise(res => output.once("drain", res));
      } else {
        await new Promise(res => setImmediate(res));
      }
    }

    await new Promise(res => output.end(res));

    await this.deleteOutdatedRecords();

    await this.updateLastImportTime();

    await this.afterImport();

    printProgress([total, total], this.display);
  }

  async beforeImport() {
    if(this.useTemp) {
      await this.postgres.query(`
        CREATE TEMP TABLE ${this.outputTable}_temp (LIKE ${this.outputTable});
      `);
      return `${this.outputTable}_temp`;
    }

    return this.outputTable;
  }

  async afterImport() {
    if(this.useTemp) {
      await this.postgres.query(`
        INSERT INTO ${this.outputTable} SELECT * FROM ${this.outputTable}_temp ON CONFLICT DO NOTHING;
        DROP TABLE ${this.outputTable}_temp;
      `);
    }
  }

  async getLastImportTime(): Promise<Date> {
    const lastImportTime = await this.postgres.query(
      `SELECT last_import_time FROM import_status WHERE id = 1`
    );

    return new Date(lastImportTime.rows[0]?.last_import_time ?? new Date());
  }

  async updateLastImportTime(): Promise<void> {
    await this.postgres.query(
      `UPDATE import_status SET last_import_time = $1 WHERE id = 1`,
      [new Date()]
    );
  }

  async getOutdatedRecords(): Promise<any[]> {
    const lastImportTime = await this.getLastImportTime();
    const outdatedRecords = await this.postgres.query(
      `SELECT * FROM ${this.outputTable} WHERE last_modified > $1`,
      [lastImportTime]
    );

    return outdatedRecords.rows
  }

  async deleteOutdatedRecords(): Promise<void> {
    const outdatedRecords = await this.getOutdatedRecords();
    const ids = outdatedRecords.map((record: any) => record.id).join(',');

    await this.postgres.query(`DELETE FROM ${this.outputTable} WHERE id IN (${ids})`);
  }
}
ReconVirus commented 3 months ago

Incremental database updates are not in the scope for me. It would greatly increase complexity of the import process. I know that it might be inconvenient for people with larger databases that get updated frequently, however for my own personal usecases this is not an issue.

ill just close this out as not planned , took me almost 2 weeks to get what you were saying 😂