TagStudioDev / TagStudio

A User-Focused Photo & File Management System
GNU General Public License v3.0
2.96k stars 268 forks source link

Store file information in an SQLite database #50

Open xarvex opened 2 months ago

xarvex commented 2 months ago

The current JSON schema used takes up a fair amount of space for JSON data, and can take some time to process a large number of files. As well, it is prone to write corruption. SQLite will help both space and performance concerns, and we can do further mitigations to avoid the mentioned corruption.

Trevo525 commented 2 months ago

I was typing up an issue on this when I saw yours, so I will copy/paste it here.

Here is what I see as the layout of the database. Don't take this as me saying, "do things this way" but more as a starting point for a discussion on what the SQLite database would look like. Then some more well-verse people in Python and this project specifically can improve upon this.

NiXTheDev commented 2 months ago
  • Entity: (TABLE) These are the files and maybe more?

    • id: (INT) Auto-incrementing column.
    • filename: (VARCHAR(32)) Example, "the-cake-is-a-lie.png", "cat.jpg", "dog.gif"
    • path: (VARCHAR(260)) I chose 260 because that's the max path length in windows. Open to suggestions though, Linux and Mac may have more/less restrictions here..
    • fields: (TABLE?) This would be what points to the tags and the collations I believe. If that's the case, I would think this would need to be another table and it would use the ID to point to each tag and collation.
  • Tag: (TABLE)

    • id: (INT) Auto-incrementing column
    • name: (VARCHAR(32)?) the Display name of the tag. I put 32 somewhat randomly.
    • shorthand: (VARCHAR(32)?) I'm not sure what this even is. I would love an explanation.
    • aliases: (VARCHAR(32)?) If the name is "TV Shows", aliases could include TV, TV Series, etc.
    • subtag_ids: (TABLE?) If the Tag is Movies, the subtag_id could point to a tag called Shrek. Like Entity > Fields above, if this holds more than one value, it will need to be it's own table. One more thing I was thinking about here, is that there is a chance a user might accidentally create a loop which could break something in the future. For a very basic example of what I mean, If you make Shrek a subtag of Movies, it should not be possible to make Movies a subtag of shrek. Even if there are tags in between the two like, Movies/DreamWorks/Shrek/Movies
    • color: (VARCHAR(32)?) Color the tag will display on the UI.
  • Collation: (TABLE) I'm not 100% sure what a collation is in the context of this program. I was thinking that it would be grouping things by images, audio, video, etc. But, that could be a single field in the Entity table. I would love to hear more about this.

    • id: (INT) Auto-incrementing column
    • title: (VARCHAR(32)?)
    • e_ids_and_pages: (?) I'm not sure what this is for.
    • sort_order: (SMALLINT) This could be a SMALLINT in SQLite, but in python, it's just an enum and it get's translated as it's transferred between python and SQLite.
    • cover_id: (?) Also, not sure about this.

Possible SQL?

CREATE TABLE entities (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  filename VARCHAR(32),
  path VARCHAR(260),
  fields BLOB(?)
);
-- as a collumn above, as a table below
CREATE TABLE Fields (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  entity_id INTEGER,
  tag_id INTEGER,
  collation_id INTEGER,
  FOREIGN KEY (entity_id) REFERENCES entities(id),
  FOREIGN KEY (tag_id) REFERENCES tags(id),
  FOREIGN KEY (collation_id) REFERENCES collations(id)
);

CREATE TABLE tags (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name VARCHAR(32),
  shorthand VARCHAR(32),
  aliases VARCHAR(32),
  color VARCHAR(32)
);

CREATE TABLE subtags (
  tag_id INTEGER,
  subtag_id INTEGER,
  PRIMARY KEY (tag_id, subtag_id),
  FOREIGN KEY (tag_id) REFERENCES tags(id),
  FOREIGN KEY (subtag_id) REFERENCES tags(id)
);

CREATE TABLE collations (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  title VARCHAR(32),
  entity_ids_and_pages BLOB(?),
  sort_order SMALLINT,
  cover_id INTEGER
);
rugter100 commented 2 months ago

I have a lot of experience working with SQL in python. I could try fully implementing this and would love to have it as a feature. I might even see if it could easily support other sql databases like mysql/mariadb, shouldnt be too hard

xarvex commented 2 months ago

As of right now this is something I am working on for the Rust/Tauri build of the project, where we are throwing these kind of ambitious breaking changes. I'm tagging @CyanVoxel for his input on if an implementation in the current Python project would be appreciated.

Trevo525 commented 2 months ago

@rugter100 That would be great. Just don't do it before @CyanVoxel has the chance to look this over. They might see some glaring oversights :)

NiXTheDev commented 2 months ago

I have a lot of experience working with SQL in python. I could try fully implementing this and would love to have it as a feature. I might even see if it could easily support other sql databases like mysql/mariadb, shouldnt be too hard

i've forked the repo on my own and i'm trying to implement that too, but i can't quite find where the ts_library.json gets written to 😢 nvm i'm dumb it's library.py beside ts_core.py...

CyanVoxel commented 2 months ago

i've forked the repo on my own and i'm trying to implement that too, but i can't quite find where the ts_library.json gets written to 😢

In the user's selected library directory, inside the .TagStudio folder

NiXTheDev commented 2 months ago

@CyanVoxel @rugter100 found a good site to build up a database diagram for the transition to sqlite https://dbdiagram.io/d

Trevo525 commented 2 months ago

Here is what I came up with.

Untitled (1)

TABLE entities {
  id INTEGER pk unique
  filename VARCHAR
  path VARCHAR
}

TABLE fields {
  id INTEGER pk unique
  entity_id INTEGER
  tag_id INTEGER
  collation_id INTEGER
}

TABLE tags {
  id INTEGER pk unique
  name VARCHAR(32)
  shorthand VARCHAR(32)
  aliases VARCHAR(32)
  color VARCHAR(32)
}

TABLE subtags {
  tag_id INTEGER pk unique
  subtag_id INTEGER
}

TABLE collations {
  id INTEGER pk unique
  title VARCHAR(32)
  sort_order SMALLINT
  cover_id INTEGER
}

Ref: fields.entity_id > entities.id // many-to-one
Ref: fields.tag_id > tags.id // many-to-one
Ref: fields.collation_id > collations.id // many-to-one
Ref: subtags.tag_id > tags.id // many-to-one
Ref: subtags.subtag_id > tags.id // many-to-one
parthos-dev commented 2 months ago

I want to put DuckDB into consideration. It believe it plays better with network shares than SQLite. It can read and write from plain text formats too so that functionality could come in handy for the export feature.

Trevo525 commented 2 months ago

I want to put DuckDB into consideration. It believe it plays better with network shares than SQLite. It can read and write from plain text formats too so that functionality could come in handy for the export feature.

I'm very interested. I read a few articles and comparisons against SQLite. But, I am curious what you mean by "plays better with network shares than SQLite." Also, I would love to see any proof/source of that.

parthos-dev commented 1 month ago

I am mistaken actually. I read this and a couple of other pages of DuckDB and was under the impression that it does not have some of the issues of file locking over SMB (Depending on the SMB server config) that SQLite had given me in the past (Trying to point a Calibre library to a SMB share). A quick test later, it is not the case.

There is still a plus of the ability to R/W to and from plain text files though.