TagStudioDev / TagStudio

A User-Focused Photo & File Management System
https://docs.tagstud.io/
GNU General Public License v3.0

Store file information in an SQLite database #50

Status: Open · xarvex opened this issue 6 months ago

xarvex commented 6 months ago

The current JSON schema takes up a fair amount of space for the data it stores, can take some time to process a large number of files, and is prone to write corruption. SQLite will help with both the space and performance concerns, and we can apply further mitigations to avoid the corruption mentioned.
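
For context on what such mitigations usually look like, a minimal sketch (not the project's actual code; the database file name is hypothetical) of opening the library database with settings that reduce corruption risk and help performance:

import sqlite3

def open_library_db(path: str = ".TagStudio/ts_library.sqlite") -> sqlite3.Connection:
    # Assumes the .TagStudio directory already exists.
    conn = sqlite3.connect(path)
    # Write-ahead logging tolerates interrupted writes better than the default
    # rollback journal and lets readers and writers coexist.
    conn.execute("PRAGMA journal_mode = WAL")
    # SQLite does not enforce foreign keys unless asked to.
    conn.execute("PRAGMA foreign_keys = ON")
    # A common durability/performance trade-off once WAL is enabled.
    conn.execute("PRAGMA synchronous = NORMAL")
    return conn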

Trevo525 commented 6 months ago

I was typing up an issue on this when I saw yours, so I will copy/paste it here.

Here is what I see as the layout of the database. Don't take this as me saying "do things this way"; it's more a starting point for a discussion on what the SQLite database could look like. Then people more well-versed in Python and in this project specifically can improve upon it.

  • Entity: (TABLE) These are the files and maybe more?

    • id: (INT) Auto-incrementing column.
    • filename: (VARCHAR(32)) Examples: "the-cake-is-a-lie.png", "cat.jpg", "dog.gif"
    • path: (VARCHAR(260)) I chose 260 because that's the max path length on Windows. Open to suggestions though; Linux and Mac may have more or fewer restrictions here.
    • fields: (TABLE?) This would be what points to the tags and the collations I believe. If that's the case, I would think this would need to be another table and it would use the ID to point to each tag and collation.
  • Tag: (TABLE)

    • id: (INT) Auto-incrementing column
    • name: (VARCHAR(32)?) The display name of the tag. I put 32 somewhat randomly.
    • shorthand: (VARCHAR(32)?) I'm not sure what this even is. I would love an explanation.
    • aliases: (VARCHAR(32)?) If the name is "TV Shows", aliases could include TV, TV Series, etc.
    • subtag_ids: (TABLE?) If the tag is Movies, a subtag_id could point to a tag called Shrek. Like Entity > Fields above, if this holds more than one value it will need to be its own table. One more thing I was thinking about here: there is a chance a user might accidentally create a loop, which could break something in the future. As a very basic example, if you make Shrek a subtag of Movies, it should not be possible to make Movies a subtag of Shrek, even if there are tags in between the two, like Movies/DreamWorks/Shrek/Movies (see the cycle-check sketch after the SQL below).
    • color: (VARCHAR(32)?) Color the tag will display on the UI.
  • Collation: (TABLE) I'm not 100% sure what a collation is in the context of this program. I was thinking it would be grouping things by images, audio, video, etc., but that could be a single field in the Entity table. I would love to hear more about this.

    • id: (INT) Auto-incrementing column
    • title: (VARCHAR(32)?)
    • e_ids_and_pages: (?) I'm not sure what this is for.
    • sort_order: (SMALLINT) This could be a SMALLINT in SQLite; in Python it's just an enum, and it gets translated as it's transferred between Python and SQLite (see the small enum sketch after this list).
    • cover_id: (?) Also, not sure about this.
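
A tiny sketch of the enum idea mentioned for sort_order (the member names are invented; only the int round-trip matters):

from enum import IntEnum

class SortOrder(IntEnum):
    MANUAL = 0
    ALPHABETICAL = 1
    DATE_ADDED = 2

# Writing: store int(order) in the SMALLINT column.
# Reading: SortOrder(stored_value) rebuilds the enum member in Python.
assert SortOrder(1) is SortOrder.ALPHABETICAL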

Possible SQL?

CREATE TABLE entities (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  filename VARCHAR(32),
  path VARCHAR(260),
  fields BLOB -- (?)
);
-- as a column above, as a table below
CREATE TABLE fields (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  entity_id INTEGER,
  tag_id INTEGER,
  collation_id INTEGER,
  FOREIGN KEY (entity_id) REFERENCES entities(id),
  FOREIGN KEY (tag_id) REFERENCES tags(id),
  FOREIGN KEY (collation_id) REFERENCES collations(id)
);

CREATE TABLE tags (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name VARCHAR(32),
  shorthand VARCHAR(32),
  aliases VARCHAR(32),
  color VARCHAR(32)
);

CREATE TABLE subtags (
  tag_id INTEGER,
  subtag_id INTEGER,
  PRIMARY KEY (tag_id, subtag_id),
  FOREIGN KEY (tag_id) REFERENCES tags(id),
  FOREIGN KEY (subtag_id) REFERENCES tags(id)
);

CREATE TABLE collations (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  title VARCHAR(32),
  entity_ids_and_pages BLOB, -- (?)
  sort_order SMALLINT,
  cover_id INTEGER
);
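
On the loop concern raised above, one possible guard (a sketch only, assuming the subtags table from the SQL above; the helper names are hypothetical) is a recursive CTE that refuses an insert which would make a tag its own ancestor:

import sqlite3

def would_create_cycle(conn: sqlite3.Connection, tag_id: int, subtag_id: int) -> bool:
    """Return True if making `subtag_id` a subtag of `tag_id` would create a loop,
    i.e. if `tag_id` is already reachable by walking down from `subtag_id`."""
    row = conn.execute(
        """
        WITH RECURSIVE descendants(id) AS (
            SELECT :subtag_id
            UNION
            SELECT s.subtag_id
            FROM subtags s
            JOIN descendants d ON s.tag_id = d.id
        )
        SELECT 1 FROM descendants WHERE id = :tag_id LIMIT 1
        """,
        {"subtag_id": subtag_id, "tag_id": tag_id},
    ).fetchone()
    return row is not None

def add_subtag(conn: sqlite3.Connection, tag_id: int, subtag_id: int) -> None:
    if tag_id == subtag_id or would_create_cycle(conn, tag_id, subtag_id):
        raise ValueError("adding this subtag would create a tag loop")
    conn.execute(
        "INSERT OR IGNORE INTO subtags (tag_id, subtag_id) VALUES (?, ?)",
        (tag_id, subtag_id),
    )
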
rugter100 commented 6 months ago

I have a lot of experience working with SQL in Python. I could try fully implementing this and would love to have it as a feature. I might even see if it could easily support other SQL databases like MySQL/MariaDB; that shouldn't be too hard.

xarvex commented 6 months ago

As of right now, this is something I am working on for the Rust/Tauri build of the project, which is where we are putting these kinds of ambitious breaking changes. I'm tagging @CyanVoxel for his input on whether an implementation in the current Python project would be appreciated.

Trevo525 commented 6 months ago

@rugter100 That would be great. Just don't do it before @CyanVoxel has the chance to look this over. They might see some glaring oversights :)

NiXTheDev commented 6 months ago

I have a lot of experience working with SQL in Python. I could try fully implementing this and would love to have it as a feature. I might even see if it could easily support other SQL databases like MySQL/MariaDB; that shouldn't be too hard.

I've forked the repo on my own and I'm trying to implement that too, but I can't quite find where the ts_library.json gets written to 😢 Never mind, I'm dumb; it's library.py, beside ts_core.py...

CyanVoxel commented 6 months ago

I've forked the repo on my own and I'm trying to implement that too, but I can't quite find where the ts_library.json gets written to 😢

In the user's selected library directory, inside the .TagStudio folder

NiXTheDev commented 6 months ago

@CyanVoxel @rugter100 I found a good site for building a database diagram for the transition to SQLite: https://dbdiagram.io/d

Trevo525 commented 6 months ago

Here is what I came up with.

[attached schema diagram]

Table entities {
  id INTEGER [pk, increment]
  filename VARCHAR
  path VARCHAR
}

Table fields {
  id INTEGER [pk, increment]
  entity_id INTEGER
  tag_id INTEGER
  collation_id INTEGER
}

Table tags {
  id INTEGER [pk, increment]
  name VARCHAR(32)
  shorthand VARCHAR(32)
  aliases VARCHAR(32)
  color VARCHAR(32)
}

Table subtags {
  tag_id INTEGER
  subtag_id INTEGER

  indexes {
    (tag_id, subtag_id) [pk] // composite key, matching the SQL above
  }
}

Table collations {
  id INTEGER [pk, increment]
  title VARCHAR(32)
  sort_order SMALLINT
  cover_id INTEGER
}

Ref: fields.entity_id > entities.id // many-to-one
Ref: fields.tag_id > tags.id // many-to-one
Ref: fields.collation_id > collations.id // many-to-one
Ref: subtags.tag_id > tags.id // many-to-one
Ref: subtags.subtag_id > tags.id // many-to-one
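
As an illustration of how the fields join table in this layout would get used, here is a rough sketch (assuming the table and column names diagrammed above; the helper function is hypothetical) of fetching every tag attached to one entity:

import sqlite3

def tags_for_entity(conn: sqlite3.Connection, entity_id: int) -> list[str]:
    """List the names of every tag attached to an entity via the fields table."""
    rows = conn.execute(
        """
        SELECT t.name
        FROM fields f
        JOIN tags t ON t.id = f.tag_id
        WHERE f.entity_id = ?
        ORDER BY t.name
        """,
        (entity_id,),
    ).fetchall()
    return [name for (name,) in rows]
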
parthos-dev commented 6 months ago

I want to put DuckDB into consideration. I believe it plays better with network shares than SQLite. It can also read and write plain-text formats, so that functionality could come in handy for the export feature.

Trevo525 commented 6 months ago

I want to put DuckDB into consideration. I believe it plays better with network shares than SQLite. It can also read and write plain-text formats, so that functionality could come in handy for the export feature.

I'm very interested. I read a few articles and comparisons against SQLite, but I am curious what you mean by "plays better with network shares than SQLite." I would also love to see a proof/source for that.

parthos-dev commented 6 months ago

I am mistaken, actually. I read this and a couple of other DuckDB pages and was under the impression that it does not have some of the file-locking issues over SMB (depending on the SMB server config) that SQLite had given me in the past (when trying to point a Calibre library at an SMB share). A quick test later, that turns out not to be the case.

There is still the plus of being able to read and write plain-text files, though.
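
For what it's worth, a rough sketch of that plain-text angle using DuckDB's COPY to and from CSV (the table and file names here are made up):

import duckdb

con = duckdb.connect("library.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS tags (id INTEGER, name VARCHAR)")
con.execute("INSERT INTO tags VALUES (1, 'Movies')")
# Dump the table to plain text for a hypothetical export feature...
con.execute("COPY tags TO 'tags.csv' (HEADER, DELIMITER ',')")
# ...and query the plain-text file straight back in.
print(con.execute("SELECT * FROM read_csv_auto('tags.csv')").fetchall())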

widogg commented 2 months ago

Hi everyone! I just saw the latest video from CyanVoxel on YT and learned about the SQLite migration. I'm assuming a big part of the problem you are dealing with is migrating from schema-less JSON to a schema-based, tabular DB... but perhaps you don't have to?

Have you considered using SQLite as a schema-less document DB? SQLite has support for JSON and JSONB, so perhaps you could keep entities, collations, and fields as tables but move the tags and subtags into documents, and keep the best of both worlds?

Just a thought. The project looks nice; you've done a great job :)
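
For reference, a minimal sketch of that hybrid idea using SQLite's built-in JSON1 functions (the table, columns, and sample data are invented for illustration, and the SQLite build must include JSON1, which most modern builds do):

import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tag_docs (id INTEGER PRIMARY KEY, doc TEXT NOT NULL)")
conn.execute(
    "INSERT INTO tag_docs (doc) VALUES (?)",
    (json.dumps({"name": "TV Shows", "aliases": ["TV", "TV Series"], "color": "blue"}),),
)

# Find tags whose alias list contains "TV" by expanding the JSON array.
rows = conn.execute(
    """
    SELECT json_extract(doc, '$.name')
    FROM tag_docs, json_each(doc, '$.aliases')
    WHERE json_each.value = 'TV'
    """
).fetchall()
print(rows)  # [('TV Shows',)]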

Everspace commented 2 months ago

I also just came from the video and wanted to chip in the idea of libsql, since I am also worried about support over the network. The Python SDK does still seem to be experimental, and I find Turso's branding too tightly integrated to make a self-hosted networked setup easy to find... but at a base level it would allow avoiding the corruption that is common with SQLite over the network.

The alternative is a more abstract data access layer supporting something like Postgres with a fallback to local SQLite... and dealing with the mild warts of swapping between them (SQLite is a lot less strict, IIRC).

My use case: I would like a central place to host all my references for drawing, backed by a robust networked solution I already have in place, and to keep the tag information networked in the same way.
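
As a purely hypothetical sketch of that abstraction-layer idea (not what the project actually uses; the URLs are placeholders), an engine layer such as SQLAlchemy selects the backend by connection URL while the calling code stays the same:

from sqlalchemy import create_engine, text

# Pick the backend purely by URL; both values below are examples.
LOCAL_URL = "sqlite:///.TagStudio/library.db"
NETWORK_URL = "postgresql+psycopg2://user:password@nas.local/tagstudio"

engine = create_engine(LOCAL_URL)  # or create_engine(NETWORK_URL)
with engine.connect() as conn:
    # Identical calling code either way; dialect differences (SQLite being
    # less strict about types, etc.) still need care in the schema itself.
    print(conn.execute(text("SELECT 1")).scalar())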

zierf commented 2 months ago

I had also thought about using SurrealDB for a tagging application (it also has Python bindings).

It supports both modes: running as a database server and running embedded, like SQLite.

In addition, it supports some concepts from other databases. I would prefer a schema, but in principle it can also be operated schema-less.

I also found the graph support interesting, where arbitrary connections between the data can be created and queried against later.

If necessary, some logic could also be executed directly in the database using JS/WebAssembly before the data is delivered.

If local AI models for automatic tagging or searching are added in the future, it can also function as a vector database and even embed AI models.

nino-juan commented 1 month ago

I'm not in the context of the work you guys have been doing, so this might not be relevant, but I just came from the YT video to drop a suggestion: SPARQL instead of SQL?

Have you considered semantic graphs, as in linked open data or the semantic web? Meaning: graphs lend themselves better to semantic querying than strict hierarchy trees (e.g. mario --wears--> overalls + photo --depicts--> mario = photo --depicts--> mario --wears--> overalls)... I've had some projects with people in digital humanities and they love linked open data; that's how massive repositories of digital cultural heritage tag their material (Europeana, BnF). Besides, you could use already-defined ontologies, so people could reuse established conventions for different fields (ontologies of RDF triplets) [semantic data].
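
To make the triple example concrete, a small sketch using rdflib (the vocabulary and resource names are invented; this is illustrative only, not a proposal for the project's data model):

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")  # throwaway vocabulary for the example
g = Graph()
g.add((EX.photo1, EX.depicts, EX.mario))
g.add((EX.mario, EX.wears, EX.overalls))

# "Which photos depict something that wears overalls?"
results = g.query(
    """
    PREFIX ex: <http://example.org/>
    SELECT ?photo WHERE {
        ?photo ex:depicts ?subject .
        ?subject ex:wears ex:overalls .
    }
    """
)
for (photo,) in results:
    print(photo)  # http://example.org/photo1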

yedpodtrzitko commented 1 month ago

Thanks, everyone, for the suggestions. The main issue is not coming up with ideas for which database to use; the actual problem is implementing it. If you are confident your proposed database is the best solution, please open a Pull Request implementing the changes, thanks. Maybe you can use parts of #332 as a foundation.