DB migration on BIG databases will take several days...

Tzahi12345 / YoutubeDL-Material

Self-hosted YouTube downloader built on Material Design

MIT License

2.65k stars 276 forks source link

DB migration on BIG databases will take several days... #295

Closed GlassedSilver closed 3 years ago

GlassedSilver commented 3 years ago

I understand that having 35k entries is gonna take a while, but is it really efficient to completely re-write file to disk in 4KB steps?

I take it every single item gets individually saved by saving the entire new db file anew? (idk, that's how it looks like if you inspect the file in Windows Explorer going from old size to 0kb to new size in 4KB increments.

Also I noticed that the file doesn't get new writes anymore after I accidentally updated the docker container before the migration was done. I can tell by only having 30k out of an expected 35k items in the file.

Another thing I wonder is... Since we have to expect instances growing in size quite quickly... wouldn't the flexibility of a relational database (relations could become big like "is a second part video to video ID xyz" and stuff like that with the better scalability make a lot of sense for this application at this point? Also mass-adding fields would not require rewriting x amounts of chunks of text to one file where x = amount of items. This is, as I'm currently witnessing, a scalability nightmare.

For debugging info: my db.json ends with this info:

  "files_to_db_migration_complete": false,
  "categories": [],
  "simplified_db_migration_complete": true

So it knows the migration is not done yet, but it has done a simplified migration completely? (what's the difference here btw?)

Tzahi12345 commented 3 years ago

I'm not sure why it's doing it in 4 KB increments, that's quite weird/annoying. There's a relatively easy fix to this, but I'll go into why we don't use a more efficient database first.

In short, I began writing this in 2017 when my coding abilities were subpar to say the least. I didn't really know how to code in a production environment (I do now), and so when I look back at some of the legacy code I sometimes cringe. It's also the reason why the backend is mostly in one massive 3 thousand line file. Migration wouldn't be so hard, I'm just a little lazy :P

There's a short term fix here and a long term fix. The long term fix is moving to a new database system. I'll put a lot more thought into this but let's do the short term fix first, which is getting your db file properly migrated, so I'll explain what changed:

Previously files and playlists in the DB were separated by type:

"files": {
    "audio": [
        <audio files are here>
    ],
    "video": [
        <video files are here>
    ]
},

Now they're combined into one big list (in javascript these are called arrays):

"files": [
    <all your files are here>
],

Previously, any time we wanted to do modify or retrieve a file, we had to know its type which was annoying and unnecessary. This way, we don't, and each individual file already had a type field (isAudio) so we're using that now. Playlists had the same structure, and so they are simplified as well.

To fix your DB, open your old and new db and copy your audio and video files over into the new structure (delete the ones currently there to avoid duplicates). Make sure you take care of your playlists too. This should be a lot faster than in 4KB increments, I have no idea why it's doing that in the first place anyways.

Before starting the server, set the simplified_db_migration_complete to false. The files will be recognized as already migrated, and it will add new metadata fields instead which hopefully won't take too long.

If this is too daunting, no worries! I'll gladly do it for you :) just find my email in my profile and shoot your db file over.

I know none of this is ideal, perhaps I should create a super long db anyways so I can test this at scale, because maybe I would've discovered a better way to do this.

Also, not sure why it says the migration is done when it didn't complete, maybe after failing on the first run it skipped the expensive migration step on the second run and marked it as complete.

Regardless, I'll look at some relational dbs we can use instead, maybe we'll just have to move to mongodb.

GlassedSilver commented 3 years ago

The db files are quite large at 35k items, so yeah, if you really wanna do it, I guess automated with scripts is what you'd do given that the old json is not using linebreaks and hence is very unwieldly to handle manually, then sure.

Also I noticed the newer db is a bit larger, so I take it there's more than just copy and paste involved?

Also: I have some newer files than the backup of the old db I can refer to that got added, would those just be at the bottom of the list then? I had to take a week old backup, because there was some wonkiness with the files.

I would have an easier time to explain it if I really knew what happened in the background.

No single file has all the entries atm. I just know that I have a one week old db file that has everything that was downloaded until then. Then there's like a week's worth of new subscription downloads I guess, shouldn't be too much.

I'd gladly send them over, guess I'd stop the server until then so I don't generate orphaned downloads.

Here's a little suggestion: maybe it would be good if ytdl-m saved a sidecar file with every download with all of the info that ytdl-m has in the db at download time or shortly after.

This would have the following advantages: we could have a repair-db feature that would function with a completely empty db!

We could that way backup the metadata we have per-video. (or audio, I guess I mean file when I say video in any case) It would also mean that the meta data for the day it got downloaded onto our servers would not be the day of the reset. (if you know what I mean)

When you pick a new db system, I guess it'd be best to use something that is workable really well within the docker itself as well, please do not rely on the user having to interface with an external db. Sure, it's cool if you can set it up that way through the config file, but I think a fully self-contained solution would be the most user-friendly.

Secondly, it'd be wise to go with whatever FOSS AI technologies have the best interoperability with, since a good recommendation feature will have to interface with the DB as well at some point. Maybe I imagine that to be more limiting than it actually is. In any case, something tried and true is key.

Bonus if the db allows for something like mobile clients that could sync plays and offline files easily.

jonathanm-ur commented 3 years ago

My db.json file ballooned from 20MB to 42MB.

Also, I've been having issues with the db.json file not closing correctly when the container is stopped, leaving a 0 byte file. Overwriting with my last manually copied backup while the container is running but idle will get things back working, albeit with delays as it walks the files and finds all the new entries since my backup.

To be clear, the 0 byte thing has been happening for weeks, perhaps months. I know because the automatic backup script stops the container and takes a backup, which invariably results in a 0 byte file. Subsequent restart doesn't download anything until I restore the database so my subs and playlists are once again populated.

This is by no means a showstopper for me, I've gotten quite used to checking for new downloads, if none are present and they should be, I just stop the container, copy in my backup db.json file, and start back up.

It would be nice if the application wrote its own backup db at scheduled intervals, that way the backups would always be relatively current if something bombs out in the active db. Something like every day at 4am pause any db activity and copy db.json to db.json.bak or something like that.

Tzahi12345 commented 3 years ago

My db.json file ballooned from 20MB to 42MB.

This is because we've decided to store more data for the files, specifically the video description. JSON is quite inefficient as data storage techniques go, so we really need to switch to another database system as megabytes-large single text files are just not gonna do.

Also, I've been having issues with the db.json file not closing correctly when the container is stopped, leaving a 0 byte file. Overwriting with my last manually copied backup while the container is running but idle will get things back working, albeit with delays as it walks the files and finds all the new entries since my backup.

This is quite concerning, I've never had this issue and I'm not sure what could be causing it. It might be a simple case of corruption during the write.

It would be nice if the application wrote its own backup db at scheduled intervals, that way the backups would always be relatively current if something bombs out in the active db. Something like every day at 4am pause any db activity and copy db.json to db.json.bak or something like that.

I'm glad you're taking backups, it would be a good idea to do this ourselves as well. On a related note, I found a good alternative to lowdb which might serve us well. I'll see if I can implement something tonight, and I'll see if NeDB is viable.

jonathanm-ur commented 3 years ago

The zero byte database is easily observable in my case, just by doing a refresh of the file listing it will toggle between db.json showing the correct value for size, and a zero byte db.json when the application is working hard. It appears to me as if the app is deleting the db file on disk while it prepares to write the updated file entirely from RAM instead of updating in place.

For something as drastic as a new db engine, how much coding work would it be to allow switching back and forth? I would love to keep the human readable db as an option for manually manipulating things, but be able to use an optimized binary blob db for speed and efficiency. For example, export or import to the current db.json structure, and be able to run from either db.json or the NeDB file depending on a user configured setting, much like nextcloud can run with an internal sqlite database or connect to a mariadb instance.

Tzahi12345 commented 3 years ago

The zero byte database is easily observable in my case, just by doing a refresh of the file listing it will toggle between db.json showing the correct value for size, and a zero byte db.json when the application is working hard. It appears to me as if the app is deleting the db file on disk while it prepares to write the updated file entirely from RAM instead of updating in place.

Haven't observed this directly but windows explorer is notoriously bad at updating itself even with the refresh button.

For something as drastic as a new db engine, how much coding work would it be to allow switching back and forth? I would love to keep the human readable db as an option for manually manipulating things, but be able to use an optimized binary blob db for speed and efficiency. For example, export or import to the current db.json structure, and be able to run from either db.json or the NeDB file depending on a user configured setting, much like nextcloud can run with an internal sqlite database or connect to a mariadb instance.

I was actually wondering how to keep both engines synchronized but I think your solution gets around that quite cleverly. The difficulty in converting between either one depends on how efficient I want to get.

In the JSON format, there are no "tables", but in NeDB there are (called datastores). Right now a users files are stored under each user, but in NeDB (if I was being efficient/smart about it), I would create a files table and store each file in there with a field called user_uid to make it queryable. Of course we'd also have a users table that only stores metadata info for the user. This is just an example, I'd have to do the same with playlists, subscriptions, etc.

That would make it a pain in the ass to convert back and forth. If I do it the dumb way, converting is dead simple. I'd have two tables just like I have now. users.json would become the users table with all the info under each user, and db.json would become the single table or something like that. As I mentioned earlier in this thread, I cringe looking at how I set up stuff like the database earlier, but the one thing I do enjoy is it being human-readable and I'd hate to lose that. So I don't know, maybe I should just do the daunting task of making an efficient NeDB system and creating a way to convert back and forth.

GlassedSilver commented 3 years ago

Converting between two different db systems?

Oh I'm not sure I like the sound of that.

I think much better would be a tool in ytdl-m to edit the metadata we store about files.

Another bonus: work on the files right from the GUI and you could long-term have something like mass edits. (change channel name and apply to all that match the selection for example)

Human readable IS nice, but I guess we could also just have an import and export feature that would create a json for the selection or everything.

This could also be used as a backup mechanism AND since jsons are simple text files users who are on filesystems with deduplication those backups would take up very little space even en masse.

GlassedSilver commented 3 years ago

As for the 0kb db thing... That has never been a problem for me I think...

Then again maybe my container only got stopped when no activity was going on.

(side question: if there's a subscription download going on OR a Twitch chat download (automated) and I stop the container mid-way... would ytdl-m pick up the download on next start again? I've always been wondering about that. I know that youtube-dl on its own does allow such behavior, but it has to be told to download something by whatever interfaces with it, be it a human user in one-off usage, scripts or an application like ytdl-m.)

Tzahi12345 commented 3 years ago

side question: if there's a subscription download going on OR a Twitch chat download (automated) and I stop the container mid-way... would ytdl-m pick up the download on next start again? I've always been wondering about that.

Subscriptions do start from where they left off (unless the archive logic messes with this) as we use the -c arg for the download. Twitch chat does not, it'll start from scratch.

jonathanm-ur commented 3 years ago

Also, I've been having issues with the db.json file not closing correctly when the container is stopped, leaving a 0 byte file. Overwriting with my last manually copied backup while the container is running but idle will get things back working, albeit with delays as it walks the files and finds all the new entries since my backup.

This is quite concerning, I've never had this issue and I'm not sure what could be causing it. It might be a simple case of corruption during the write.

Just following up on this, the automatic backup (CA Backup) finally got a full uncorrupted db file this past run, 3 January.

Something apparently changed the better, as this is the first time in a long time that the CA Backup archive included anything but a 0 byte db.json file.

jonathanm-ur commented 3 years ago

I captured a refresh showing what I mean with the 0 byte thing. It will still lose my database if the container is shut down while the db.json file is being written, and I don't always have control over that timing. The good thing is, making a copy while the container is running but idle works just fine, if the copy is full size I can be fairly sure it will be a good backup.

After posting, I see that gif is VERY irritating to look at. I will remove it in a day or so.

GlassedSilver commented 3 years ago

Just lost my entire db and apparently my subscription list as well. :/

Well, last Wednesday when I had to restart my server a few times.

I guess if you play whack a mole often enough the mole is hit eventually.

This is... an issue to say the least. I would STRONGLY advise moving to a db server like Postgres, because unclean shutdowns of the application are eh...

A db in a container shutting down uncleanly is apparently handled better than the file that kinda "isn't there" for some moments from time to time.

jonathanm-ur commented 3 years ago

Just lost my entire db and apparently my subscription list as well. :/

Yep. I've resigned myself to a new way of utilizing this app, because my db gets wiped on a semi-weekly basis. I requested a change in the way subs are handled, because the file list db isn't as big of a deal to rebuild, but was told it's not an option. https://github.com/Tzahi12345/YoutubeDL-Material/issues/350

I recommend taking daily backups of the db after shutting down the container. However, don't stop the container using Unraid's stop function, if there is activity the stop button doesn't allow enough time for things to settle and you will end up with a zero db.

You must use something similar to this, which allows the container 1 minute to get it's affairs in order. I haven't had an issue with a simple stop causing corruption ever since I started doing it this way. #!/bin/bash /usr/bin/docker stop --time=60 youtube-dl-material

However, I'm still having issues with the container crashing and causing corruption if I have too many subs active (not paused) at once. I can safely allow 2 or 3 subs to run at a time. So, my new workflow is every few days pause all subs, stop the container, take a backup of the db, start the container, unpause a couple subs, let them catch up, pause them, unpause the next batch, etc, etc. That way if I get a crash my latest db has all my subs in a paused state, so I can do whatever maintenance I need before turning the app loose again.

GlassedSilver commented 3 years ago

Just lost my entire db and apparently my subscription list as well. :/

Yep. I've resigned myself to a new way of utilizing this app, because my db gets wiped on a semi-weekly basis. I requested a change in the way subs are handled, because the file list db isn't as big of a deal to rebuild, but was told it's not an option.

350

I recommend taking daily backups of the db after shutting down the container. However, don't stop the container using Unraid's stop function, if there is activity the stop button doesn't allow enough time for things to settle and you will end up with a zero db.

You must use something similar to this, which allows the container 1 minute to get it's affairs in order. I haven't had an issue with a simple stop causing corruption ever since I started doing it this way. #!/bin/bash /usr/bin/docker stop --time=60 youtube-dl-material

However, I'm still having issues with the container crashing and causing corruption if I have too many subs active (not paused) at once. I can safely allow 2 or 3 subs to run at a time. So, my new workflow is every few days pause all subs, stop the container, take a backup of the db, start the container, unpause a couple subs, let them catch up, pause them, unpause the next batch, etc, etc. That way if I get a crash my latest db has all my subs in a paused state, so I can do whatever maintenance I need before turning the app loose again.

Holy shit... I wonder if changing the Docker Stop Timeout would help though?

10 always seemed a bit OVERLY excessively aggressive to me. I think I'll change it to 10 minutes, because that should really help in cases when the CPU load may also be overly saturated from other tasks and I have to blindly issue shutdown commands from the terminal and stuff like that.

Ideally it would always ask for confirmation for that though... Like ask after 10 seconds sure and otherwise wait for a long time.

Also, I think you should be able to set different values for different containers. A database getting axed after 10 seconds is another league than killing a frontend-container that's separate from its database IMHO.

Maybe some feedback for the unRAID team. ponders

GlassedSilver commented 3 years ago

Another case for a proper db would be (especially one that has a BIG user base, so it's not "evolving" (introducing breaking changes we need to keep an eye on too much) too often):

I could inspect it MUCH better than a dozen of megabytes and ever-growing JSON file in VS Code. Seriously... I don't even attempt to understand what caused #369 because of this.

MariaDB, MySQL, PostgreSQL, ... anything, please! Especially as you could optimize operations by keeping the db on a dedicated db server if you were so included to do so.

Just a lot of upsides if you ask me.

GlassedSilver commented 3 years ago

The zero kb issue is absolutely in full swing now for me. For the second time in a row I have to reinstate an older db backup and then let it slowly re-index anything not in it.

Tzahi12345 commented 3 years ago

Good news @GlassedSilver, I've spent the last several weeks porting over all the code to support MongoDB, as well as a local DB solution which will be the default. I'll let you know when this PR is merged but it should fix all your issues and make your instance wayyy quicker. Took a while and involved a lot of cleanup but it's worth it!

I'm going to include it in my concurrent streams PR (#378). I'll comment here when that's merged. Want to do some final testing and update the PR (since it's way bigger now), and it does involve another DB migration but it should be the last for a while.

GlassedSilver commented 3 years ago

Wow, that sounds amazing! :O

One question though and I reaaaaally don't wanna be petty and hey, you did the coding and planning, and I'm sure I'm just not thinking of stuff you thought of, but what led you to go with MongoDB if I may ask?

IMHO this application's use case is a very good example of the benefits of having a relational database. For example: every video can be related to a creator, some even to playlists, those then can be assigned to creators or even local users, comment sections can be linked to videos and loaded or queried separately, permissions can be assigned more easily even I think, but I might be off on this one, maybe depends on how you actually use permissions.

I once looked into MongoDB myself, because I liked that it was lauded for being modern and fast, but then eventually decided to settle on "boring" MySQL, MariaDB and Postgres where ever possible in my apps when they let me pick, one of the reasons being relational structure and also just getting the biggest support base.

Please enlighten me, because I really REALLY think we have a prime example of a relational database use case here.

In any case, I cherish the news and really look forward to any improvements, I'm forever grateful for this application, it's already saved heaps of stuff for me and going through old YT playlists of mine and seeing all those "Deleted Video" entries makes me feel utterly sad

Knowing this will be LOTS less of an issue in the future is INCREDIBLE. :)

Tzahi12345 commented 3 years ago

Thanks for the kind words :)

The simple answer as to why I chose MongoDB is that I don't have to do any work to define the fields within a table. I can't really speak for MariaDB but I've worked with Postgres and SQL Server. Half of this is pure laziness and the other half is that there's still a table with nested properties, so it helps maintain some level of portability with the existing code. That, and I've already messed around with Firebase and MongoDB seemed really similar.

This affects performance a bit but with indexing this can apparently be mitigated (just found this out!) As you can probably tell, I'm not super experienced with MongoDB either, just know that it plays really well with JSON-type data structures.

The way I set up the DB, it's still relational in an abstract sense -- subscription videos aren't stored in the subscription anymore, now they just have a "foreign key" called sub_id. Playlist videos are still stored as an array in a playlist object called uids, so this structure isn't universal, mostly due to the fact that playlist videos are ordered and it's a many-to-many relationship.

So to avoid any problems with a true relational DB, I just went with Mongo. I already noticed a speedup vs. using a local DB, but I'll do some actual testing and get some numbers to prove it. I'll work on the indexing stuff to make it even more performant, hopefully I can merge this in by Wednesday or so and you can let me know how it works for you. Let me know if you have any other questions! Having a second set of eyes on the backend stuff is always helpful.

GlassedSilver commented 3 years ago

I'm mighty curious let me tell you that. :D

I'm still a bit hesitant, because the nice thing about a relational database is that it basically is a bit more effort to set up initially, but afterwards really keeps itself tidy very well by having data (and relational) data integrity by design.

But you know what? As long as the model is in some form of structure and reliability goes up one could in the worst case always refactor later if it doesn't work out too well.

In any case I am so super stoked to see my beloved ytdl-m back in full swing. :)