Closed addisonamiri closed 4 years ago
Filebot + ACM with some config is what you need
Thanks for starting the discussion! I've always thought expanding to cover other media types would be an interesting direction.
One major roadblock is that many video file types, last I checked, don't have standard metadata fields. And in any case, it's not common to find tagged video files even with non-standard formats, so we'd need to rely much more on filename heuristics. See also #1160.
I can see this going one of two ways:
`dbcore` module contains the "soul" of beets without knowing anything about music.
Your second point is interesting, @sampsyo. Could we perhaps move `dbcore` into its own repository and pip package?
Yep, that was always the long-term goal with the `dbcore` refactor. If there's sufficient momentum behind a "beets derivative" for video, that might be the push we need to finish that up.
Filebot has everything you are looking for and more - it integrates perfectly for me between flexget and kodi.
I can't speak to video, but I just wanted to put in a vote for a beets ebook organizer. There really is not a KISS tool for ebooks. Calibre is, as far as I know, the only Linux software that scrapes and organizes ebooks; however, it's a train wreck, modifying files without warning the user, forcing exactly one organizational scheme, etc. Nothing like the elegance of beets!
I would be way more interested in being able to use beets to rename/organize video files and generate nfo files for kodi. Solutions like filebot are great, but anything that requires java to run and frequently breaks on updates does not interest me in the least.
I'd try to integrate https://github.com/guessit-io/guessit into beets. It does a fantastic job of guessing information based on filename, and the output would probably make a good starting search on tmdb or tvdb. I see the workflow as better suited to a plugin (but I'm not a dev).
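As a rough illustration of the filename-based guessing discussed here, the sketch below pulls a title and a year out of a release-style filename using only the standard library. It is not guessit and handles far fewer cases; the helper name `guess_from_filename` and the regular expression are assumptions made up for this example.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical helper, not part of beets or guessit: a 4-digit year
# (1900-2099) that is set off by separators or brackets.
_YEAR = re.compile(r"[\s._(\[-]((?:19|20)\d{2})(?=$|[\s._)\]-])")


def guess_from_filename(path: str) -> Tuple[str, Optional[int]]:
    """Return (title, year) guessed from a video filename."""
    stem = Path(path).stem
    match = _YEAR.search(stem)
    year = int(match.group(1)) if match else None
    # Everything before the year (or the whole stem) is treated as the title.
    raw_title = stem[: match.start()] if match else stem
    title = re.sub(r"[._]+", " ", raw_title).strip(" -")
    return title, year
```

The guessed title and year could then seed a search against an online database, which is the role suggested for guessit above.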
Man.... with the inline plugin.... this would be so awesome.
Until beets manages video files, I wrote flinck, whose goal is to create symlinks to your movie files and organize them by year, genre, or whatever you like.
beets users should not feel disoriented, as it reuses some beets concepts: heavy use of `confit` to configure the tree hierarchy, buckets to reuse user-created folders, and a Google backend to guess a movie's original name from another country's release name.
But it's not the Swiss Army knife for video files that beets is for audio: no database, no renaming, just symlinking really. I released a new version yesterday and would appreciate some feedback.
@sampsyo I think I'd prefer if video (and eBook?) management was kept in beets rather than split off into separate (even if related) projects. I already use beets for some (music) video management, so it's not hard for me to envision. Of course, this might make some feel beets has become bloated, so maybe it'd be possible to "modularise" `beet` to support various media types depending on what's installed? (E.g., I'd imagine I'd like to organise eBooks and audiobooks together, and personal videos, movies/TV series, random YouTube downloads, and music videos together.)
@Freso I concur that opting for modularity would be the way to go. The ability to scrape new media types seems like it belongs in a plugin, prodding the community to create the plugins they desire. Sadly, I realize, eBooks will probably not be the community's highest priority... Not sure if others are aware, but MediaElch works quite well for video. Perhaps some of the scraping work can be borrowed from that.
Well, the upshot is that modularity is a good idea for lots of reasons! Even boring ones like maintainability that have nothing to do with video or ebooks. So I'm all for it, especially if it helps us build tools that feel engineered for different use cases.
Somewhat related, but MusicBrainz actually includes video tracks (because some albums include bonus promotional videos and the like). As an initial step toward potential video library support, perhaps MusicBrainz video track support could be added to beets?
Oh my, I'd love it if beets could also take care of TV shows and/or movies. Sickbeard, SickGear, SickRage, MediaElch, FileBot etc. are all way too heavy and complex, especially if you just want to point the program to a directory with a tv series and have it rename all episodes appropriately.
I was thinking of Beets as I experimented with https://github.com/perkeep/perkeep -- the intended use cases and workflows are quite different, but having a central store with a flexible metadata system is something both of these systems share.
Perkeep could serve as an example of how to provide a number of modular "importers" which produce metadata in a single database. @sampsyo -- how modular is the import flow currently, and how hard would it be to extend to arbitrary file types?
You're right; there is a certain similarity in philosophies there! I'd be interested to explore this more deeply.
To answer your direct question, the importer pipeline is reasonably reusable, although there is a fair amount of music-specific logic mixed in there: mostly surrounding albums that group together individual tracks.
One thing that is very abstract, however, is the database layer. Take a look at our `dbcore` package, which does everything having to do with items, their fields, and queries over them. That actually seems like a good point to overlap with Perkeep.
@sampsyo -- would it be helpful to track this effort as a separate bug? Something like "Modularize the importer and support file types without inline metadata"? Or do you feel this is outside of the scope of what should be supported by the Beets project?
I did take a look at dbcore! If one wanted to create a separate tool for importing, setting metadata, and querying over arbitrary local files, it seems like this would be a great place to start. Do you have strong feelings on whether that is the best route?
Sure; a separate thread sounds good! I guess the way I'd put the project is: let's make the `importer` module generic and reusable in the same way that `dbcore` is. The idea would be to factor out the common logic from the music-specific stuff, without breaking beets too much in the process. :smiley:
With that work in place, I can imagine it going one of two ways: either reuse the same components (dbcore + this new importer module) to make a beets-like tool for video, or just extend beets for other media types in place. I don't have a strong feeling about which of those is the better idea, but both seem worth exploring.
@sampsyo, it looks like the majority of the changes would need to go into beets/library.py or beets/mediafile.py -- LibModel and Library are mostly generic enough and beets/importer.py doesn't seem to know too much about the individual models, but Item and Album are very audio specific.
Video items might overlap enough with the fields in Item that it makes sense to support them in beets/mediafile.py, but generic files like text documents, binaries, source files, etc. wouldn't fit very well.
One approach would be to add distinct model/database types to beets/library.py for file types which don't have the typical music-associated metadata: LibModel/FileItem (any file), MediaItem (common media-related fields), VideoItem, AudioItem, AudioAlbum, ImageItem, TextItem, etc.
However, the ideal outcome might be to allow defining different media types as plugins so that the end user could choose which sorts of files they want to have in their library.
Naively, I could imagine something like:
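    from typing import Dict
    import beets.dbcore.types
    from beets.plugins import BeetsPlugin

    class VideoModelPlugin(BeetsPlugin):
        # Hypothetical plugin hooks; beets does not define these methods.
        def supported_format(self, file_path: str, magic: str) -> bool:
            return magic in ['video/mp4']
        def attributes(self) -> Dict[str, beets.dbcore.types.Type]:
            return {'director': beets.dbcore.types.STRING}
        def parse(self, file_path: str) -> Dict[str, str]:
            meta = _LoadMetadata(file_path)  # placeholder metadata loader, not defined here
            return {'director': meta['director']}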
Does this seem like a reasonable approach?
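To make the `supported_format` idea a little more concrete, here is a minimal, self-contained sketch of how an importer might pick a handler for a file. It uses the standard-library `mimetypes` module (extension-based guessing only) in place of real content sniffing, and `MediaTypeHandler`, `VideoHandler`, `ImageHandler`, and `pick_handler` are hypothetical names, not beets APIs.

```python
import mimetypes
from typing import List, Optional


class MediaTypeHandler:
    """Hypothetical stand-in for a model-providing plugin (not a beets class)."""

    name = "generic"
    mime_types: List[str] = []

    def supported_format(self, file_path: str, magic: str) -> bool:
        return magic in self.mime_types


class VideoHandler(MediaTypeHandler):
    name = "video"
    mime_types = ["video/mp4"]


class ImageHandler(MediaTypeHandler):
    name = "image"
    mime_types = ["image/jpeg"]


def pick_handler(
    file_path: str, handlers: List[MediaTypeHandler]
) -> Optional[MediaTypeHandler]:
    # Guess the type from the extension only; a real importer would
    # probably inspect the file contents instead.
    magic, _encoding = mimetypes.guess_type(file_path)
    if magic is None:
        return None
    for handler in handlers:
        if handler.supported_format(file_path, magic):
            return handler
    return None
```

With this shape, the end user's set of enabled plugins decides which files are recognized at all, which matches the "choose which sorts of files you want in your library" goal.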
Yeah, that would be cool! I like the idea of model types provided by plugins. An inconvenient piece to deal with will be creating and destroying SQLite tables that back these models. I’d be interested to look into a more detailed design for how that would work.
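One possible shape for the table management mentioned here, sketched with the standard-library `sqlite3` module. This is an assumption about how it could work, not how dbcore does it: the field mapping format and the helpers `create_model_table` and `drop_model_table` are invented for illustration.

```python
import sqlite3
from typing import Dict

# Column types this sketch accepts; anything else is rejected.
ALLOWED_TYPES = {"TEXT", "INTEGER", "REAL", "BLOB"}


def _ident(name: str) -> str:
    # Table and column names cannot be bound as SQL parameters,
    # so validate them before interpolating.
    if not name.isidentifier():
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def create_model_table(conn: sqlite3.Connection, table: str, fields: Dict[str, str]) -> None:
    """Create the backing table for a (hypothetical) plugin-provided model."""
    columns = ["id INTEGER PRIMARY KEY"]
    for column, sql_type in fields.items():
        if sql_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported column type: {sql_type!r}")
        columns.append(f"{_ident(column)} {sql_type}")
    conn.execute(f"CREATE TABLE IF NOT EXISTS {_ident(table)} ({', '.join(columns)})")


def drop_model_table(conn: sqlite3.Connection, table: str) -> None:
    """Remove the backing table, e.g. when the plugin is disabled."""
    conn.execute(f"DROP TABLE IF EXISTS {_ident(table)}")
```

Whether dropping a table when a plugin is disabled is actually desirable (it destroys user data) is one of the design questions a more detailed proposal would need to answer.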
Howdy, @sampsyo -- to prove to myself whether a tool like beets is the right one for this job, I threw together a prototype using dbcore for crawling non-music files. I've found that adding items to an on-disk (ext4) database is nearly two orders of magnitude (roughly 60x) slower than adding them to an in-memory one.
For an import with only 847 records:
:memory: 3.2s
test.db: 3m7.1s
Each file has ~10 (non-flexible) attributes. I'm setting them all with a single model.update() call (which, from a quick code perusal, seems to result in an SQL 'UPDATE' query for each attribute). I was initially setting each attribute one per expression which (due to the parenthetical above) seems to have no impact on performance.
Am I using the library incorrectly or is this the expected performance?
Wow; awesome! Except for the performance.
I'm not sure what to "expect" for performance, but that's certainly not good—maybe this would be a good lens to use for performance optimization. Would it make sense to do a little profiling? (If so, may I recommend SnakeViz to explore the data?)
Was about to open an issue but luckily found out it is already being worked on! I will see if I can help with this. I'm looking forward to cataloguing my movies and series.
@sampsyo, I took it for a spin in snakeviz. Unsurprisingly, the majority of the time is being spent in `sqlite3.Connection.commit`.
On the beets side, over 95% of the time is spent in `dbcore.Model.add`, `dbcore.Model.store`, and `dbcore.Model.__exit__`. Baseline runtime was 153 seconds.
Removing an unnecessary `store` call in the inner loop shaved about 30% from runtime. `add` already calls `store` once for each added record. Trimmed runtime down to 109 seconds.
Next improvement was setting the values for the entry at model instantiation time rather than a) instantiating with empty values, b) setting the values by attr or bulk `update`, and then c) calling `add`. Down to 76 seconds.
Next area for exploration may be supporting a bulk `add` with a single sqlite transaction. I'm not sure how much this would impact performance.
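For readers unfamiliar with why a single transaction might help: with an on-disk database, every commit generally has to wait for data to reach the disk, so one commit per record adds up. The sketch below contrasts the two patterns using plain `sqlite3` rather than dbcore; the `files` table and both function names are made up for the example.

```python
import sqlite3

SCHEMA = "CREATE TABLE IF NOT EXISTS files (id INTEGER PRIMARY KEY, path TEXT, size INTEGER)"


def add_one_by_one(conn, rows):
    # One commit per row: roughly what repeated add() calls amount to.
    for path, size in rows:
        conn.execute("INSERT INTO files (path, size) VALUES (?, ?)", (path, size))
        conn.commit()


def add_in_one_transaction(conn, rows):
    # The connection context manager commits once on success
    # and rolls back if an exception is raised inside the block.
    with conn:
        for path, size in rows:
            conn.execute("INSERT INTO files (path, size) VALUES (?, ?)", (path, size))
```

Both functions store the same rows; only the number of commits differs. A side effect of the second form is that a failure part-way through leaves no partial import behind.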
To summarize, the overall control flow now looks something like this:
    db = ExampleDatabase(db_path)
    for file_path in file_path_list:
        # Supply all values at construction time, then insert the record.
        model = ExampleModel(db, attr1=x, attr2=y, attr3=z, ...)
        model.add()
    db._connection.close()
It's worth noting that each `add` results in an `INSERT` with `DEFAULT VALUES` as well as a subsequent `UPDATE` to modify the dirty keys, even if all of the values were supplied ahead of time. This may be an area for optimization.
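The two-statement pattern described here, and a possible single-statement alternative, can be sketched with plain `sqlite3`. This is not dbcore's code; `add_two_step` and `add_one_step` are hypothetical helpers, and table and column names are assumed to come from trusted model definitions because they are interpolated into the SQL.

```python
import sqlite3


def add_two_step(conn, table, values):
    # Mirrors the behaviour described above: create an empty row, then fill it in.
    cur = conn.execute(f"INSERT INTO {table} DEFAULT VALUES")
    assignments = ", ".join(f"{key} = ?" for key in values)
    conn.execute(
        f"UPDATE {table} SET {assignments} WHERE id = ?",
        (*values.values(), cur.lastrowid),
    )
    return cur.lastrowid


def add_one_step(conn, table, values):
    # Possible optimisation: write all known values in a single INSERT.
    columns = ", ".join(values)
    placeholders = ", ".join("?" for _ in values)
    cur = conn.execute(
        f"INSERT INTO {table} ({columns}) VALUES ({placeholders})",
        tuple(values.values()),
    )
    return cur.lastrowid
```

Both helpers leave the same row in the table; the second simply issues one statement instead of two.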
Awesome work here. That's sort of good news that we can blame our very inefficient database usage rather than anything running "in Python"!
Just to help me track this: where is the inner loop that you're referring to? That's in your own client code, right? (Not in beets itself?)
To summarize potential changes from the beets side that you mentioned:
- `store` (and therefore a separate database transaction) on every model creation. I expect this would be a substantial win: even if the actual transactions themselves are pretty fast, the per-transaction overhead is probably a good chunk of the model insert cost.
- `add` to allow proper initialization with specific values (rather than using two transactions to create and then modify). This might be easier to do than the first thing and would halve the number of transactions, so it might be the right place to start.
The `store` within an inner loop was in my own code (conceptually, inside the for loop I demonstrated above). Your summary of improvements sounds right, and I agree that the latter one should be simpler to implement. I don't see an obvious way to do the first one without changing the dbcore API.
For updating/writing, the first one could probably be solved by introducing the unit of work pattern (the trade-off would be more memory to keep track of the objects). An example exists in SQLAlchemy (`Session`).
The models would have a reference to the session (which is bad, IMHO) and one would commit after all operations are done (the default would be to always commit the changes, so as not to break the API, at least initially):
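    # pseudocode
    class Model(object):
        def store(self, mode='now'):
            # Register the object with the session (the unit of work).
            self.session.add(self)
            # Default: commit immediately, preserving the current API.
            if mode == 'now':
                self.session.commit()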
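A runnable toy version of the pseudocode above, for anyone who wants to experiment with the idea. It is only a sketch under stated assumptions: the `Session` class here is a tiny hand-written stand-in (not SQLAlchemy's `Session`), the `items` table is invented, and nothing here reflects dbcore's real internals.

```python
import sqlite3


class Session:
    """Toy unit of work: collects pending rows and writes them in one commit."""

    def __init__(self, conn):
        self.conn = conn
        self.pending = []

    def add(self, model):
        self.pending.append(model)

    def commit(self):
        with self.conn:  # one transaction for everything queued so far
            self.conn.executemany(
                "INSERT INTO items (title) VALUES (?)",
                [(m.title,) for m in self.pending],
            )
        self.pending.clear()


class Model:
    def __init__(self, session, title):
        self.session = session
        self.title = title

    def store(self, mode="now"):
        self.session.add(self)
        if mode == "now":  # default keeps the commit-per-store behaviour
            self.session.commit()
```

Calling `store(mode="later")` repeatedly and then `session.commit()` once writes everything in a single transaction, at the cost of holding the pending objects in memory until then.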
"We can solve any problem by introducing an extra level of indirection." hahaha
@sampsyo, I think the right target to shoot for is that the dbcore overhead should be less than the time it takes to crawl files on the target filesystem. Do you think this is achievable with an SQL data store?
Here is a really nice article on this topic: https://stackoverflow.com/questions/1711631/improve-insert-per-second-performance-of-sqlite -- they are using the C bindings for sqlite, but I suspect many of the lessons could apply for Python.
Yeah, that seems like a reasonable goal to at least shoot for. Have you checked, for instance, what the proportion of filesystem to database time is in the optimized version of your current crawler?
@sampsyo, database time is still over 92% of total runtime. I suspect this is also a significant bottleneck when importing large music libraries.
Got it. It does seem like this should be achievable in the limit—the main impediment is figuring out the right abstractions to allow clients to express a high-performance treatment of the database.
@sampsyo -- thinking a bit more, we can do something relatively uninvasive by providing a `bulk_add` method on the Database class. Consider the following:
    db = ExampleDatabase()
    db.bulk_add(
        ExampleModel({...}),
        ExampleModel({...}),
    )
Caveat emptor: in order to realize the performance improvements of bulk operations, callers would need to explicitly opt into this use.
I threw together a quick prototype of this and I'm seeing total runtime down to less than 5s, with less than 10% of total time spent in dbcore/sqlite3.
Some notes about the prototype:
- `Model.add` and `Database.transaction` are modified to accept an optional `txn` parameter.
- `Model.add` passes the `txn` parameter through to its `Database.transaction` call.
- If `Database.transaction` is called with a `txn` param, it is returned immediately.
- `Database.bulk_add` creates a new transaction and passes it to each `Model.add` call.
- `Transaction.__enter__` as both `Database.bulk_add` and the transitive `Model.add` calls use a `with` on that object.
- `tx_stack`.
Regardless, I think this is really promising and it's now relatively clear to me that pursuing a single transaction will yield the most significant performance increase.
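The pass-through design in these notes can be sketched as a small standalone toy. This is not the prototype's actual code and not dbcore: the classes below only borrow the names `Database`, `Model`, `Transaction`, `transaction`, `bulk_add`, and `txn` from the discussion, and the nesting counter (`depth`) and `commit_count` attribute are assumptions added so the behaviour is observable.

```python
import sqlite3


class Transaction:
    """Re-entrant transaction: only the outermost `with` commits."""

    def __init__(self, db):
        self.db = db
        self.depth = 0

    def __enter__(self):
        self.depth += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self.depth -= 1
        if self.depth == 0:
            if exc_type is None:
                self.db.conn.commit()
                self.db.commit_count += 1
            else:
                self.db.conn.rollback()
        return False  # never swallow exceptions


class Database:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, title TEXT)"
        )
        self.commit_count = 0

    def transaction(self, txn=None):
        # Reuse the caller's transaction when one is supplied.
        return txn if txn is not None else Transaction(self)

    def bulk_add(self, *models):
        with self.transaction() as txn:
            for model in models:
                model.add(self, txn=txn)


class Model:
    def __init__(self, title):
        self.title = title

    def add(self, db, txn=None):
        # Nested use of a shared txn does not commit; a fresh txn does.
        with db.transaction(txn):
            db.conn.execute("INSERT INTO items (title) VALUES (?)", (self.title,))
```

Plain `add` calls still commit once each, so existing callers keep their behaviour, while `bulk_add` produces a single commit for the whole batch. That is the explicit opt-in mentioned in the caveat above.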
Yes, absolutely! A bulk insert would be a great way to do it. You could even imagine letting the `bulk_add` method accept an iterable (instead of just a list) to let new model objects get generated on the fly without materializing them all in memory.
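A minimal sketch of the iterable idea, using plain `sqlite3` instead of dbcore models: `executemany` accepts any iterable of parameter tuples, so a generator that walks the filesystem can feed it lazily. The `files` table and the `crawl` and `bulk_add` functions are hypothetical names for this example.

```python
import os
import sqlite3
from typing import Iterable, Iterator, Tuple


def crawl(root: str) -> Iterator[Tuple[str, int]]:
    # Lazily yield (path, size) pairs; nothing is collected into a list.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            yield path, os.path.getsize(path)


def bulk_add(conn: sqlite3.Connection, rows: Iterable[Tuple[str, int]]) -> None:
    # executemany() consumes the iterable as it goes; `with conn` wraps
    # all of the inserts in a single transaction.
    with conn:
        conn.executemany("INSERT INTO files (path, size) VALUES (?, ?)", rows)
```

Usage would look like `bulk_add(conn, crawl("/path/to/videos"))`, where the path is a placeholder for the directory being imported.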
This sounds awesome. Any chance you can put together a PR for closer review?
I think I'm of the team that this is outside the scope of beets, but a fork that deals with videos could be interesting. I think there's also a limited use case for this. As mentioned, there isn't really a good standard for tagging or a whole lot that's worthwhile to tag or update as time goes on. The biggest advantage I suppose would be the database querying, but having an application like beets just to provide a cli query for your videos seems overkill when the common uses for a query could be implemented through other commandline utilities by parsing an organized video directory, or with GUI applications.
What I would recommend for the majority of people is:
If there's still interest, I think the best route, as suggested, is to create a fork of beets focused on videos.
I'm going to close this since this doesn't seem like something beets should implement, but feel free to continue discussion here or on discourse.
I was wondering if there was any interest in adding video library support to beets. I really like the workflow of importing my media and playing it with `beet play` without the overhead of a full-blown media player. I was curious what would be necessary for beets to implement video support. So far what I think would be required at a minimum is this:
I don't really think this is in scope for the beets project and a lot of it won't carry over into video organizing, but I can't seem to find a media organizer for videos that doesn't require a server or a GUI (Plex, Kodi, Emby, etc.), and most of those projects require manual editing of filenames in order for the lookup to succeed.
I was wondering what everyone thought about this functionality being in beets or another project similar to beets. I know I'd find it useful but I'm not too sure if it's worthy of being incorporated into the main project.
Desired Workflow
Ideally this would be the workflow I'm looking to achieve:
`beet import The\ Princess\ Bride.mp4`.
`--as-video` flag would be needed for an alternate import method.
Then after this is complete, `beet play Princess Bride` would start playing that file with the configured media player.
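The "play" half of this workflow can be sketched in a few lines, purely as an illustration of the idea rather than of how beets does it. The sketch assumes a simple in-memory mapping of titles to paths in place of a real library database, and `find_matches` and `build_play_command` are invented helper names.

```python
import shlex
from typing import Dict, List


def find_matches(library: Dict[str, str], query: str) -> List[str]:
    """Return paths whose title contains every word of the query (case-insensitive)."""
    words = query.lower().split()
    return [
        path
        for title, path in library.items()
        if all(word in title.lower() for word in words)
    ]


def build_play_command(player: str, paths: List[str]) -> List[str]:
    # `player` is the configured player command and may carry its own
    # options; the result is an argument list suitable for subprocess.run().
    return shlex.split(player) + paths
```

For example, with a library containing "The Princess Bride", the query `Princess Bride` matches that entry, and the resulting argument list can be handed to `subprocess.run` to launch whichever player is configured.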