Borewit / music-metadata

Stream and file based music metadata parser for node. Supporting a wide range of audio and tag formats.
MIT License
926 stars 92 forks source link

Roadmap v1.0 #4

Closed mainrs closed 7 years ago

mainrs commented 7 years ago

I just copied the goals into this issue to keep track of them. We can later just create a checklist out of them.

Edit: Concerning the refactoring. It would probably be better to start from scratch and use the current code base as a reference. Changed the title to Roadmap v1.0. Refactoring will probably break the API. Bumping the version up might be a good idea.

mainrs commented 7 years ago

Since I have some free time over the weekend, any definite ideas on how to design the refactored code? Or is the idea generally the same (IStreamParser interface to provide general contract and implementing it for each data format). I would love to start working on the project :smile:. Do you prefer a fork or would you allow me to work as a collaborator under a new branch @Borewit?

Borewit commented 7 years ago

Following the GitHub best practices, I suggest you start with a fork, and create a branch withing your own forked repo to do the work.

It would indeed mean we have to update the IStreamParser, which is passing the NodeJS.ReadableStream . The stream is basically reading as fast as it can, broadcasting events, triggering the parsers to consume the data. It is not very nice in a situation where you want to do something like: id3-parser, does peace data look like a valid header to you? Parser says "no", let's try to feed the same data to the next parser.

These changes are huge, and may not be the best starting point. Not very likely you gonna complete that in your weekend 😉.

Start from scratch sounds scary to me. Even if it turns out that that we had to re-write every single line, building from the current repo is good thing. The biggest things are build in small steps.

What about you? Would you like some specific functionality to be added, or other ambitions / objectives?

mainrs commented 7 years ago

I am currently writing a hackable music player, inspired by Atom. I would love to have the ability to also write tags. Ye I know, it is a huge chunk of code that will be changed. My thought process was that we need to change the IStreamParser interface anway and since that is the core of the library, we might as well start from zero. It would allow us to reorganize some stuff and take other things in account (like unit tests) from the beginning.

Yep, the weekend won't be enough for it. My last exam is on Thursday next week and after that I have 10 days free :smile: Hopefully I am getting something done in that time :smile:.

like: id3-parser, does peace data look like a valid header to you? Parser says "no", let's try to feed the same data to the next parser.

What exactly do you mean?

Do you already have some ideas on how to make the parsing more effective? You already gave the idea to allow parsers to jump and seek bytes. That doesn't have anything to do with the IStreamParser interface, it would be implementation specific. It might not be that much of a change to the core now that I reflect it again... Edit: By aware though, I am fairly knew to JS. I have a good amount of experience with C#/Java though. My code might not be THAT good...

Borewit commented 7 years ago

You look like a smart guy to me. I have a C#/Java background myself. The JavaScript world is like restaurant menu card, it keeps changing and there are just too many things to choose from.

IStreamParser cannot jump to specific point in the stream. You have to digest the stream in order. You cannot jump to end of file -100 bytes. The best you can do is read all the way until (ignore the data in between) until the end of stream is reached, end then parse go back to the last 100 bytes we kept somewhere. Same applies in case you are not sure what content to expect, you cannot read ahead and go back. "Random file access", which I referred as "seek" would be the solution, would also improve performance. You got the point 👍, if you want to write (to update the metadata), the read stream is in our way, we should get rid of it.

Starting from scratch... We got 941 tests checked in, an excellent baseline we can use as a regression test how well progress compared to the "old" architecture. The pain of having old code in the way, can also be seen as a push to come up with a workable solution in intermediate releases. Branches are invented to deviate a way from the stable code. Using this code as a reference, I don't see why we should erase its history. But, yeah, painful it will be such a big change :wink:. Anyway I will give a thought.

mainrs commented 7 years ago

Ok. I think the argumentation about using the tests as a baseline makes sense. Wouldn't it be then better to first port the tests to a testing framework and then start to change the actual code?

Edit: Might be a good idea to change the test files too. Some files are 15MB and even contain copyrighted material. Generating test files shouldn't be hard at all. What do you think? We just need "empty" music files for testing since we usually do not care about the actual audio data that is compressed/packaged. Porting the tests to ts doesn't offer any benefit though. It just brings up more problems (definition errors and stuff like that. There is no real need to move to ts).

Just a question: How exactly does the test for the event emitters work? I looked into the code but couldn't figure out if the events get even called. Looks like they get registered but the file already starts to parse before the event emitter is returned (for exampel test-afs.js).

Borewit commented 7 years ago

The test cases would a good starting point.

Regarding the sample files, most of the files I inherited and all files I added, are limited to a very small samples. I guess that is no copyright violation. Would be nice to shrink the larger ones.

The file starts to be parsed immediately. I would expect that the registration of the event handlers are done prior to the first callback. As far as I know JavaScript is single threaded; only asynchronous calls are executed in parallel. Its like a GUI thread.

I suggest to cut down the roadmap into smaller issues.

I will do my best write out some smaller improvements as well.

If you are not familiar with the tool already, please have a look at Picard. A lot of what I added to the original module is based on how this tool works. Try to add some meta-data to some of your music files. Don't make the mistake to blindly update all your music file, do a single album at a time and review before saving. Select the release manually, Picard will pick one, even if it is unsure what to choose. This is where it all about, if you link with the correct release, it will MusicBrainz augment your music files with the right tags. Based on this identification, If updates are available, you can easily update the music files, or get additional which you cannot even store in meta tags.

mainrs commented 7 years ago

The file starts to be parsed immediately. I would expect that the registration of the event handlers are done prior to the first callback. As far as I know JavaScript is single threaded; only asynchronous calls are executed in parallel. Its like a GUI thread.

I wrote a sample test using a testing framework, like mocha. I parsed the file and registered the event callbacks within their own test method. Looks like the file gets immediately parsed as soon as you call the parseStream method. Because any event callback registered after the method doesn't even get called.

This might cause some race conditions if I am not mistaken. If the file is too small and you register too many callbacks, there should be a point where some of them do not get executed.

Borewit commented 7 years ago

If JavaScript would run concurrent that would be an issue. But is doesn't. So the whole test block will initialize, including event registration, without any interrupt (ref: Run-to-completion).. So the JavaScript parsing will start after that.

In other words, reading the audio file (by your OS and Node.js) may already happen in the background, but the event handling and parsing it with actual JavaScript will only occur when the previous event handling completed. And everything runs in this event manner, also the initialization of your unit test.

AoDev commented 7 years ago

Hi. I just found this "fork" from musicmetadata that was not being updated anymore. That's very good that you are continuing. I am also building a music player and currently wish performance to be improved while parsing metadata. Especially I wrote this issue: https://github.com/leetreveil/musicmetadata/issues/142 and was wondering if it could be done here.

Borewit commented 7 years ago

Hi @AoDev. I doubt if ignoring images would result in significant faster parsing of MP3 files. As discussed in the top of this thread. I believe that enabling random file access will have a much greater impact. Have you already tried to set the duration flag to false?

AoDev commented 7 years ago

@borewit Not sure if you closed this issue because you don't wish people to comment on it anymore? You could move the roadmap into the readme or some roadmap.md. I like projects where goals are written. Especially to send pull request :)

Back to the issue, I mentioned performance but it's not to get faster processing. It's for memory.

I use this module to build a media library database by scanning the filesystem. I do this in an app built with Electron. There is an issue where the reference to a remote object could be lost and the parsing process would stop. https://github.com/electron/electron/issues/8205. While on the Electron side it was not clear why it happened, I realised that skipping the manipulation of the image during the process prevents the issue from happening. So while not sure, I thought reducing memory usage was a good strategy.

Borewit commented 7 years ago

By closing this issue, I am hope to prevent all issues ending up here, I broke down the road-map into individual issues, I moved your request to: #11.

WholeMilk commented 5 years ago

Is there still plans to support CRC ?

Borewit commented 5 years ago

@WholeMilk: Not directly, can you explain what your use case is? Preferable in a new issue.