fasterthanlime / jsmad

:saxophone: Javascript MPEG-1 Audio Layer III (mp3) and ID3v2 decoder
http://audiocogs.org/codecs/mp3/
762 stars 41 forks source link

Patch 4 #16

Closed gliese1337 closed 13 years ago

gliese1337 commented 13 years ago

I dove into the decoder to figure out why switching from strings to typed arrays was causing problems, and discovered that there's tons of code that requires strings! And holy cow, so many different methods all named .read().... I successfully patched up the FileStream to read an ArrayBuffer instead of a byte string, but it results in slow startup and updating all of the frame-decoding code to convert buffers back to strings was totally not worth it. It would probably be worth the effort if we were doing a port to node.js and could make use of node's memory buffers.

This is a hybrid solution that gives separate access to string data and binary data to the separate parts of the decoder that want different kinds of data. Implementations of .get() and .getb() convert whatever happens to be there (binary or string) into whatever is currently being asked for as needed to avoid loading delays. If the browser supports DataView, it's a major performance gain; if not, it's a wash for string-backed streams, and a gain for buffer-backed streams.

Along the way, I made some minor performance improvements. Still a lot to be done there.

Greater gains could be made by shifting the DataView creation back into the concrete stream .get() methods, propagating all of the dependencies for that is just ridiculous unless you can assume browser support, and I doubt the polyfill version of DataView has sufficient performance characteristics to make it worth using right now.

nddrylliog commented 13 years ago

tons of code that require strings!

I believe the most significant part is the whole id3v23 and id3v22 code. Otherwise, much of the actual MPEG decoding relies on bit.js, which can easily be adjusted to use typed arrays instead :)

If the browser supports DataView, it's a major performance gain; if not, it's a wash for string-backed streams, and a gain for buffer-backed streams.

Oh right, so subarray() and creating a new UInt8Array actually copies all the elements in the new array? I hadn't seen that. So yes, obviously DataView seems the way to go.

Greater gains could be made by shifting the DataView creation back into the concrete stream .get() methods, propagating all of the dependencies for that is just ridiculous unless you can assume browser support, and I doubt the polyfill version of DataView has sufficient performance characteristics to make it worth using right now.

I'm not entirely sure what you mean by "propagating all of the dependencies" ? There is not much code in the actual MPEG decoding that uses actual strings, or that, at least, couldn't change easily to Uint8Array. All the 'buffer/this_frame/next_frame' stuff in Mad.Stream should be changed to use Uint8Arrays imho.

Also, I'm beginning all this should be merged in a branch, not in master yet, as there are too many potential changes to all parts of the code, that would make devleoping other features harder for now. I'll merge this later in the day - sorry for not answering your pull requests immediately, there's just so many hours in a day! :)

Cheers

On Tue, Jun 21, 2011 at 6:01 AM, gliese1337 < reply@reply.github.com>wrote:

I dove into the decoder to figure out why switching from strings to typed arrays was causing problems, and discovered that there's tons of code that requires strings! And holy cow, so many different methods all named .read().... I successfully patched up the FileStream to read an ArrayBuffer instead of a byte string, but it results in slow startup and updating all of the frame-decoding code to convert buffers back to strings was totally not worth it. It would probably be worth the effort if we were doing a port to node.js and could make use of node's memory buffers.

This is a hybrid solution that gives separate access to string data and binary data to the separate parts of the decoder that want different kinds of data. Implementations of .get() and .getb() convert whatever happens to be there (binary or string) into whatever is currently being asked for as needed to avoid loading delays. If the browser supports DataView, it's a major performance gain; if not, it's a wash for string-backed streams, and a gain for buffer-backed streams.

Along the way, I made some minor performance improvements. Still a lot to be done there.

Greater gains could be made by shifting the DataView creation back into the concrete stream .get() methods, propagating all of the dependencies for that is just ridiculous unless you can assume browser support, and I doubt the polyfill version of DataView has sufficient performance characteristics to make it worth using right now.

Reply to this email directly or view it on GitHub: https://github.com/nddrylliog/jsmad/pull/16

Amos Wenger Community Development Engineer official.fm - "The Do It Yourself Music Club" I'm @nddrylliog https://github.com/nddrylliog on GitHub

gliese1337 commented 13 years ago

I believe the most significant part is the whole id3v23 and id3v22 code.

Also layer3.js (constructing main_data). That should probably be using ArrayBuffers (especially for memcpy!), but my attempt to make the transition resulted in a lot of opaque frame decoding problems; it'd do one frame just fine, and then get misaligned or something. I suspect you'll need a new BufferStream constructor in addition to the existing StringStream to handle feeding a buffer-based main_data into Mad.Bit().

Otherwise, much of the actual MPEG decoding relies on bit.js, which can easily be adjusted to use typed arrays instead :)

Fortunately, it doesn't even need to be adjusted! bit.js only ever access the underlying stream through .getU8(), so as long as that's accessing an ArrayBuffer in the underlying stream, bit.js automatically gets the benefits.

Oh right, so subarray() and creating a new UInt8Array actually copies all the elements in the new array?

Oh, no, it creates a new view on the same memory location. The problem is that .get() doesn't know what the caller is going to do with the data it returns, so it has to return something generic- a plain buffer or a Uint8Array. The only way to avoid that would be to provide stream-specific implementations for all of the get[Type]() functions. If you have DataView, that's fine, get() can return a DataView and you can extract whatever type you want out of it. If you don't have DataView, though, to get the same functionality you have to construct yet another new typed array object to wrap whatever .get() returns, and do checks for native endianness to make sure it'll handle the conversion correctly.

So, here're 4 possibilities for calls to get[Type](): Buffer-backed stream, with native DataView: one DataView object constructed in get(), one method call on it to extract the wanted integer. Buffer-backed stream, no native DataView: Subarray of Uint8Array created in get(), a new Typed Array created in the caller or individual bytes composed depending on native endianness check. String-backed stream, native DataView: Data copied & DataView created in get(), one method call on it to extract the wanted integer. String-backed stream, no native DataView: Data copied & Uint8Array created in get(), a new Typed Array created in the caller or individual bytes composed depending on native endianness check.

If the browser supports DataView, then you only ever have to construct one new object and do one native method call on it. If you don't have native DataView, though, buffer-backed streams avoid any charCode conversions and might be able to use native functions with one extra object-creation, and string-backed streams just end up with all of the charCodeAt() calls moved up one level into .get().

I'm not entirely sure what you mean by "propagating all of the dependencies"

Well, if you have native DataView, you'd like to use it. If you don't, there's no point, because if you use a polyfill you'll incur the cost of creating a shim-object in addition to doing all of the other by-hand byte processing that's going on anyway to simulate the behavior of DataView. So, if we were to move the conditional creation of a DataView or a Uint8 subarray up into .get(), all of the callers of .get() would have to check which kind of object they got back before finishing the calculation (API dependencies are moving up the function stack from .get() to its callers) . You can avoid a performance penalty by assigning different implementations of all of the get-using functions based on a check for DataView availability at load time, but that's icky! Nobody wants to write and maintain all that duplicate code.

So, for the moment, I've minimized the amount of conditional code by doing the DataView construction in .get()'s callers rather than .get() itself.

All the 'buffer/this_frame/next_frame' stuff in Mad.Stream should be changed to use Uint8Arrays imho.

Almost certainly, yes.

sorry for not answering your pull requests immediately, there's just so many hours in a day! :)

Heck, I've already got the code myself, so no problems here. I'll point out, though, that I've pretty much obsoleted my original pull request already. Although, that code may still be useful as reference for de-duplication techniques.

nddrylliog commented 13 years ago

Okay, so this pull request cannot be automatically merged, apparently - I'll try to cherry-pick your commits one by one.

nddrylliog commented 13 years ago

I've pulled all the changes, AjaxStream seems to be working fine (haven't tested performance yet) but indeed, it broke FileStream. I'm running out of time to spend on it for this month, but I'll definitely have a look at it from August 2 on.

If Jens or someone else has the will to review the changes, I've just pushed them under the branch "typedarrays" on this repo (with resolved conflicts as a bonus)