Closed: guest271314 closed this issue 4 years ago
Why is this needed? There is no requirement that the browser stores data in this planar (or interleaved) format. The existing APIs explain how you get and/or store audio data in, say, an AudioBuffer, or the data from an AnalyserNode. How it's actually stored is an internal implementation detail.
Why is this needed?
Thorough technical details as to what is actually occurring with regard to the Web Audio API specification.
If we construct two Float32Arrays by hand, not an AudioBuffer, and set the L channel and R channel as floating point numbers in each typed array
LLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRR
and use process() of an AudioWorklet, will the output be the same as setting the values
LRLRLRLRLRLRLRLRLRLRLRLRLRLRLRLR (for a buffer of 16 frames)
in a single interleaved array?
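For concreteness, a minimal sketch of the two layouts in question (the variable names are illustrative):
// Planar: one Float32Array per channel for 16 frames of stereo audio.
const left = new Float32Array(16);   // LLLLLLLLLLLLLLLL
const right = new Float32Array(16);  // RRRRRRRRRRRRRRRR
// Interleaved: the same 16 frames packed into a single Float32Array.
const interleaved = new Float32Array(32); // LRLRLRLR...LR
for (let frame = 0; frame < 16; frame++) {
  interleaved[frame * 2] = left[frame];
  interleaved[frame * 2 + 1] = right[frame];
}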
How it's actually stored is an internal implementation detail.
Does that mean that either option above can be used with any arbitrary implementation and yield the "same" audio output result?
The specific code that gave rise to this question is https://stackoverflow.com/a/35248852
// This is passed in an unsigned 16-bit integer array. It is converted to a 32-bit float array.
// The first startIndex items are skipped, and only 'length' number of items is converted.
function int16ToFloat32(inputArray, startIndex, length) {
  var output = new Float32Array(inputArray.length - startIndex);
  for (var i = startIndex; i < length; i++) {
    var int = inputArray[i];
    // If the high bit is on, then it is a negative number, and actually counts backwards.
    var float = (int >= 0x8000) ? -(0x10000 - int) / 0x8000 : int / 0x7FFF;
    // Write from index 0 of the output so the skipped items do not leave a gap.
    output[i - startIndex] = float;
  }
  return output;
}
where in the code there is only one channel of output.
When converting a WAV file (streamed via fetch() with Content-Length; decodeAudioData() crashes on such input and is not necessary or useful in this case) to Float32Arrays using that code, when more than one channel is encoded, the current code that I am using based on the above matches the description of "interleaved":
function int16ToFloat32(inputArray) {
  let ch0 = [];
  let ch1 = [];
  for (let i = 0; i < inputArray.length; i++) {
    const int = inputArray[i];
    // If the high bit is on, then it is a negative number, and actually counts backwards.
    const float = (int >= 0x8000) ? -(0x10000 - int) / 0x8000 : int / 0x7FFF;
    // toggle setting data to channels 0, 1
    if (i % 2 === 0) {
      ch0.push(float);
    } else {
      ch1.push(float);
    }
  }
  return {
    ch0, ch1
  };
}
Is the result of that code consistent with a Web Audio API AudioBuffer from decodeAudioData() for a two-channel WAV file? Or should the code produce two Float32Arrays for such a WAV
LLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRR
as described by the MDN article?
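One way to check, sketched here under the assumption that int16Samples holds the interleaved PCM samples from the WAV data chunk and that the sample rate is 44100 (both are placeholders), is to copy the two channels into an AudioBuffer and compare it against the AudioBuffer returned by decodeAudioData() for the same file:
const { ch0, ch1 } = int16ToFloat32(int16Samples);
const audioCtx = new AudioContext();
// createBuffer() and copyToChannel() take one Float32Array per channel (planar).
const buffer = audioCtx.createBuffer(2, ch0.length, 44100);
buffer.copyToChannel(new Float32Array(ch0), 0);
buffer.copyToChannel(new Float32Array(ch1), 1);
// buffer.getChannelData(0) and buffer.getChannelData(1) can now be compared,
// sample by sample, with the channels of the decoded AudioBuffer.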
Is that information accurate for either ordering? And based on
How it's actually stored is an internal implementation detail.
does it not matter, that is, either option can be used in an implementation-agnostic manner and the same result will be output?
I think AudioBuffer.getChannelData pretty much explains it, at least for an AudioBuffer. The data are accessed in a planar fashion.
And process() for an AudioWorklet uses an API with planar data too. Each Float32Array is one channel. You can't put LLL...LLRRR...RRR into the array and expect an AudioWorklet to produce the right results. You have to use 2 arrays, one containing L values and the other containing R values. I think that's pretty clear from the description of the input and output parameters for the process method.
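A minimal sketch of what that planar layout looks like from inside process(), assuming a node configured with one stereo output (the processor name is illustrative; the module would be loaded with audioWorklet.addModule()):
class PlanarNoiseProcessor extends AudioWorkletProcessor {
  process(inputs, outputs) {
    const output = outputs[0]; // first output: an array of per-channel Float32Arrays
    const left = output[0];    // channel 0 samples for this render quantum
    const right = output[1];   // channel 1 samples for this render quantum
    for (let i = 0; i < left.length; i++) {
      // Each channel is written into its own array; nothing is interleaved here.
      left[i] = Math.random() * 2 - 1;
      right[i] = Math.random() * 2 - 1;
    }
    return true;
  }
}
registerProcessor('planar-noise', PlanarNoiseProcessor);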
How the data is stored or processed in any other place is not visible so implementations are free to do whatever they want.
@rtoy If I gather your post correctly, even if the data is stored interleaved in discrete Float32Arrays, the read of that data to produce audio output is internally converted to planar? How would that work for 4 channels from a WAV file manually parsed into Float32Arrays?
What section of the specification describes the conversion of interleaved to planar?
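For illustration only, a sketch of generalizing the conversion above to an arbitrary channel count (the function name is hypothetical, and channelCount would have to be read from the WAV header's fmt chunk):
function int16ToFloat32Planar(inputArray, channelCount) {
  const frames = Math.floor(inputArray.length / channelCount);
  // One Float32Array per channel, i.e. the planar layout.
  const channels = Array.from({ length: channelCount }, () => new Float32Array(frames));
  for (let i = 0; i < frames * channelCount; i++) {
    const int = inputArray[i];
    const float = (int >= 0x8000) ? -(0x10000 - int) / 0x8000 : int / 0x7FFF;
    channels[i % channelCount][Math.floor(i / channelCount)] = float;
  }
  return channels;
}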
There is no section to describe that because it's all internal. However, Chrome does, in fact, use planar for everything. Technically it doesn't have to and there's no way for you to tell.
As an example, let's say you've created an AudioBuffer with four channels. This is, of course, planar. Now create an AudioBufferSourceNode with that buffer. Internally, this could copy out the data and convert it to interleaved in some hidden internal buffers. Connect this to a bunch of downstream nodes and the output. There's no way to know that this was done. And placing an AudioWorklet in the graph just means the interleaved data is deinterleaved to planar for you and vice versa.
Yes, this is all rather wasteful in memory and CPU, but an implementation could do that if desired.
I don't see any need to describe this. The API that exposes planar data is properly described. Internals are internal, and you can't see it so it doesn't need to be described in the spec.
It is interesting that "Planar versus interleaved buffers" is described at MDN, yet not in the specification.
Your general point relevant to the specification appears to be that what I am requesting technical clarification for is moot, because from your perspective the implementation of the Web Audio API is a black box that is not observable.
More information is available about the Web Audio API with regard to planar and interleaved at a source other than the specification itself.
If
Yes, this is all rather wasteful in memory and CPU
is true and correct, then technically
There's no way to know that this was done.
cannot be true and correct at the same time. If a different method were used by the internal implementation, then the user could observe the difference; less waste of memory and CPU is certainly observable in the performance of the device itself. So, that means you can "see it" right now, if you actually look and ask. However, if the question is never asked, or is cloaked by an "internal implementation" veil that users in the field are simply expected to accept without question at all, then yes, you "can't see it".
Here, I do not simply accept the surface layer of explanations for anything.
As an example, I would never have reached the point of isolating why Chromium consistently crashed when variable width and height frames were played back if I had just stopped experimenting, testing and asking questions because the video decoder/encoder is an "internal implementation". The same is true for other disciplines that I have been and am engaged in, particularly history and science.
The would-be barrier to further analysis that the "internal implementation" is "not observable" does not stop research here. If anything, when such assertions of non-observability are made, the first step in the process here is to verify whether that is true.
Yes, if conversion between planar and interleaved is expensive in Chrome, as you indicated, then by necessity such internal implementation is observable: your post describing "wasteful" practices is an observation of the internal implementation that you must be able to "see" in order to describe in such detail. That detail needs to be explained for all who use the Web Audio API.
Yes, memory and CPU usage is noticeable. But not the audio produced. By "observable" I meant you can't use webaudio and JS to tell how things are implemented internally.
You can have a really wasteful ConvolverNode that uses lots of memory and CPU. Or perhaps one that uses little memory and lots of CPU or more memory with less CPU. The spec doesn't care. It's up to the implementation to do the tradeoff that is appropriate for the implementation. In any case, the output is the same (ignoring floating-point round-off issues).
If a browser wants to be wasteful, more power to them. But I think the API is clear: planar for AudioBuffer, AudioWorklet, ScriptProcessor. We don't need to describe whether planar or interleaved or a mix is used anywhere else.
The spec is a description of what things do, not how they're implemented internally except as constrained by what is supposed to be produced.
Crashes are bugs in the implementation, not necessarily in the spec. (Although sometimes they are because the spec was incorrect.)
By "observable" I meant you can't use webaudio and JS to tell how things are implemented internally.
Yes, you can https://bugs.chromium.org/p/chromium/issues/detail?id=1001948, et al. This entire branch https://github.com/guest271314/MediaFragmentRecorder/tree/chromium_crashes is dedicated to doing just that.
Consider https://github.com/padenot/ringbuf.js/blob/master/js/audioqueue.js
// Interleaved -> Planar audio buffer conversion
//
// `input` is an array of n*128 frames arrays, interleaved, where n is the
// channel count.
// output is an array of 128-frames arrays.
//
// This is useful to get data from a codec, the network, or anything that is
// interleaved, into planar format, for example a Web Audio API AudioBuffer or
// the output parameter of an AudioWorkletProcessor.
export function deinterleave(input, output) {
  // Per the comment above, input holds channel_count * 128 samples.
  var channel_count = input.length / 128;
  if (output.length != channel_count) {
    throw "not enough space in output arrays";
  }
  for (var i = 0; i < channel_count; i++) {
    let out_channel = output[i];
    let interleaved_idx = i;
    for (var j = 0; j < 128; ++j) {
      out_channel[j] = input[interleaved_idx];
      interleaved_idx += channel_count;
    }
  }
}
// Planar -> Interleaved audio buffer conversion
//
// Input is an array of `n` 128-frame Float32Arrays that hold the audio data.
// output is a Float32Array that is n*128 elements long. This function is useful
// to get data from the Web Audio API (that does planar audio), into something
// that a codec or network streaming library expects.
export function interleave(input, output) {
  if (input.length * 128 != output.length) {
    throw "input and output of incompatible sizes";
  }
  var out_idx = 0;
  for (var i = 0; i < 128; i++) {
    for (var channel = 0; channel < input.length; channel++) {
      output[out_idx] = input[channel][i];
      out_idx++;
    }
  }
}
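A hypothetical usage of deinterleave() for one render quantum of stereo data, where interleavedChunk stands in for data pulled from a decoder, the network, or a ring buffer:
const channelCount = 2;
const interleavedChunk = new Float32Array(channelCount * 128); // LRLR...LR from some source
const planar = [new Float32Array(128), new Float32Array(128)];
deinterleave(interleavedChunk, planar);
// planar[0] now holds the L samples and planar[1] the R samples,
// the same layout an AudioWorkletProcessor expects in outputs[0].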
This should be in the specification. At least describe the difference, particularly the relevant cost in CPU and memory. If the Web Audio API specification is the authoritative source for Web Audio implementations, then the impact of using a specific approach to implement the specification should be described.
Crashes are bugs in the implementation, not necessarily in the spec. (Although sometimes they are because the spec was incorrect.)
Again, in the case of variable width and height recording and playback, if I had simply stopped asking questions I would never have isolated why this code https://github.com/guest271314/MediaFragmentRecorder/blob/master/MediaFragmentRecorder.html consistently crashed Chromium for years.
In the case of Picture-In-Picture window specification https://github.com/w3c/picture-in-picture/pull/186, the mere recommendation to restrict PiP window size, that Chromium implements https://bugs.chromium.org/p/chromium/issues/detail?id=937859, ironically provides a vector for "fingerprinting" the user screen - while never actually stating why the recommendation is in the specification in the first place.
When Web Audio API specification contributors write code to convert between planar and interleaved in their own repositories, the specific subject matter is not insignificant. Yet, in order to even be abreast of that subject matter, a user in the field cannot learn about the technical difference from the controlling specification document; rather, as it stands, they learn from MDN and from reading code at large. That is an obvious omission that can be fixed by simply explaining the difference between the two. In this case the term "interleaved" does not appear once in the specification, yet it is undeniably a consideration: conversion between planar and interleaved, as you indicated, can be expensive, and that is observable at the device itself.
Why should users have to rely on MDN to describe what Web Audio API actually does - instead of the primary source document?
You and your colleagues that contribute to the specification are the experts in this domain. I am asking the experts for clarification and a detailed description of the difference between the two data structures, to be included in the specification, as that document is the primary source. I am not sure why there is any objection to that. If the subject matter were insignificant, specification authors would not be writing code to perform the conversion; MDN would not have included the description in their article; you would not have mentioned that such conversion could be wasteful.
This is a reasonable request for more information: at least a brief write-up, or a non-normative note describing the difference between the two, noting that implementers are naturally free to store the data in any manner they see fit. To omit the technology involved entirely is an omission that results in this very question. If users in the field cannot get an answer from you, the expert in the field, then users are forced to rely on secondary sources. Primary sources are always preferred to secondary sources, whether the field be journalism, science, or any other human activity where a source of information is necessary to understand the full scope of the subject matter and event horizon (astrophysics).
Where primary sources are not available, conjecture and confusion, ignorance and, frankly, folklore ensue, rather than actual primary source data. One example in the domain of history is the folklore that "Betsy Ross sewed the first American flag", which historically is inaccurate. When further researching the origin of the first U.S. national flag, it is inevitable that any researcher will encounter the Grand Union Flag or Continental Colors; from there it is inevitable that the researcher will find that the Grand Union Flag is, save for a diagonal stripe on the Union Jack, identical to the pre-existing East India Company (E.I.C.) flag. However, there is no primary source explanation for why the stripes on the U.S. national flag - the same stripes that appear on the pre-existing East India Company flag and on the flag of Goes, a municipality in the Netherlands - were evidently copied from the pre-existing East India Company flag. Heraldry is not willy-nilly. An entire book was written just on that topic, where the answer is still not clear - based on primary sources - as that decision was never explained, at least not in any primary source documents that I have been able to locate. The complete history of the Confederate States of America national flag is far more detailed than the history and origin of the U.S. national flag. We cannot ask the primary sources why they decided to use the identical design as the pre-existing East India Company flag. We can only follow the threads of actual historical evidence to theorize reasons. We find that the historical event known as the Boston Tea Party took place on an E.I.C. ship. The seed funding, or articles of value, for what would become Yale University was donated by Elihu Yale, a sacked president of the E.I.C., and there is other antecedent historical data re the E.I.C. and the Colonies which would eventually form the U.S. But we still do not have a primary source document unequivocally detailing the origin of the stripes on the U.S. national flag - the same stripes that appear on the U.S. national flag today. We could attribute the stripes to being derived from Goes, but if we have not researched the original design of the Great Seal of the United States, then we would be at a deficit, because on one side of the Seal are symbols representing six European nations, including a Belgic Lion, hence a potential reference to the stripes on the flag of Goes. The E.I.C. was doing business in the Colonies, so that gives us a direct connection. Still, we have no primary source stating exactly why the stripes on the U.S. national flag are not an original design.
Here, we have the opportunity to avoid ambiguity. I am asking the Web Audio API authors - the primary source in this domain - to describe the difference between planar and interleaved audio data structures. The term "planar" is in the specification, the term "interleaved" is not - yet clearly "interleaved" is technically relevant to the implementation of the specification, and thus cannot be non-observable. When primary sources are available, that source must be relied on, instead of third-party information, which amounts to hearsay.
The spec is a document for those "skilled in the art" as the saying goes. It is specifically not a tutorial. MDN is more a teaching/tutorial resource.
We can add a definition of planar to the spec. I'm not opposed to that. I do not see any reason to define interleaved.
AFAIK, no browser uses interleaved audio internally, except maybe when getting decoded audio. (Can't remember how ffmpeg works here.)
Hmm. I searched the spec for the word "planar". I can't find it. Can you point out where you saw this in the spec?
Get the "skilled in the art" part. Am not asking for a tutorial. Am asking for the primary source document to not omit critical antecedent information. That is why specifications include a bibiliography.
You are correct re "planar" not being in the specification. What am stating is that neither "planar" nor "interleaved" are described with a modicum of detail.
What we have is
1.4. The AudioBuffer Interface This interface represents a memory-resident audio asset. Its format is non-interleaved 32-bit floating-point linear PCM values with a normal range of [−1,1], but values are not limited to this range. It can contain one or more channels. Typically, it would be expected that the length of the PCM data would be fairly short (usually somewhat less than a minute). For longer sounds, such as music soundtracks, streaming should be used with the audio element and MediaElementAudioSourceNode.
Here we have the term "non-interleaved", yet "interleaved" is not defined. "Non-interleaved" leaves open any option that is not "interleaved", yet "interleaved" itself is not defined.
If this
AFAIK, no browser uses interleaved audio internally, except maybe when getting decoded audio. (Can't remember how ffmpeg works here.)
is true, then both "interleaved" and "planar" should be referenced by primary source citations to the controlling definition of the terms, as-applied, in the specification.
As it stands even one "skilled in the art" cannot rely on the specification for even non-normative references to exactly what is meant by "non-interleaved", and certainly not what "planar" means, yet browsers are using "planar" to implement the specification - or maybe they are not?
OK. I'm not opposed to clarifying non-interleaved. But I think those "skilled in the art" know what that means.
OK. I'm not opposed to clarifying non-interleaved. But I think those "skilled in the art" know what that means.
How can you verify that assessment without a controlling definition of the term in the specification, without any room for another person "skilled in the art" to disagree? Is there only a single definition of "non-interleaved" in this field? If there is only one possible definition or interpretation of that term, then there should not be any issue including that singular, controlling definition in the specification. At a bare minimum, a non-normative Note specifically citing the definitions relied on in the specification will avoid the potential for ambiguity. If you know exactly what the term means, then print that in the specification. From a historical, scientific, and research perspective, even if the document is for individuals or institutions "skilled in the art", a table of Definitions to refer to is always advantageous.
As mentioned above, I'm not opposed to clarifying "non-interleaved". A glossary can be added. But it's really hard to know what to add there. Something obvious to you may not be to me so should it be added? Hard to say.
It is hazardous to merely assume that any individual or institution that claims to be, or is presumed to be, "skilled in the art" can infer the meaning of a term where none is clearly defined.
A far more egregious example, in law and history, is "race" theory. Individuals and institutions, particularly in the U.S., promulgate "race" theory, and in general individuals and institutions continue to promulgate the folklore that a so-called "black" "race" or "white" "race" exists. Yet of the thousands of individuals that have been asked the very basic questions "What primary source definition of the term 'black' 'race' or 'black' 'people' are you relying on?" and "What primary source definition of the term 'white' 'race' or 'white' 'people' are you relying on?", not a single individual or institution has answered the question by referring to the controlling administrative definition of "race", "black" and "white" in the U.S. And, of course, "black" "race" and "white" "race" do not officially exist in either France or Germany, so "skilled in the art" means different things depending on the environment and the determination of the individual to get to the truth instead of relying on mere conjecture. As there is no such thing as "black racial groups of Africa" or "Middle East" or "North Africa" (see The theory of "Black" "race" in the United States: "black racial groups of Africa" do not exist and The theory of "White" "race" in the United States: "North Africa" and "Middle East" do not exist), it is very easy for intellectually lazy individuals to simply rely on what the American Association of Anthropologists deems "European folk taxonomy", instead of doing the work to get to the source of what is a grand fraud.
As mentioned above, I'm not opposed to clarifying "non-interleaved". A glossary can be added. But it's really hard to know what to add there. Something obvious to you may not be to me so should it be added? Hard to say.
Well, that is precisely what this issue is about. Why is it "really hard" to include a definition of a term that you just today claimed
But I think the API is clear: planar for AudioBuffer, AudioWorklet, ScriptProcessor. We don't need to describe whether planar or interleaved or a mix is used anywhere else.
OK. I'm not opposed to clarifying non-interleaved. But I think those "skilled in the art" know what that means.
The term must not be clear, and if you find isolating a definition for that term "really hard", then solve the problem by determining the precise primary source definition that the specification is going to rely on, for "interleaved", "non-interleaved" and "planar".
We don't use planar or interleaved anywhere. No need to define these. We could just delete the one use of "non-interleaved" for AudioBuffer because that's an implementation detail. The API implies that to be efficient you should do certain things.
In fact, internally, the buffer doesn't have to be 32-bit floats either. Firefox can use 16-bit integers in certain circumstances, but you can't tell from the audio output or the API that this is done.
"implies" is essentially the same as inferring that those "skilled in the art" all agree on the same definition, which is an unverified theory. The scientific method requires reproduction of a theory, preferably by someone other than the claimaint, to verify the theory.
At one point you mentioned impact on memory and CPU, which is observable. Perhaps now you have qualms about stating that, which necessarily means the impact on audio output is observable, due to the load on memory and CPU. If audio output is not affected, yet other processes are affected, that is still an observation.
The issue is not about what an implementation does internally, the issue is about defining the technical terms corresponding to an actual technical implementation of the specification. Since "planar" or "interleaved" or "non-interleaved" are the apparent options for implementation, those terms should be defined for scope or possible implementations.
"non-interleaved" necessarily requires a definition of "interleaved".
Since you are "skilled in the art" and still find defining "non-interleaved" "really hard", the term should not be in the specification if you do not want to clearly define it.
It would be beneficial to include a glossary of the terms used. Even among those "skilled in the art", in any discipline, there could still be disagreement as to what terms mean.
Either define the term "non-interleaved" or remove the term from the specification, to avoid conjecture.
You have to assume some basic level for "skilled in the art". If you don't, you end up having to define everything. And to be facetious, this includes defining "audio" or even "the". My basic level is that "non-interleaved" is understood as basic knowledge.
As for observable memory and CPU, that's outside the scope of the specification. The spec says you have this node and when given this set of inputs and parameters you get some output. How you get that is up to you. You can be wasteful of memory and CPU if you want. Or be clever. This is all outside the scope.
Yes, defining everything is work. It provides certainty. This issue is specific to the term "non-interleaved", which you already stated is "really hard" to define. That is not reliance on "basic knowledge" for an individual "skilled in the art"; thus you inserted the pre-condition "basic level" even within the scope of "skilled in the art". Since "non-interleaved" is in the current iteration of the specification, and as yet has not been clearly defined, I am asking that that specific term be defined in the specification. A term cannot be "basic level" and "really hard" to define for one "skilled in the art" at the same time. That defies logic.
I did not say "non-interleaved" is hard to define. It's not. It's hard to know what to put in a glossary.
At this point, I just want to remove it from the AudioBuffer section and maybe just say that it nominally contains linear PCM samples. Nothing about interleaving, floating point or even the range.
I did not say "non-interleaved" is hard to define. It's not. It's hard to know what to put in a glossary.
The definition you are relying on.
Searching the internet for clarity leads to answers from individuals essentially quoting MDN, and to other individuals pointing out that the language needs clarification.
Consider a sampling,
Interleaved / Non-Interleaved Decoded Audio #59 https://github.com/WICG/web-codecs/issues/59
The web platform only use planar buffers for audio, but that's probably because there was no interaction with codecs or IO, where interleaved audio is often preferred.
A interleaving/deinterleaving routine is probably very very fast in wasm, but I don't know how fast.
What is the difference between AV_SAMPLE_FMT_S16P and AV_SAMPLE_FMT_S16? https://stackoverflow.com/questions/18888986/what-is-the-difference-between-av-sample-fmt-s16p-and-av-sample-fmt-s16
AV_SAMPLE_FMT_S16P is planar signed 16 bit audio, i.e. 2 bytes for each sample which is same for AV_SAMPLE_FMT_S16.
The only difference is in AV_SAMPLE_FMT_S16 samples of each channel are interleaved i.e. if you have two channel audio then the samples buffer will look like
c1 c2 c1 c2 c1 c2 c1 c2...
where c1 is a sample for channel1 and c2 is sample for channel2.
while for one frame of planar audio you will have something like
c1 c1 c1 c1 .... c2 c2 c2 c2 ..
now how is it stored in AVFrame:
for planar audio: data[i] will contain the data of channel i (assuming channel 0 is first channel).
however if you have more channels than 8, then data for rest of the channels can be found in extended_data attribute of AVFrame.
for non-planar audio data[0] will contain the data for all channels in an interleaved manner.
followed by comment
I assume c1 c1 c2 c2 must refer to the bytes in the buffer, not the samples. Should either change it to c1 c2 c1 c2 for samples, or update the text to say bytes. – DuBistKomisch Sep 20 '16 at 10:55
Planar/interleaved option? #3 https://github.com/raymond-h/pcm-format/issues/3
When it does matter, pretty much all other modules use interleaved format, because that works nicely with streams, since you typically don't know when exactly a stream will end.
What's the interleaved audio ? [closed] https://stackoverflow.com/questions/17879933/whats-the-interleaved-audio
Generally speaking, if you have 2 channels, let's call them L for left and R for right, and you want to transmit or store 20 samples, then:
Interleaved = LRLRLRLRLRLRLRLRLRLR
Non-Interleaved = LLLLLLLLLLRRRRRRRRRR
comment following
good answer, although non-interleaved generally means that you would actually have two buffers for your example, one containing only left samples, and one containing only right samples. – Mark Heath Jul 26 '13 at 15:53
Re
At this point, I just want to remove it from the AudioBuffer section and maybe just say that it nominally contains linear PCM samples. Nothing about interleaving, floating point or even the range.
That is one option. Doing nothing is not a viable option; at least now that the problem is recognized, doing nothing would be malfeasance at this point. Include the definition or remove the term, as suggested above.
One way to view this issue is as an opportunity for the Web Audio API to actually be an authority on web platform audio, since this is precisely the subject matter of the specification.
Teleconf: Remove "non-interleaved".
Looks like you have done the exact opposite of what the user wanted! Way to go ...
Describe the issue
The specification does not include the term "planar". MDN does include the term "planar" relevant to Web Audio API https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API/Basic_concepts_behind_Web_Audio_API#Planar_versus_interleaved_buffers
Where Is It
Missing from the specification.
Additional Information
Does encoding raw PCM as "planar versus interleaved buffers", one or the other, impact AudioWorkletProcessor output?