ebu / ebu_adm_renderer

The EBU ADM Renderer, written in Python, is the reference implementation of EBU Tech 3388
https://ear.readthedocs.io
BSD 3-Clause Clear License

Files with many thousands of objects? #43

Closed · TomArrow closed this 2 years ago

TomArrow commented 2 years ago

I'm not sure where else to pose this question, so I'll try here. After reading some of the ADM metadata documentation, it appears that the IDs of some item types, like audioPackFormat, have only 4 digits available for each item's unique ID.

Is it then fair to assume that the specification has a hard limit of 9999 objects of those types? If not, what would be the correct way to go above that? My use case specifically would be to have a few hundred "physical" channels in the .WAV file and thousands or tens of thousands of objects at various points in time, often very short ones. I'd be happy to keep all the objects in a single audioContent, one per physical audio track.

Aside from the ID issue, libbw64 seems to have a limit of 1024 track UIDs, and this seems to get applied to the chna chunk. I do not actually need more than 1024 "physical" tracks, but now I am wondering if I can reuse the same audioTrackUID, audioTrackFormatID, and audioPackFormatID.

Those questions arose after reading this document: https://adm.ebu.io/use_cases/dynamic_mixed.html

That document deals with multiple objects sharing the same "physical" track, but each gets its own audioTrackUID, which I imagine would easily bloat the chna chunk past 1024 items. For the most part, all the format info stays the same (it's the same bw64 file, after all), except that I would like dynamic interpolated positioning data for each object. But that comes from the audioBlockFormat, which seems connected to the audioPackFormat via the audioChannelFormat, so do I need an individual audioPackFormat for each object? And if so, the mentioned ID issue with only 4 free digits arises.

I suppose I could just limit myself to one object per track, but then I would have no proper logical distinction between unrelated objects and I would also have to let silence or lack of content be part of those objects. Which I suppose wouldn't affect the rendering in the grand scheme of things, but semantically it's not ideal.

For reference, my intended use case is to export ADM files from a video game engine; that's why there can be so many different objects at random times throughout the duration of the file.

As a side note, it would be great if the ADM XML data could be saved in compressed form in the chunk. Gzip perhaps, or zlib or zstd. It should compress really well due to the repeated keywords, and with a use case like mine the file size adds up quickly. It's not terribly important, though; just a nice-to-have.

tomjnixon commented 2 years ago

Hi, asking here is fine. I'm going to close this as it's not really an EAR issue, but feel free to ask follow-up questions.

> Is it then fair to assume that the specification has a hard limit of 9999 objects of those types?

The IDs are (or should be) hexadecimal, which gives a maximum of 65535 for 4 digits.
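To make that concrete, here's a small sketch (my own helper names, not anything from the spec or the EAR) of how the 4 hex digits bound the ID space:

```python
# ADM IDs per BS.2076: e.g. AO_xxxx for audioObject, AP_yyyyxxxx for
# audioPackFormat (yyyy = typeDefinition, 0003 = Objects). The per-item
# part is 4 hex digits, so the ceiling is 0xFFFF = 65535, not 9999.

def audio_object_id(n: int) -> str:
    assert 0 < n <= 0xFFFF, "only 4 hex digits available"
    return f"AO_{n:04X}"

def audio_pack_format_id(n: int, type_def: int = 0x0003) -> str:
    assert 0 < n <= 0xFFFF
    return f"AP_{type_def:04X}{n:04X}"

print(audio_object_id(9999))         # AO_270F -- still plenty of room
print(audio_object_id(0xFFFF))       # AO_FFFF -- the actual ceiling
print(audio_pack_format_id(0x1001))  # AP_00031001
```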

> Aside from the ID issue, libbw64 seems to have a limit of 1024 track UIDs, and this seems to get applied to the chna chunk.

This limit comes from the chna chunk being written before the data chunk, which is generally preferred. You could just change '1024' to whatever number you need, but that's not the cleanest solution. There is a TODO in setChnaChunk to have it write the CHNA after the data chunk if it doesn't fit in the pre-allocated space, but that might cause other issues, as you'd end up with a file with multiple JUNK chunks.
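For scale, a back-of-envelope sketch (assuming the fixed 40-byte audioID entries defined in BS.2088) of what a larger pre-allocation actually costs:

```python
import struct

# One chna audioID entry (BS.2088): uint16 trackIndex + char[12] UID
# + char[14] trackFormatIDRef + char[11] packFormatIDRef + 1 pad byte
# = 40 bytes; the chunk body starts with uint16 numTracks + uint16 numUIDs.
entry = struct.pack("<H12s14s11sx", 1, b"ATU_00000001",
                    b"AT_00031001_01", b"AP_00031001")
assert len(entry) == 40

def chna_body_size(num_uids: int) -> int:
    return 4 + 40 * num_uids

print(chna_body_size(1024))   # 40964 -> ~40 KiB, the default pre-allocation
print(chna_body_size(10000))  # 400004 -> ~390 KiB, still cheap to reserve
```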

Some other options would be:

In general you do need one CHNA entry per audioChannelFormat you want to render, so you can't work around this issue.

> so do I need an individual audioPackFormat for each object?

Not necessarily; you can have multiple audioChannelFormats in an audioPackFormat. This is really for cases like 'direct speakers' (channel-based content), where you genuinely have one format with multiple channels; I'm not sure that applies in your case.

The technical limitation is that when you reference your tracks from audioObjects, you have to reference all channels in the referenced audioPackFormat. In other words, if you group audioChannelFormats into an audioPackFormat, you can't then split them into separate audioObjects.
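To make that concrete, here's a hypothetical fragment (IDs and names invented; element structure per BS.2076): once two channels share a pack, an object referencing that pack has to carry tracks for both.

```python
import xml.etree.ElementTree as ET

frag = """
<audioFormatExtended>
  <audioPackFormat audioPackFormatID="AP_00031001"
                   audioPackFormatName="pair" typeLabel="0003">
    <audioChannelFormatIDRef>AC_00031001</audioChannelFormatIDRef>
    <audioChannelFormatIDRef>AC_00031002</audioChannelFormatIDRef>
  </audioPackFormat>
  <audioObject audioObjectID="AO_1001" audioObjectName="obj">
    <audioPackFormatIDRef>AP_00031001</audioPackFormatIDRef>
    <!-- must reference a track for EVERY channel in the pack; these
         two channels can no longer be split into separate objects -->
    <audioTrackUIDRef>ATU_00000001</audioTrackUIDRef>
    <audioTrackUIDRef>ATU_00000002</audioTrackUIDRef>
  </audioObject>
</audioFormatExtended>
"""
root = ET.fromstring(frag)
channels = root.findall("./audioPackFormat/audioChannelFormatIDRef")
tracks = root.findall("./audioObject/audioTrackUIDRef")
assert len(tracks) == len(channels)  # one track UID per channel
```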

> I suppose I could just limit myself to one object per track, but then I would have no proper logical distinction between unrelated objects and I would also have to let silence or lack of content be part of those objects. Which I suppose wouldn't affect the rendering in the grand scheme of things, but semantically it's not ideal.

Yeah, I agree.

> For reference, my intended use case is to export ADM files from a video game engine; that's why there can be so many different objects at random times throughout the duration of the file.

This sounds fun!

> As a side note, it would be great if the ADM XML data could be saved in compressed form in the chunk. Gzip perhaps, or zlib or zstd. It should compress really well due to the repeated keywords, and with a use case like mine the file size adds up quickly. It's not terribly important, though; just a nice-to-have.

BS.2088-1 specifies a bxml chunk, which is the same as axml but (optionally) compressed with GZIP. I don't know if there's any support for it yet, though; none of the EBU tools handle it. I'd be interested in adding it to the EAR, and maybe a chunk definition to libbw64 (though probably not automatic decompression, as that would introduce a dependency).
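For what it's worth, here's a rough sketch of the compression win on that kind of payload, using plain gzip from the standard library (not actual bxml support; the bxml framing itself is specified in BS.2088-1):

```python
import gzip

# Fake but representative axml payload: thousands of near-identical
# audioBlockFormat elements, the repetitive case described above.
block = ('<audioBlockFormat audioBlockFormatID="AB_00031001_%08X">'
         '<position coordinate="X">0.50</position>'
         '<position coordinate="Y">0.50</position>'
         '<position coordinate="Z">0.00</position>'
         '</audioBlockFormat>')
xml = ('<audioChannelFormat>'
       + ''.join(block % i for i in range(10000))
       + '</audioChannelFormat>')

raw = xml.encode('utf-8')
packed = gzip.compress(raw, compresslevel=9)
print(f'{len(raw)} -> {len(packed)} bytes '
      f'({100 * len(packed) / len(raw):.1f}%)')  # typically a few percent
```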

TomArrow commented 2 years ago

Thanks for the info. I think with long captures even 65535 would not be enough, so for now I opted for having one object per channel. That seems to work fine, but it's not very satisfying; it's a bit like an echo from the times when WAV files could only be 2-4 GB, haha. Having at least a uint32's worth of objects would be nice. Hell, if it were up to me, I'd make it a uint64, just to be future-proof. But I suppose it's a bit late for that now.

Having multiple channels per pack ... I'm not sure I'm understanding it correctly, but that seems like it would be kind of a semantic nightmare too, and who knows how processors would deal with it, since it's not separate channels but the same channel at different points in time. Unless I'm misunderstanding.

In any case, the results I'm getting so far are pretty encouraging. I just managed to do my first proper working render with EAR, and it's a really cool toy to have.

That's good to know about the bxml chunk; that's exactly the kind of thing I was looking for. I guess I'll wait a bit until it's more widely supported. I do want to render my files, after all.

One more thing I'm wondering about the metadata ... I couldn't quite figure out what the unit of the coordinate system is supposed to be. There's talk about a unit circle and absolute values, but ... in what units? I believe this would be important in order to get a proper spatial representation of distance during the render. For example, if one object were at 0,0,0 and another at 0,100,0, the relative loudness between them would differ depending on the unit: a sound 100 mm away would sound louder than a sound 100 m away, but the one at 0,0,0 would always be at full loudness.

Also, I don't know if this is something EAR does, but I vaguely recall reading about attenuating sound based on frequency and distance, according to how sound travels through air, which would also require a real-world coordinate system to make sense. For example, sounds farther away lose high-frequency content.

Cheers

TomArrow commented 2 years ago

Okay, I did a few more tests, and it seems EAR does not really attenuate the loudness of objects, or at least it doesn't do so beyond a distance of 1. Is that correct? (Specifically, my tests all used Cartesian coordinates.) Maybe I misunderstood ADM a bit. I thought the intent was to be able to place an object anywhere in a scene, but it appears it can only be placed within the unit sphere/cube, the outer edge of which appears to be defined by the screen distance?

I did find an answer to the unit question at least; I think I must have been looking at the wrong or outdated files. So that "absoluteDistance" parameter basically gives the unit sphere/cube a size in metres. That way I could say that my unit cube is 200 metres across and scale all my objects within that, which I guess is fair enough. However, searching this repository for "absoluteDistance" does not give any results beyond attribute definitions. Am I right in assuming that this parameter is ignored? Or am I completely misunderstanding the intent of ADM?

Does it make sense for me to pursue using EAR to get properly distance-attenuated loudness of objects, or should I write my own renderer for that?

tomjnixon commented 2 years ago

Sorry for the delay.

Re. coordinate systems: the description of these is in BS.2076-2 section 8.

Re. distance: the EAR doesn't ever attenuate based on distance. This is because distance rendering doesn't make much sense on loudspeakers without reverberation (or at least some reflections), and that was considered too brittle and complicated for the EAR. Rather than implement partial distance rendering, it was thought best to leave that entirely up to the producer.

Some of the rendering techniques used also come from the cinema world where you're expected to be able to move objects around the space within the loudspeakers without a level change.

The EAR treats objects with a distance of 1 (for polar, or objects on the surface of the cube for cartesian) as being on the loudspeakers. Objects that are closer than that result in effects which are useful for moving objects through the space (generally causing spreading across more loudspeakers), and there's no effect for objects further than that. The absoluteDistance is ignored.

If you want a distance-related gain effect (which you probably do for what you're doing!), that should be encoded into the gain parameter, or the audio samples, with a distance of 1.
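For what that baking could look like, here's a minimal sketch (the inverse-distance law and names are my own choice, not anything the EAR or ADM prescribes): clamp the position onto the unit sphere and carry the distance loss in the gain.

```python
import math

def bake_distance(pos, ref_dist=1.0):
    """Turn an absolute cartesian position (metres) into EAR-friendly
    metadata: a position clamped to the unit sphere plus a gain that
    carries the 1/r attenuation."""
    x, y, z = pos
    r = math.sqrt(x * x + y * y + z * z)
    if r <= ref_dist:
        return pos, 1.0                # inside the sphere: leave it to EAR
    s = ref_dist / r                   # scale position onto the sphere...
    return (x * s, y * s, z * s), s    # ...and reuse 1/r as the gain

position, gain = bake_distance((0.0, 100.0, 0.0))
print(position, gain)  # (0.0, 1.0, 0.0) 0.01 -> goes into position + gain
```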

I hope that helps. It seems like there is a bit of a mismatch between what you're doing and what the EAR was designed for. It should be relatively easy to make it work, though.

TomArrow commented 2 years ago

No worries about the delay. Thanks for the info.

I thought about the gain approach, but that wouldn't account for something like frequency-based attenuation.

Would it be altogether "wrong" for me to simply write a renderer (for example using OpenAL) that accepts absolute cartesian coordinates instead of the normalized ones and does what I need? I mean, would that be considered a breach of the ADM standard?

A few more questions if you'll indulge me (no hurry!):

Anyway, thanks for your time! This does help and clarify things.

tomjnixon commented 2 years ago

> I thought about the gain approach, but that wouldn't account for something like frequency-based attenuation.

Yeah, you'd have to bake that into the samples if that's what you want.
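For instance, a crude air-absorption stand-in could be a one-pole low-pass whose cutoff falls with distance (the curve and constants below are invented for illustration, not from any standard):

```python
import numpy as np
from scipy.signal import lfilter

def air_absorb(samples, distance_m, fs=48000):
    # Cutoff falls with distance: 20 kHz nearby, ~500 Hz far away (made up).
    cutoff = max(500.0, 20000.0 / (1.0 + distance_m / 50.0))
    a = 1.0 - np.exp(-2.0 * np.pi * cutoff / fs)  # one-pole coefficient
    # y[n] = a*x[n] + (1-a)*y[n-1]: a simple 6 dB/oct low-pass
    return lfilter([a], [1.0, a - 1.0], samples)

fs = 48000
noise = np.random.default_rng(0).standard_normal(fs)
dull = air_absorb(noise, distance_m=200.0, fs=fs)  # audibly darker
```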

> Would it be altogether "wrong" for me to simply write a renderer (for example using OpenAL) that accepts absolute cartesian coordinates instead of the normalized ones and does what I need? I mean, would that be considered a breach of the ADM standard?

Not at all -- the ADM is really intended to help transport metadata between systems, so that's expected. The EAR was really intended to give a reference for how things should sound, which can help with interoperability if people agree to use it, but it's definitely not the only way of doing things.

> I noticed that sounds coming from the back get "muffled" by EAR. Is that intended? I would think that when there are speakers behind me, they should just sound normal, just like if the sound was behind me actually. I noticed that with the 0+5+0 setting, didn't try others. Or am I imagining this?

Sounds behind you should get split between the two rear loudspeakers only. The exact behaviour depends on the metadata, though; this only applies for (I think):

Anything else will start spreading to the other loudspeakers, too, which could well be perceived as being muffled.

> It seems that in the 5 speaker preset when a sound comes from a direction where no speaker is available, it more or less disappears, which feels a bit strange. For example a sound coming from the top or bottom I think. Is that intentional or am I doing something wrong?

That shouldn't happen; the gains should be normalised as long as the gain parameter is not set. Are you sure you're using 0+5+0? If you have some example metadata, the software version, and CLI arguments, I could investigate.

> Does the ADM format allow for XML style comments

It should parse fine (it uses a standard XML parser), but you will not be able to access the comment in the parsed ADM document, and I guess you would not be able to easily generate one using libadm. Same for custom elements; libadm should ignore them, but I would put them in your own namespace (unfortunately there's no standardised namespace for ADM, which is not helpful...).
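A quick demonstration of both points with Python's standard parser (the namespace URI and the custom element are invented):

```python
import xml.etree.ElementTree as ET

xml = """<audioObject audioObjectID="AO_1001" audioObjectName="car"
                      xmlns:game="urn:example:my-game-engine">
  <!-- comments are dropped by the default parser -->
  <audioPackFormatIDRef>AP_00031001</audioPackFormatIDRef>
  <game:entityId>4242</game:entityId>
</audioObject>"""

obj = ET.fromstring(xml)
# No comment nodes survive in the tree:
assert all(isinstance(c.tag, str) for c in obj.iter())
# The namespaced custom element is still there if you ask for it:
print(obj.find("{urn:example:my-game-engine}entityId").text)  # 4242
```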

I would like to make this kind of thing easier with libadm, as well as EAR -- I wrote some thoughts about how this might work a while ago in https://github.com/ebu/ebu_adm_renderer/issues/38

This is a bit trickier for libadm, though, as the XML parser (rapidxml) is a private dependency, and exposing it to the user would cause some headaches (expect lots of warnings on Windows), so this would have to be an optional feature.

If you want a bit more granularity with names, you could use the audioObjectLabel element (added in BS.2076-2) in the libadm -2 branch.