Farama-Foundation / ViZDoom

Reinforcement Learning environments based on the 1993 game Doom :godmode:
https://vizdoom.farama.org/

Suggestion: API/commands for fetching audio #225

Closed Miffyli closed 3 years ago

Miffyli commented 7 years ago

E.g. by setting an "enable_audio" option before initializing the game, and then receiving an additional object in the State object which holds the audio samples played during that time frame.

I know it is *Vi*ZDoom (i.e. vision-focused), but this could allow bots to "home in" on high-action areas and/or hear close-by enemies behind them.

mwydmuch commented 7 years ago

Hi @Miffyli, this idea has been on our minds for some time now and I'd love to add it. However, while it would be easy to just pass the OpenAL buffer as-is to the state, I have no idea whether that would be convenient to work with (unfortunately, I've never done any serious sound processing). So we need some help deciding what things we should take care of and what should be configurable (format? stereo/some 3D sound? channels? frequency? sample rate/size?).

If anyone has any ideas about these things I'll be happy to hear them :)

Miffyli commented 7 years ago

@mwydmuch I have a background in speech processing, but I'm still stuck deciphering the structure of the ViZDoom source ^^'.

Anywho, I do not think we need anything fancy, especially considering Doom was originally intended to run on old machines. I think these would be enough, at least for a start:

And as for what the API would give the user: a 2xN matrix where N is the number of samples played in that state's timeframe. I think "timeframe" = "since the last call of get_state", for simplicity. Users can then build a longer buffer in Python for analyzing longer pieces of audio.
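The Python-side buffering mentioned above could look roughly like the following. This is only a sketch under assumptions: the class name, the 2xN int16 layout, and the sample rate are mine, not ViZDoom API.

```python
import numpy as np

# Hypothetical sketch: accumulate the 2xN audio chunks each state might carry
# into a rolling buffer long enough for analysis (names/dtype are assumptions).
class AudioBuffer:
    """Rolling stereo buffer keeping the most recent `max_samples` samples."""

    def __init__(self, channels=2, max_samples=22050):
        self.max_samples = max_samples
        self.buffer = np.zeros((channels, 0), dtype=np.int16)

    def push(self, chunk):
        # chunk: channels x N samples played during one state's timeframe
        self.buffer = np.concatenate([self.buffer, chunk], axis=1)
        if self.buffer.shape[1] > self.max_samples:
            # keep only the newest max_samples samples
            self.buffer = self.buffer[:, -self.max_samples:]


# Usage: push each state's chunk, then analyze `buffer` as one long clip.
buf = AudioBuffer(max_samples=1000)
for _ in range(10):
    buf.push(np.zeros((2, 300), dtype=np.int16))
print(buf.buffer.shape)  # (2, 1000)
```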

I can create example scenarios/scripts, and generally test the implementation if this is added.

mwydmuch commented 7 years ago

Alright, thank you @Miffyli for the tips! For now I'm pretty busy, but I think I will be able to add this by the end of August, and then I will ask you for some small tests and a review :)

piquirez commented 4 years ago

Hi @mwydmuch: any news on a way to get the sound buffer as part of Doom.get_state()? The research that could be carried out by adding audio is very interesting. If anyone knows of a method to obtain the sound as part of the inputs, it would be much appreciated.

Miffyli commented 4 years ago

@piquirez

I did some further digging on this subject earlier, and I think it hits a roadblock: ZDoom uses the OpenAL library to create the sound samples from sound sources/listeners and their locations. You'd have to start messing around with OpenAL (and its drivers) to hijack these samples at some point before they are fed into a common buffer.

A hacky way to do this would be to create a sound device for each ViZDoom instance and capture the audio there, but syncing this up with frames would be difficult, if not impossible.

piquirez commented 4 years ago

@Miffyli Thanks for your answer. As you describe, it does seem quite complicated. I did notice that the sound plays at the same speed no matter what the speed of the screen inputs is, making the two very hard to sync if you wanted to capture the audio, since the speed during training will differ from inference. However, this gave me an idea: I presume each sound is triggered by a Doom instruction in a particular frame, and we know that in real time Doom should run at 35 FPS. In that case it should be possible to save the audio triggers on each frame, and then divide each sound into small samples at 35 FPS. This way we would obtain one audio sample per frame, which is what we are after. Is this feasible? Even if it were only possible to get the audio triggers per frame, that would be very helpful, and I could sort out the audio-dividing part.
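The per-frame slicing described above can be sketched in a few lines. Assumptions here: ZDoom's fixed 35 tics/s logical rate, and a 22050 Hz sample rate picked so that 22050 / 35 = 630 samples per tic divides evenly (the actual rate ViZDoom would use is not decided in this thread).

```python
import numpy as np

RATE = 22050                             # assumed sample rate
TICS_PER_SEC = 35                        # ZDoom's fixed logical tic rate
SAMPLES_PER_TIC = RATE // TICS_PER_SEC   # 630 samples per game frame


def split_per_tic(sound):
    """Cut a mono clip into SAMPLES_PER_TIC-long chunks, one per tic,
    zero-padding the last chunk so every chunk has the same length."""
    n_tics = -(-len(sound) // SAMPLES_PER_TIC)  # ceiling division
    padded = np.zeros(n_tics * SAMPLES_PER_TIC, dtype=sound.dtype)
    padded[:len(sound)] = sound
    return [padded[i * SAMPLES_PER_TIC:(i + 1) * SAMPLES_PER_TIC]
            for i in range(n_tics)]


# A 1500-sample sound spans 3 tics (2 full chunks + a padded tail).
chunks = split_per_tic(np.ones(1500, dtype=np.int16))
print(len(chunks), len(chunks[0]))  # 3 630
```

Each triggered sound would then contribute one such chunk per game frame, regardless of how fast the engine itself is being stepped.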

Miffyli commented 4 years ago

@piquirez

Theoretically that could work. However, since it would skip the audio library completely, it would not include any of the positional-audio processing (e.g. how strongly audio plays on the left/right, how faint it is). Now that you mention it, the "sped up" game also makes things harder: if you go through the audio library, it (probably) plays sounds at their natural speed, and thus far too slowly for a ZDoom running at light speed (thousands of FPS).

piquirez commented 4 years ago

@Miffyli I believe the stereo information about a sound should be part of the command that executes the sound. So if all sounds are saved in mono, the stereo version would simply be a mathematical relation based on the position of the player. If we have the information of the frame that plays a sound, which should include where it has to play spatially (stereo information), we could create, as you mentioned earlier, a "2xN matrix where N is the number of samples played in that state's timeframe". In this case N would be 1/35th of a second of the audio file. This way, no matter what speed Doom runs at, the agent will always get the same sound synchronized to 35 FPS, which will allow it to learn.
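A minimal sketch of that "mathematical relation": a constant-power pan derived from the source's angle relative to the player. This is a stand-in for what OpenAL actually computes (which also includes distance attenuation, Doppler, etc.); the function name and angle convention are mine.

```python
import numpy as np

def pan_mono(sound, angle):
    """Turn a mono clip into a 2xN stereo matrix via constant-power panning.

    angle is in radians, player-relative:
    -pi/2 = source fully to the left, +pi/2 = fully to the right.
    """
    theta = (angle + np.pi / 2) / 2   # map [-pi/2, pi/2] onto [0, pi/2]
    left = np.cos(theta) * sound
    right = np.sin(theta) * sound
    return np.stack([left, right])    # the 2xN matrix discussed above


stereo = pan_mono(np.ones(630), np.pi / 2)  # source fully to the right
print(np.allclose(stereo[0], 0.0))          # left channel ~silent: True
```

Constant-power panning keeps left² + right² constant, so perceived loudness does not dip as the source moves across the stereo field.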

Miffyli commented 4 years ago

@piquirez

Hmm, you are right, this could work. I am not sure how easy all the "positional audio processing" would be, but the part about providing samples of sounds-being-played should be possible. It is not perfect, but it would be a start.

As for implementing something like this: I am not intimately familiar with that side of ZDoom and do not have time to work on this for at least a couple of months, sadly :(

piquirez commented 4 years ago

@Miffyli A couple of months doesn't sound bad. I don't have much time myself either, but I will research how ZDoom handles sound whenever I get the chance, and hopefully I'll be able to help you if you're interested.

hegde95 commented 4 years ago

Hey, was anyone able to get this working?

Miffyli commented 4 years ago

I have not worked on this since the last posts; my attention shifted to other projects, sadly :( . The above issues are still complex to handle, as playing audio (or sound, as it were) is so tightly tied to our "natural passage" of time.

hegde95 commented 4 years ago

would it be possible to get audio in "real time" by using the fix in #40 ?

mwydmuch commented 4 years ago

Hi @hegde95, as described in #40, the audio can be enabled, so it's certainly possible to obtain it from the OS somehow. On Linux, you can probably access the PulseAudio sink using some Python library. But I guess that's all we know about the topic right now.
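One OS-level route on Linux is recording a sink's monitor source with PulseAudio's `parec` tool. A sketch, assuming PulseAudio is running and `parec` is installed; the monitor-source name below is a placeholder, not one ViZDoom creates.

```python
import subprocess

def build_capture_cmd(monitor_source, out_path, rate=22050, channels=2):
    """Build a parec command line that records a sink's monitor to a WAV file."""
    return [
        "parec",
        "--device=" + monitor_source,   # e.g. "<sink_name>.monitor"
        "--rate=" + str(rate),
        "--channels=" + str(channels),
        "--file-format=wav",
        out_path,
    ]


cmd = build_capture_cmd("default.monitor", "vizdoom_audio.wav")
print(" ".join(cmd))
# On a real system: proc = subprocess.Popen(cmd), then proc.terminate() to stop.
```

Note this captures wall-clock audio, so (per the discussion above) it only lines up with frames when the game runs in async mode at natural speed.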

mwydmuch commented 4 years ago

This approach will require using ViZDoom's async mode to have correctly played audio.

hegde95 commented 4 years ago

So I'm guessing if we have multiple games running in parallel, it won't be possible to isolate the sound produced by each game this way?

Miffyli commented 4 years ago

You can create virtual outputs in PulseAudio, and then with some commands direct a program's audio to the sink you want (I can not find those commands right now). It is doable, but a bit of a mess.
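For the record, that routing can be done with `pactl`. A sketch, assuming PulseAudio and `pactl` are available: one null sink per ViZDoom instance, then the instance's playback stream (its "sink input") is moved onto it; the sink name and stream index below are examples you'd look up yourself.

```python
import subprocess

def sink_setup_cmds(sink_name, sink_input_index):
    """Commands to create a per-instance null sink and route one stream to it."""
    return [
        # create a virtual output; its "<sink_name>.monitor" source can be recorded
        ["pactl", "load-module", "module-null-sink", "sink_name=" + sink_name],
        # route one game's stream there (find indices with `pactl list sink-inputs`)
        ["pactl", "move-sink-input", str(sink_input_index), sink_name],
    ]


for cmd in sink_setup_cmds("vizdoom0", 42):
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # would apply this on a real system
```

With one null sink per instance, each game's audio is isolated and can be captured independently from that sink's monitor source.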

If it does not have to be ViZDoom per se, Unity's ML-Agents can be tuned to include audio in the observations by creating the necessary AudioListeners etc. in the Unity game. We did this in some experiments and it worked quite well.

hegde95 commented 4 years ago

If I had to push the audio buffers collected in async mode to ViZDoomPythonModule.cpp as part of the game state, what changes would I have to make?

Miffyli commented 4 years ago

@mwydmuch Could you provide quick pointers on the above?