dancasimiro / WAV.jl

Julia package for working with WAV files

WAV.jl not reading wav files with incorrect file size in header #109

Open dolphins4all opened 1 year ago

dolphins4all commented 1 year ago

I have come across a WAV file that I cannot read using WAV.jl (or LibSndFile.jl). However, I can open it in just about every other audio program or library that I have tried (Matlab, R, Raven, Audacity, …). If I read the file into Matlab and simply write it back out, the rewritten file reads fine with WAV.jl, so this round trip seems to correct whatever causes the problem.

I compared the problem file with the corrected version created by reading it into Matlab and writing it back out. I am calling the original file "mdoc" and the corrected version "mdoc_rewrite". I looked for differences using hexdump. This is the result (the *.txt files are hexdump output):

$ diff mdoc.txt mdoc_rewrite.txt 
1c1
< 00000000  52 49 46 46 14 22 f9 **01**  57 41 56 45 66 6d 74 20  |RIFF."..WAVEfmt |
---
> 00000000  52 49 46 46 14 22 f9 **61**  57 41 56 45 66 6d 74 20  |RIFF.".aWAVEfmt |
$ 

I believe positions 5-8 hold the RIFF chunk size (little endian), and that is where the two files differ. The size written by Matlab is correct: with 61 in position 8, the chunk size equals the WAV file size minus 8 bytes. I have also tried putting arbitrary numbers in positions 5-8, and Matlab can still read the file.
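As a quick check of the arithmetic, the two size fields from the hexdump lines above can be decoded by hand. This is only an illustrative Python sketch (the thread itself is about Julia); the helper name riff_chunk_size is made up for this example.

```python
import struct

# First 8 bytes of each file, copied from the two hexdump lines in the diff.
original = bytes.fromhex("52 49 46 46 14 22 f9 01")
rewritten = bytes.fromhex("52 49 46 46 14 22 f9 61")

def riff_chunk_size(header: bytes) -> int:
    """Decode the outer RIFF chunk size (bytes 5-8, little endian)."""
    if header[:4] != b"RIFF":
        raise ValueError("not a RIFF file")
    (size,) = struct.unpack("<I", header[4:8])
    return size

size_original = riff_chunk_size(original)    # 0x01f92214
size_rewritten = riff_chunk_size(rewritten)  # 0x61f92214

# The stored size excludes the 8-byte "RIFF" + size prefix,
# so the file length implied by the header is size + 8.
print(size_original + 8, size_rewritten + 8)
```

The original header therefore describes a file of roughly 33 MB, while the rewritten header describes one of roughly 1.6 GB.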

Is this an issue with WAV.jl or an incorrectly written WAV file? Do most programs simply not care whether the file size in the header is correct? Or is there something I'm missing in my understanding of WAV.jl or the WAV file structure?

Thank you - Robert

mgkuhn commented 1 year ago

If the original file was 0x61f92214 + 8 = 1_643_717_148 bytes long, then the first chunk size in the file was clearly wrong, and this would not be a valid RIFF/WAV file. Some implementations may simply ignore the outermost chunk size, but as you can see in function wavread, this implementation starts by reading chunk_size and then uses it as the end condition of the subsequent while chunk_size >= subchunk_header_size loop that parses the rest of the file. So WAV.jl currently depends on chunk_size having been filled in correctly.
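To illustrate the difference between the two strategies, here is a simplified Python sketch. It is not WAV.jl's code: list_subchunks, make_wav and the trust_outer_size flag are hypothetical names invented for this example, and the synthetic file is far smaller than the one in this issue.

```python
import struct

def list_subchunks(buf: bytes, trust_outer_size: bool = True):
    """Return [(chunk_id, payload_size), ...] for a RIFF/WAVE byte buffer."""
    if buf[:4] != b"RIFF" or buf[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE buffer")
    (outer_size,) = struct.unpack("<I", buf[4:8])
    # The outer size counts everything after the first 8 bytes.
    end = 8 + outer_size if trust_outer_size else len(buf)
    end = min(end, len(buf))
    pos = 12  # skip "RIFF", the size field and "WAVE"
    found = []
    while pos + 8 <= end:
        chunk_id = buf[pos:pos + 4]
        (size,) = struct.unpack("<I", buf[pos + 4:pos + 8])
        found.append((chunk_id, size))
        pos += 8 + size + (size & 1)  # subchunks are padded to even length
    return found

def make_wav(declared_outer_size=None) -> bytes:
    """Build a tiny mono 16-bit PCM file, optionally with a wrong outer size."""
    fmt = struct.pack("<HHIIHH", 1, 1, 8000, 16000, 2, 16)
    data = bytes(100)
    body = (b"WAVE"
            + b"fmt " + struct.pack("<I", len(fmt)) + fmt
            + b"data" + struct.pack("<I", len(data)) + data)
    size = len(body) if declared_outer_size is None else declared_outer_size
    return b"RIFF" + struct.pack("<I", size) + body

broken = make_wav(declared_outer_size=28)  # header claims the file ends after "fmt "
strict = list_subchunks(broken, trust_outer_size=True)
lenient = list_subchunks(broken, trust_outer_size=False)
print(strict, lenient)
```

With the understated size, the strict walk stops before it reaches the data subchunk, while the lenient walk runs to the end of the buffer and finds it. A reader that relies on end-of-file in this way cannot work on a stream that never signals its end.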

You could write yourself a very simple tool to correct such broken files or, even better, fix the source of these files. How was it produced? To decide whether it is worth making WAV.jl more tolerant of files with an incorrect outermost chunk size, it would be useful to know whether this is a very common situation.
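Such a repair tool could look roughly like the following Python sketch. It assumes the only defect is the outer RIFF size and that the correct value is the file length minus 8, as discussed above; fix_riff_size is a hypothetical helper, not part of any library.

```python
import os
import struct

def fix_riff_size(path: str) -> bool:
    """Set the outer RIFF chunk size to (file size - 8), in place.

    Returns True if the header was changed, False if it was already correct.
    """
    actual = os.path.getsize(path) - 8
    if actual > 0xFFFFFFFF:
        raise ValueError("file too large for a 32-bit RIFF size field")
    with open(path, "r+b") as f:
        header = f.read(12)
        if len(header) < 12 or header[:4] != b"RIFF" or header[8:12] != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        (stored,) = struct.unpack("<I", header[4:8])
        if stored == actual:
            return False
        f.seek(4)  # the size field follows the "RIFF" tag
        f.write(struct.pack("<I", actual))
    return True
```

Because the file is modified in place, it is safer to run this on a copy. The sketch does not touch the size field of the data subchunk, which may need the same kind of correction if the recording was cut short.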

dolphins4all commented 1 year ago

@mgkuhn Thank you very much for the response. The source was the internal audio storage of a hydrophone. As to whether this is a common issue, I have no idea. It seems the typical applications in my field (marine bioacoustics) ignore the outer chunk size, but I have only tried one group of files from a particular hydrophone. I will contact the developer and let them know.

I have several datasets from different sources that I can check to see if this is an isolated case and let you know.

Should WAV.jl produce some kind of error message when this occurs? It currently just produces an empty array. Then again, this may depend on whether this is a common issue.

mgkuhn commented 1 year ago

I got the impression that WAV.jl was originally written such that it can also parse files in a streaming fashion, in situations where there may be no end-of-file being signalled at the end, and in such situations it may have to rely on the outer chunk size to know where the file ends.

If a process writes a WAV file while data is being recorded, it may not know from the beginning how long the file is going to be. In such situations, at the end of the recording the process needs to seek back to near the start of the file to update the chunk size there. If a recording was aborted in some uncontrolled way (e.g., your hydrophone lost power or someone pulled out the storage medium without stopping the recording first), that final adjustment of the chunk size may never have happened, and you end up with a slightly corrupted file.
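The write-then-patch pattern described here can be sketched as follows. This is a generic Python illustration, not the hydrophone's firmware: record_wav is a hypothetical function, the zero placeholder is an assumption (a real recorder may write some other provisional value), and the offsets 4 and 40 apply only to the 44-byte PCM header layout written below.

```python
import io
import struct

def record_wav(out, blocks, sample_rate=8000):
    """Write 16-bit mono PCM blocks to a seekable binary stream `out`."""
    fmt = struct.pack("<HHIIHH", 1, 1, sample_rate, sample_rate * 2, 2, 16)
    out.write(b"RIFF" + struct.pack("<I", 0) + b"WAVE")  # outer size not yet known
    out.write(b"fmt " + struct.pack("<I", len(fmt)) + fmt)
    out.write(b"data" + struct.pack("<I", 0))            # data size not yet known
    data_bytes = 0
    for block in blocks:  # the recording is in progress
        out.write(block)
        data_bytes += len(block)
    # Final adjustment: never reached if the recorder loses power mid-recording.
    out.seek(4)
    out.write(struct.pack("<I", 36 + data_bytes))  # total length minus 8
    out.seek(40)
    out.write(struct.pack("<I", data_bytes))
    out.seek(0, io.SEEK_END)

buf = io.BytesIO()
record_wav(buf, [b"\x00\x00" * 10, b"\x01\x00" * 5])
raw = buf.getvalue()
print(len(raw), struct.unpack("<I", raw[4:8])[0])
```

If the loop is interrupted, both size fields keep their provisional values even though the audio data written so far is intact, which is why lenient readers can still play such files.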