Open royrs opened 2 months ago
If I understand correctly, you are saying that the detection is robust to cropping but the message extraction is not?
We mainly focused on detection in our paper and not so much on the message hiding part, so it is possible that we missed this.
Indeed, the detection is quite robust, giving results close to 1. However, to extract the message itself correctly you need to have the first 100 samples. Thus, cropping out even a small part from the start of the audio will extract the wrong message (though detection will still work, usually with a value > 0.9).
After playing with it, I found other cases that work, but I couldn't find a pattern for when it works. For example:
Interesting! I wonder why the behavior is different between detection and decoding actually, since the objectives are trained jointly. Are you doing the experiments with only one fixed message? I found that depending on the message, the robustness can be quite variable in my case.
The experiments I did were with one specific message so far. However, I have now repeated them with different messages.
I tried taking different parts of the watermarked audio for 1000 different randomized messages. (All the figures below show the number of messages that got each value.)
When I remove only the first 5 samples, the detection works almost perfectly for all messages (value ~0.999), and the % of bits predicted correctly is distributed as follows:
When removing 10 samples, the detection still works well, with some messages getting values of ~0.95. The bit accuracy goes down (it never predicts the entire message correctly):
As I keep removing samples, up to 100, things get worse. Only 50% of messages have detection > 0.9; see the distribution below: And the bit accuracy:
Interestingly, the results are completely different if we specifically take samples 16000-32000 - they improve significantly. We predict the message almost perfectly with just a couple of messages predicting only 1 bit wrong. In addition, the detection is almost always > 0.9:
Can you post your code? How much audio are you decoding at a time? I'm super curious about this, because getting bits wrong kills the usability (at least for me), and I've noticed similar behavior, or at least, wrong messages.
I agree about the usability, so if you find a solution, it would be great if you could post an update.
I kept the code as simple as the example in their README. See below the code for the histograms above (change 16000/32000 to choose which part to crop).
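import torch
import soundfile
from audioseal import AudioSeal

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Model names as in the AudioSeal README
encoder = AudioSeal.load_generator('audioseal_wm_16bits').to(device)
detector = AudioSeal.load_detector('audioseal_detector_16bits').to(device)

wav, sr = soundfile.read('example.wav')  # mono file expected
# Shape (batch=1, channels=1, samples)
wav = torch.from_numpy(wav).unsqueeze(0).unsqueeze(0).to(torch.float).to(device)

def add_wm(audio, msg):
    # Embed msg as an additive watermark
    watermark = encoder.get_watermark(audio, sample_rate=sr, message=msg)
    return audio + watermark

def detect(msg, wm_audio, start, end):
    # Run the detector on the cropped window only
    score, dmsg = detector.detect_watermark(wm_audio[:, :, start:end], sr)
    acc = (dmsg == msg).float().mean()  # fraction of correctly decoded bits
    return score, acc

scores = []
accuracy = []
with torch.no_grad():
    for _ in range(1000):
        msg = torch.randint(0, 2, (1, 16)).to(device).float()
        wm_audio = add_wm(wav, msg)
        score, acc = detect(msg, wm_audio, 16000, 32000)
        scores.append(float(score))  # works whether score is a float or a 0-d tensor
        accuracy.append(acc.item())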
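For what it's worth, the `scores` and `accuracy` lists collected by the loop above could be summarized along these lines. This is only a sketch: `summarize`, `n_bits`, and `det_threshold` are names made up for illustration, and the defaults just mirror the 16-bit messages and the 0.9 detection level mentioned in this thread.

```python
from collections import Counter


def summarize(scores, accuracy, n_bits=16, det_threshold=0.9):
    """Summarize per-message detection scores and bit accuracies.

    `n_bits` and `det_threshold` are illustrative assumptions, not values
    taken from the library.
    """
    if len(scores) != len(accuracy):
        raise ValueError("scores and accuracy must have the same length")
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one result")
    # Convert each fractional accuracy back to a count of correct bits.
    correct_bits = [round(a * n_bits) for a in accuracy]
    histogram = Counter(correct_bits)
    return {
        # Share of messages whose detection score exceeds the threshold.
        "detected_fraction": sum(s > det_threshold for s in scores) / n,
        # Share of messages decoded with every bit correct.
        "exact_message_fraction": histogram[n_bits] / n,
        # Number of messages per count of correct bits.
        "correct_bits_histogram": dict(sorted(histogram.items())),
    }
```

Calling `summarize(scores, accuracy)` after the loop gives the fraction of messages detected above the threshold, the fraction decoded exactly, and the counts behind a bit-accuracy histogram.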
I will say I've noticed it does better if the window you're scanning on matches the window it was embedded on. For instance, if I take an audio file, and break it into 10s chunks, then scanning in 10s chunks also performs the best.
Not so much that it gives the correct message consistently, or even a majority of the time, though. I think that has to do with me watermarking music, as opposed to the spoken word it was trained on.
From my experiments, it doesn't necessarily work best with the same window. For example, watermark 1 s of audio, take only part of it, and add zeros before it to get back to a 1 s segment (a likely situation if any small part of the audio has been cropped out). If I do this with the examples above, it works better to add zeros and detect on the 1 s segment than to detect on just the watermarked part. However, if I take some random number such as 9736, it detects better when looking at only the watermarked part rather than a full 1 s segment.
I tried working with spoken data, and there is no difference in terms of message extraction. Moreover, if there are silent parts, it will sometimes add a "beeping" sound.
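To make the zero-padding idea concrete, here is a minimal sketch on plain Python lists. `pad_crop_to_length` is a hypothetical helper written for this comment, not something from AudioSeal; it places the cropped samples back at their original position inside a zero-filled segment of the original length.

```python
def pad_crop_to_length(crop, offset, total_len):
    """Return `crop` placed at `offset` inside a zero-filled signal of length `total_len`.

    Hypothetical helper for illustration; samples are plain floats in a list.
    """
    crop = list(crop)
    if offset < 0 or offset + len(crop) > total_len:
        raise ValueError("crop does not fit at this offset")
    # Zeros before the crop, the crop itself, then zeros up to total_len.
    trailing = total_len - offset - len(crop)
    return [0.0] * offset + crop + [0.0] * trailing
```

For a torch tensor, `torch.nn.functional.pad(x, (offset, 0))` should give the same left padding on the last dimension.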
I've tried working with the given checkpoints and noticed that the important part of the watermark is contained in the first 100 samples, while the rest doesn't really matter.
I did tests with two different audios (see attached). audios.zip
I'm following the example in the README: first encoding some random msg, then detecting it. When I pass the entire watermarked audio, it extracts the msg perfectly. However, it keeps succeeding perfectly even if I pass the detector only the first 100 samples (it fails for fewer). In contrast, if I remove the first 10 samples, it fails to extract the message (result=1, but the extracted msg is wrong).
This basically means that the first 100 samples are always the important part for extracting the message, and removing even some of them makes extraction fail. Therefore, any edit that removes the first samples will cause the msg to be lost.
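The check described above can be written as a small sweep over start offsets. This is a sketch under assumptions: `sweep_start_crops` and `detect_fn` are hypothetical names, and `detect_fn(samples)` stands for any callable that returns a detection score and the decoded bits (for example, a thin wrapper around the detector). It works on a 1-D sequence; for tensors shaped (1, 1, N) the slice would have to be taken on the last dimension instead.

```python
def sweep_start_crops(wm_audio, msg_bits, detect_fn, offsets):
    """For each offset, drop the first `offset` samples and re-run detection.

    `detect_fn(samples)` is a hypothetical callable returning
    (detection_score, decoded_bits); it is not part of any library.
    """
    rows = []
    for off in offsets:
        score, decoded = detect_fn(wm_audio[off:])
        # Count the decoded bits that match the embedded message.
        matches = sum(int(a) == int(b) for a, b in zip(decoded, msg_bits))
        rows.append({
            "offset": off,
            "score": float(score),
            "bit_acc": matches / len(msg_bits),
        })
    return rows
```

Running it with offsets such as `[0, 5, 10, 100]` gives one row per crop, which makes it easy to see where the score stays high while the bit accuracy drops.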
Is there something I can do to make it more robust to cropping? Or do I need to fine-tune the model to solve this issue?