BUTSpeechFIT / EEND_dataprep

Weird overlap ratio for simulation dataset. #3

Closed: C-ra-zy-97 closed this issue 1 year ago

C-ra-zy-97 commented 1 year ago

Recently, I used the v1 recipe to generate simulated data from SRE and Switchboard. I used callhome1_spk2 to estimate the statistics (overlap ratio: 13.529%). However, after the simulation, I found that the overlap ratio of the simulated dataset is 20.134%. How should I debug this problem?

fnlandini commented 1 year ago

That is quite strange. I have not used the callhome1_spk2 set, but with callhome1 I obtained the attached files at this step in the recipe. You can perhaps use them and check what you obtain. Besides, I have not calculated the overlap ratio for any of the sets but rather the percentage of time in the recordings when there is overlap, as stated in Table 1. Plus, I used callhome1 (all numbers of speakers, not only 2). Still, I would expect the overlap ratios of the real and estimated sets to be similar.

diff_spk_overlap.txt diff_spk_pause.txt diff_spk_pause_vs_overlap.txt newspk_samespk_pause_distribution_overlap_distribution.txt overlaps_info.txt same_spk_pause.txt

C-ra-zy-97 commented 1 year ago

Thanks for the very quick reply. There are indeed some differences. I have attached the related files below; can you help me debug this? If you have no time, can you send me the rttm file for callhome? Then I can check whether there is any difference.

callhome1_spkall_rttm.txt diff_spk_overlap.txt diff_spk_pause_vs_overlap.txt diff_spk_pause.txt newspk_samespk_pause_distribution_overlap_distribution.txt overlaps_info.txt same_spk_pause.txt

fnlandini commented 1 year ago

I am afraid I cannot share the rttm because it does not have a free license :/ You could run the whole data generation pipeline using these statistics, but I understand that takes time. My recommendation is to plot your files and mine as distributions so that you can get an idea of whether they differ too much.
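Before plotting, a quick numeric summary of two statistics files can already show whether they diverge. The following is only a sketch: the function name `summarize` is hypothetical, and it assumes the values (for example pause or overlap durations) have already been read into plain Python lists, since the exact layout of the attached files is not described in this thread.

```python
import statistics


def summarize(values):
    """Return a few summary statistics for a list of numbers (needs >= 2 values)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }


# Made-up numbers for illustration only; replace with values loaded from
# your own file and from the reference file.
mine = [0.2, 0.5, 0.9, 1.4, 2.0]
reference = [0.1, 0.4, 0.8, 1.1, 1.6]

for name, vals in (("mine", mine), ("reference", reference)):
    print(name, summarize(vals))
```

If the medians and quartiles of the two files are close, the distributions are unlikely to explain a large gap in the resulting overlap.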

C-ra-zy-97 commented 1 year ago

Actually, the CALLHOME dataset is not public, but the rttm file is public and can be downloaded from http://www.openslr.org/resources/10/sre2000-key.tar.gz. Besides, I have provided my callhome1 rttm file above. Can you help me check it?

fnlandini commented 1 year ago

Sorry, I did not notice you had shared the rttms in the previous message. I have run a diff between the rttms I used and these. Besides trailing 0's, the only timing differences are in a few segments, due to rounding (see attached picture), and this line: SPEAKER iait 1 350.72 0.00 <NA> <NA> iait_A <NA> was in your file, but it has zero length so it should not matter. Overall, the rttms are the same.

[Attached image: diff1]
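A plain text diff flags trailing zeros and rounding as differences, so a tolerance-based comparison can be easier to read. The sketch below is not the comparison used in this thread: `parse` and `diff_rttm` are hypothetical helpers, the tolerance of 0.02 s is an arbitrary assumption, and all example lines except the zero-length one quoted above are made up.

```python
def parse(lines):
    """Return sorted (recording, speaker, start, duration) tuples from
    RTTM SPEAKER lines, skipping zero-length segments."""
    segs = []
    for line in lines:
        f = line.split()
        if len(f) < 8 or f[0] != "SPEAKER":
            continue
        start, dur = float(f[3]), float(f[4])
        if dur > 0:
            segs.append((f[1], f[7], start, dur))
    return sorted(segs)


def diff_rttm(a_lines, b_lines, tol=0.02):
    """Pair segments in sorted order and return the pairs that differ by more
    than `tol` seconds. Assumes rounding does not change the sort order."""
    a, b = parse(a_lines), parse(b_lines)
    if len(a) != len(b):
        raise ValueError(f"segment counts differ: {len(a)} vs {len(b)}")
    return [
        (x, y) for x, y in zip(a, b)
        if x[:2] != y[:2] or abs(x[2] - y[2]) > tol or abs(x[3] - y[3]) > tol
    ]


rttm_a = [
    "SPEAKER iait 1 350.72 0.00 <NA> <NA> iait_A <NA>",  # zero length, ignored
    "SPEAKER iait 1 10.50 2.00 <NA> <NA> iait_A <NA>",
    "SPEAKER iait 1 20.00 1.00 <NA> <NA> iait_B <NA>",
]
rttm_b = [
    "SPEAKER iait 1 10.5 2.01 <NA> <NA> iait_A <NA>",  # rounding only
    "SPEAKER iait 1 20 3 <NA> <NA> iait_B <NA>",       # real difference
]
mismatches = diff_rttm(rttm_a, rttm_b)
print(len(mismatches), "segment(s) differ beyond tolerance")
```

Only the second pair is reported here, because its duration differs by 2 s while the first pair differs by rounding alone.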

C-ra-zy-97 commented 1 year ago

The distributions: [attached images: same-spk-pause, new-spk-pause, overlap]

C-ra-zy-97 commented 1 year ago

By the way, can you share the code that you used to calculate the overlap ratio in the paper?

fnlandini commented 1 year ago

As I said, it is NOT overlap ratio but the percentage of overlap over the length of the file. The code is here: https://github.com/BUTSpeechFIT/diarization_utils/blob/main/compute_stats.py You would need to sum the percentages for the categories 2, 3 and 4-or-more simultaneous speakers to get the "overlap" I report in the table.

I hope this helps
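The two quantities discussed here can give quite different numbers for the same recording. The sketch below is not the linked compute_stats.py; it is a minimal illustration with hypothetical helper names. It assumes that "overlap ratio" means overlapped time divided by total speech time (the thread does not say which definition was used), that the file length is supplied by the caller or approximated by the last segment end, and that overlapping segments always belong to different speakers.

```python
from collections import defaultdict


def parse_rttm(lines):
    """Return {recording: [(start, end, speaker), ...]} from RTTM SPEAKER lines."""
    segments = defaultdict(list)
    for line in lines:
        f = line.split()
        if len(f) < 8 or f[0] != "SPEAKER":
            continue
        start, dur = float(f[3]), float(f[4])
        if dur > 0:  # zero-length segments contribute nothing
            segments[f[1]].append((start, start + dur, f[7]))
    return segments


def time_by_speaker_count(segs):
    """Sweep over segment boundaries; return {n_active_speakers: seconds}."""
    events = sorted([(s, 1) for s, _, _ in segs] + [(e, -1) for _, e, _ in segs])
    totals = defaultdict(float)
    active, prev = 0, None
    for t, delta in events:
        if prev is not None and t > prev:
            totals[active] += t - prev
        active += delta
        prev = t
    return totals


def overlap_stats(segs, file_length=None):
    """Return (overlap % of file length, overlap % of speech time)."""
    totals = time_by_speaker_count(segs)
    overlap = sum(d for n, d in totals.items() if n >= 2)  # 2, 3, 4-or-more
    speech = sum(d for n, d in totals.items() if n >= 1)
    if file_length is None:
        file_length = max(end for _, end, _ in segs)  # rough approximation
    return 100 * overlap / file_length, 100 * overlap / speech


# Made-up example: a 10 s file with two speakers overlapping from 4 s to 6 s.
example = [
    "SPEAKER rec 1 0.00 6.00 <NA> <NA> A <NA> <NA>",
    "SPEAKER rec 1 4.00 4.00 <NA> <NA> B <NA> <NA>",
]
segs = parse_rttm(example)["rec"]
pct_of_file, pct_of_speech = overlap_stats(segs, file_length=10.0)
print(f"{pct_of_file:.1f}% of the file, {pct_of_speech:.1f}% of the speech")
```

In this toy case there are 2 s of overlap in a 10 s file containing 8 s of speech, so the same recording yields 20% under one definition and 25% under the other. When comparing a real set with a simulated one, both numbers need to come from the same definition.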

C-ra-zy-97 commented 1 year ago

Thank you so much. I'm using your statistics to generate data. I'll let you know the results once I'm done generating them.

C-ra-zy-97 commented 1 year ago

When I use your statistics, the overlap ratio for simulation data is correct. Pretty strange.

fnlandini commented 1 year ago

So basically the difference is between using Callhome part 1 and using only the 2-speaker files of Callhome part 1, correct? I do not remember having analyzed the statistics of those two sets, but maybe they differ substantially (even though one is a subset of the other).

C-ra-zy-97 commented 1 year ago

Sorry for yesterday's conclusion. Today I repeated all the experiments from scratch: I deleted all intermediate results and regenerated all the files, and the final result is correct. In fact, the simulated datasets generated from the call1_spkall and call1_spk2 statistics are similar. Yesterday I ran into errors in specific steps and reran them many times, so something may have gone wrong in that process. Anyway, thank you very much for your prompt replies. Great work, thanks for contributing code!!!!

someonefighting commented 5 months ago

Hi fnlandini, why are there negative pauses in same_spk_pause.txt? It is weird. I also noticed that the code allows negative intervals between segments of the same speaker. (https://github.com/BUTSpeechFIT/EEND_dataprep/files/10909638/same_spk_pause.txt)

fnlandini commented 5 months ago

Hi @someonefighting. Yes, there should not be negative pauses there. I suspect there could be some error in the annotations or in the code, but it would need a more careful analysis. I might be able to analyze it at some point but not right now. You are right that the code of conv_generator could be updated to discard negative values.
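As a stopgap until the cause is understood, negative values could be filtered out before the statistics are used. This is only a sketch with a hypothetical helper name, not the actual conv_generator code, and it assumes the pauses are already available as a list of numbers.

```python
def drop_negative(values):
    """Return (non-negative values, number of discarded negative values)."""
    kept = [v for v in values if v >= 0]
    return kept, len(values) - len(kept)


# Made-up pause durations in seconds, for illustration only.
pauses = [0.35, -0.02, 1.10, 0.00, -0.40, 0.75]
clean, n_dropped = drop_negative(pauses)
print(f"kept {len(clean)} values, dropped {n_dropped} negative ones")
```

Reporting the number of discarded values makes it easier to judge whether the negatives are rare annotation glitches or a sign of a systematic problem.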

someonefighting commented 5 months ago

Anyway, thanks for your great work!