popcornell closed this 10 months ago
Awesome! I'll merge once you can verify that the performance remains unchanged (which I believe it should) :)
There seems to be some inconsistency when I change the number of GPUs (I don't think it depends on this PR, however). It seems that the more GPUs I use, the higher the memory occupation?!
With 3 GPUs:
2023-11-15:20:42:15,574 INFO [enhancer.py:207] Processing batch 1 ('S26', 'P29'): 1 segments = 8.89s (total: 0 segments)
2023-11-15:20:42:20,890 WARNING [enhancer.py:247] Out of memory error while processing the batch. Trying again with 2 chunks.
2023-11-15:20:42:32,172 INFO [enhancer.py:207] Processing batch 2 ('S26', 'P30'): 1 segments = 18.55s (total: 1 segments)
2023-11-15:20:42:32,706 WARNING [enhancer.py:247] Out of memory error while processing the batch. Trying again with 2 chunks.
2023-11-15:20:42:32,863 WARNING [enhancer.py:247] Out of memory error while processing the batch. Trying again with 3 chunks.
2023-11-15:20:42:46,44 INFO [enhancer.py:207] Processing batch 3 ('S26', 'P31'): 1 segments = 63.23s (total: 2 segments)
2023-11-15:20:42:46,350 WARNING [enhancer.py:247] Out of memory error while processing the batch. Trying again with 2 chunks.
2023-11-15:20:42:46,589 WARNING [enhancer.py:247] Out of memory error while processing the batch. Trying again with 3 chunks.
2023-11-15:20:42:46,810 WARNING [enhancer.py:247] Out of memory error while processing the batch. Trying again with 4 chunks.
2023-11-15:20:42:47,21 WARNING [enhancer.py:247] Out of memory error while processing the batch. Trying again with 5 chunks.
2023-11-15:20:43:04,114 INFO [enhancer.py:207] Processing batch 4 ('S26', 'P32'): 1 segments = 15.86s (total: 3 segments)
With 2 GPUs:
2023-11-15:20:46:13,89 INFO [enhancer.py:207] Processing batch 1 ('S26', 'P29'): 1 segments = 8.89s (total: 0 segments)
2023-11-15:20:46:20,761 INFO [enhancer.py:207] Processing batch 2 ('S26', 'P30'): 1 segments = 18.55s (total: 1 segments)
2023-11-15:20:46:30,84 INFO [enhancer.py:207] Processing batch 3 ('S26', 'P31'): 1 segments = 63.23s (total: 2 segments)
2023-11-15:20:46:30,625 WARNING [enhancer.py:247] Out of memory error while processing the batch. Trying again with 2 chunks.
2023-11-15:20:46:30,859 WARNING [enhancer.py:247] Out of memory error while processing the batch. Trying again with 3 chunks.
2023-11-15:20:46:31,79 WARNING [enhancer.py:247] Out of memory error while processing the batch. Trying again with 4 chunks.
2023-11-15:20:46:42,373 INFO [enhancer.py:207] Processing batch 4 ('S26', 'P32'): 1 segments = 15.86s (total: 3 segments)
With 1 GPU:
2023-11-15:20:44:19,322 INFO [enhancer.py:207] Processing batch 1 ('S26', 'P29'): 1 segments = 8.89s (total: 0 segments)
2023-11-15:20:44:26,218 INFO [enhancer.py:207] Processing batch 2 ('S26', 'P30'): 1 segments = 18.55s (total: 1 segments)
2023-11-15:20:44:30,689 INFO [enhancer.py:207] Processing batch 3 ('S26', 'P31'): 1 segments = 63.23s (total: 2 segments)
2023-11-15:20:44:38,79 INFO [enhancer.py:207] Processing batch 4 ('S26', 'P32'): 1 segments = 15.86s (total: 3 segments)
That's strange. I have never seen this happen before. Can you check if your GPUs are configured to not share memory?
Yep, they were in DEFAULT mode. I changed them to exclusive mode and it does not happen anymore. Maybe I should put a check into the code for the GPUs' compute mode?
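For reference, switching the mode itself is a one-liner with nvidia-smi (it requires root, so this is only an option on machines you administer, not on shared clusters):

# Switch all GPUs to exclusive compute mode (needs root; resets on reboot).
sudo nvidia-smi -c EXCLUSIVE_PROCESS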
Something to prevent it would be great. I had this issue in CHiME-7. It would be great to have the check work so that the user doesn't have to change the mode themselves.
I think it should be sufficient to add this to the README (perhaps as an FAQ), instead of restricting certain modes in the processing. OOM issues can happen for a variety of reasons, such as GPU memory not being cleared by a previously running process or misconfigured nodes, and we cannot expect to solve all such problems.
I think I can grep the compute mode from nvidia-smi -q, but I'm not sure whether this will work on all clusters out there. I can, however, add an additional arg to disable this check, with a big warning.
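A minimal sketch of such a check, assuming nvidia-smi's --query-gpu interface is available (which, as noted above, may not hold on every cluster):

# Warn if any GPU is not in EXCLUSIVE_PROCESS compute mode (sketch only).
if nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader \
    | grep -qv Exclusive_Process; then
  echo "WARNING: some GPUs are not in EXCLUSIVE_PROCESS compute mode" >&2
fi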
You could put these instructions in the README, so that users running the code can check. No need to add it in the code, IMO.
Discussing offline with @boeddeker, we added a new utility called gpu_check. The idea is to use it like this (e.g. in the CHiME-7 asr1 recipe):
$cmd JOB=1:$nj ${exp_dir}/${dset_name}/${dset_part}/log/enhance.JOB.log \
gss utils gpu_check $nj $cmd \& gss enhance cuts \
${exp_dir}/${dset_name}/${dset_part}/cuts.jsonl.gz ${exp_dir}/${dset_name}/${dset_part}/split$nj/cuts_per_segment.JOB.jsonl.gz \
${exp_dir}/${dset_name}/${dset_part}/enhanced \
--bss-iterations $gss_iterations \
--context-duration 15.0 \
--use-garbage-class \
--min-segment-length 0.0 \
--max-segment-length $max_segment_length \
--max-batch-duration $max_batch_duration \
--max-batch-cuts 1 \
--num-buckets 4 \
--num-workers 4 \
--force-overwrite \
--duration-tolerance 3.0 \
${affix} || exit 1
However, used like this it will not exit when gpu_check raises an exception. My bash is bad; do you know how to make it exit?
The exit should work if any of the jobs fails. But I think this whole GPU check thing is overkill. GPU memory issues can happen in any program, and I don't see why it needs to be included in this repo specifically. I can add it if you are using it in ESPnet, but I personally think this is not the right place to solve this issue.
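For what it's worth, one way to propagate a failure from the backgrounded check in plain bash (a sketch under the assumption that both commands run in the same shell; the $cmd/run.pl wrapper in the recipe complicates this) is to keep the background PID and wait on it:

# Run gpu_check in the background and remember its PID.
gss utils gpu_check $nj $cmd &
check_pid=$!
# Run the main job; abort on failure.
gss enhance cuts ... || exit 1  # arguments elided for brevity
# wait returns gpu_check's exit status; abort if it raised an exception.
wait $check_pid || exit 1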
In the meantime, I confirm I get the same results with the old version on CHiME-7:
###################################################
### Metrics for all Scenarios ###
###################################################
+----+------------+---------------+---------------+----------------------+----------------------+--------+-----------------+-------------+--------------+----------+
| | scenario | num spk hyp | num spk ref | tot utterances hyp | tot utterances ref | hits | substitutions | deletions | insertions | wer |
|----+------------+---------------+---------------+----------------------+----------------------+--------+-----------------+-------------+--------------+----------|
| 0 | chime6 | 8 | 8 | 6644 | 6644 | 42884 | 11672 | 4325 | 3107 | 0.324451 |
| 0 | dipco | 20 | 20 | 3673 | 3673 | 22175 | 5817 | 1974 | 2210 | 0.333745 |
| 0 | mixer6 | 118 | 118 | 14804 | 14804 | 126632 | 15991 | 6358 | 7815 | 0.202469 |
+----+------------+---------------+---------------+----------------------+----------------------+--------+-----------------+-------------+--------------+----------+
####################################################################
### Macro-Averaged Metrics across all Scenarios (Ranking Metric) ###
####################################################################
+----+---------------+---------------+---------------+----------------------+----------------------+--------+-----------------+-------------+--------------+----------+
| | scenario | num spk hyp | num spk ref | tot utterances hyp | tot utterances ref | hits | substitutions | deletions | insertions | wer |
|----+---------------+---------------+---------------+----------------------+----------------------+--------+-----------------+-------------+--------------+----------|
| 0 | macro-average | 48.6667 | 48.6667 | 8373.67 | 8373.67 | 63897 | 11160 | 4219 | 4377.33 | 0.286888 |
+----+---------------+---------------+---------------+----------------------+----------------------+--------+-----------------+-------------+--------------+----------+
Added some lines to the README.md.
Most of the code comes from @boeddeker. He also raised the issue here: https://github.com/desh2608/gss/issues/33
I am re-running the code on CHiME-7 to see if it matches the previous version.