FrenchKrab / IS2023-powerset-diarization

Official repository for the "Powerset multi-class cross entropy loss for neural speaker diarization" paper published in Interspeech 2023.

Pickling Error when running without test set. #6

Closed: Ashh-Z closed this issue 7 months ago

Ashh-Z commented 7 months ago

Reference for code: https://colab.research.google.com/drive/1S7ayat76N-xluD4gvN958O7QCpW8-u0l?usp=sharing

Reference for setting up database.yml: https://github.com/pyannote/AMI-diarization-setup/tree/main

I am trying to fine-tune the pyannote speaker diarization model with the powerset loss on my custom dataset, but I don't wish to use a validation set and have tried to disable early stopping. I have done this by setting the parameter num_sanity_val_steps = 0 for the pytorch_lightning Trainer.

My database.yml file (finetune.yml):

Databases:
   dis: dev_true\AUDIO_supervised\SD\{uri}.wav

Protocols:
    dis:
       SpeakerDiarization:
          ash:
            train:
               uri: dev_lst\dev_lst.txt
               annotation: dev_true_label\Labels\SD\{uri}_SPEAKER.rttm
               annotated: dev_uems\{uri}.uem
            # test :
            #    uri: dummy_eval_lst\dummy_eval_lst.txt
            #    annotation: dummy_eval_rttm\{uri}.rttm
            #    annotated: dummpy_eval_uem\{uri}.uem
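For context, loading a custom protocol like this is typically done through the pyannote.database registry. A minimal sketch, assuming pyannote.database >= 5.x and the finetune.yml / dis.SpeakerDiarization.ash names used above:

from pyannote.database import registry, FileFinder

# register the custom database described in finetune.yml
registry.load_database("finetune.yml")

# resolve {uri} to the wav path declared under Databases
protocol = registry.get_protocol(
    "dis.SpeakerDiarization.ash",
    preprocessors={"audio": FileFinder()},
)

# iterate over the training files to check that the protocol resolves
for file in protocol.train():
    print(file["uri"])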

Directory setup: (screenshot attached in the original issue)

Code block:

from types import MethodType
from torch.optim import Adam
from pytorch_lightning.callbacks import (
    EarlyStopping,
    ModelCheckpoint,
    RichProgressBar,
)
from pytorch_lightning import Trainer

# we use Adam optimizer with 1e-4 learning rate
def configure_optimizers(self):
    return Adam(self.parameters(), lr=1e-4)

segmentation_model.configure_optimizers = MethodType(configure_optimizers, segmentation_model)

# we monitor diarization error rate on the validation set
# and use it to keep the best checkpoint and stop early
monitor, direction = segmentation_model.task.val_monitor
checkpoint = ModelCheckpoint(
    monitor=monitor,
    mode=direction,
    save_top_k=1,
    every_n_epochs=1,
    save_last=False,
    save_weights_only=False,
    filename="{epoch}",
    verbose=False,
)
early_stopping = EarlyStopping(
    monitor=monitor,
    mode=direction,
    min_delta=0.0,
    patience=10,
    strict=True,
    verbose=False,
)

# callbacks = [RichProgressBar(), checkpoint, early_stopping]
callbacks = [RichProgressBar(), checkpoint]

# we train for at most 20 epochs (might be shorter in case of early stopping)
# trainer = Trainer(accelerator="gpu", 
#                   callbacks=callbacks, 
#                   max_epochs=20,
#                   gradient_clip_val=0.5)

trainer = Trainer(accelerator="gpu", 
                 callbacks=callbacks, 
                 max_epochs=20,
                 gradient_clip_val=0.5,
                 num_sanity_val_steps=0) # Skip sanity check validation

trainer.fit(segmentation_model)

Error:

PicklingError Traceback (most recent call last) File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\trainer\call.py:44, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs) 43 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs) ---> 44 return trainer_fn(*args, **kwargs) 46 except _TunerExitException:

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\trainer\trainer.py:579, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path) 573 ckpt_path = self._checkpoint_connector._select_ckpt_path( 574 self.state.fn, 575 ckpt_path, 576 model_provided=True, 577 model_connected=self.lightning_module is not None, 578 ) --> 579 self._run(model, ckpt_path=ckpt_path) 581 assert self.state.stopped

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\trainer\trainer.py:986, in Trainer._run(self, model, ckpt_path) 983 # ---------------------------- 984 # RUN THE TRAINER 985 # ---------------------------- --> 986 results = self._run_stage() 988 # ---------------------------- 989 # POST-Training CLEAN UP 990 # ----------------------------

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\trainer\trainer.py:1032, in Trainer._run_stage(self) 1031 with torch.autograd.set_detect_anomaly(self._detect_anomaly): -> 1032 self.fit_loop.run() 1033 return None

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\loops\fit_loop.py:197, in _FitLoop.run(self) 196 def run(self) -> None: --> 197 self.setup_data() 198 if self.skip:

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\loops\fit_loop.py:263, in _FitLoop.setup_data(self) 262 self._data_fetcher.setup(combined_loader) --> 263 iter(self._data_fetcher) # creates the iterator inside the fetcher 264 max_batches = sized_len(combined_loader)

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\loops\fetchers.py:104, in _PrefetchDataFetcher.__iter__(self) 102 @override 103 def __iter__(self) -> "_PrefetchDataFetcher": --> 104 super().__iter__() 105 if self.length is not None: 106 # ignore pre-fetching, it's not necessary

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\loops\fetchers.py:51, in _DataFetcher.__iter__(self) 49 @override 50 def __iter__(self) -> "_DataFetcher": ---> 51 self.iterator = iter(self.combined_loader) 52 self.reset()

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\utilities\combined_loader.py:351, in CombinedLoader.__iter__(self) 350 iterator = cls(self.flattened, self._limits) --> 351 iter(iterator) 352 self._iterator = iterator

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\utilities\combined_loader.py:92, in _MaxSizeCycle.__iter__(self) 90 @override 91 def __iter__(self) -> Self: ---> 92 super().__iter__() 93 self._consumed = [False] * len(self.iterables)

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\utilities\combined_loader.py:43, in _ModeIterator.__iter__(self) 41 @override 42 def __iter__(self) -> Self: ---> 43 self.iterators = [iter(iterable) for iterable in self.iterables] 44 self._idx = 0

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\utilities\combined_loader.py:43, in <listcomp>(.0) 41 @override 42 def __iter__(self) -> Self: ---> 43 self.iterators = [iter(iterable) for iterable in self.iterables] 44 self._idx = 0

File e:\ANACONDA\envs\deeplearning\Lib\site-packages\torch\utils\data\dataloader.py:438, in DataLoader.__iter__(self) 437 else: --> 438 return self._get_iterator()

File e:\ANACONDA\envs\deeplearning\Lib\site-packages\torch\utils\data\dataloader.py:386, in DataLoader._get_iterator(self) 385 self.check_worker_number_rationality() --> 386 return _MultiProcessingDataLoaderIter(self)

File e:\ANACONDA\envs\deeplearning\Lib\site-packages\torch\utils\data\dataloader.py:1039, in _MultiProcessingDataLoaderIter.__init__(self, loader) 1033 # NB: Process.start() actually take some time as it needs to 1034 # start a process and pass the arguments over via a pipe. 1035 # Therefore, we only add a worker to self._workers list after 1036 # it started, so that we do not call .join() if program dies 1037 # before it starts, and __del__ tries to join but will get: 1038 # AssertionError: can only join a started process. -> 1039 w.start() 1040 self._index_queues.append(index_queue)

File e:\ANACONDA\envs\deeplearning\Lib\multiprocessing\process.py:121, in BaseProcess.start(self) 120 _cleanup() --> 121 self._popen = self._Popen(self) 122 self._sentinel = self._popen.sentinel

File e:\ANACONDA\envs\deeplearning\Lib\multiprocessing\context.py:224, in Process._Popen(process_obj) 222 @staticmethod 223 def _Popen(process_obj): --> 224 return _default_context.get_context().Process._Popen(process_obj)

File e:\ANACONDA\envs\deeplearning\Lib\multiprocessing\context.py:336, in SpawnProcess._Popen(process_obj) 335 from .popen_spawn_win32 import Popen --> 336 return Popen(process_obj)

File e:\ANACONDA\envs\deeplearning\Lib\multiprocessing\popen_spawn_win32.py:94, in Popen.__init__(self, process_obj) 93 reduction.dump(prep_data, to_child) ---> 94 reduction.dump(process_obj, to_child) 95 finally:

File e:\ANACONDA\envs\deeplearning\Lib\multiprocessing\reduction.py:60, in dump(obj, file, protocol) 59 '''Replacement for pickle.dump() using ForkingPickler.''' ---> 60 ForkingPickler(file, protocol).dump(obj)

PicklingError: Can't pickle <class 'pyannote.database.registry.dis'>: attribute lookup dis on pyannote.database.registry failed

During handling of the above exception, another exception occurred:

RuntimeError Traceback (most recent call last) Cell In[59], line 55 44 # trainer = Trainer(accelerator="gpu", 45 # callbacks=callbacks, 46 # max_epochs=20, 47 # gradient_clip_val=0.5) 49 trainer = Trainer(accelerator="gpu", 50 callbacks=callbacks, 51 max_epochs=20, 52 gradient_clip_val=0.5, 53 num_sanity_val_steps=0) # Skip sanity check validation ---> 55 trainer.fit(segmentation_model)

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\trainer\trainer.py:543, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path) 541 self.state.status = TrainerStatus.RUNNING 542 self.training = True --> 543 call._call_and_handle_interrupt( 544 self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path 545 )

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\trainer\call.py:68, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs) 66 for logger in trainer.loggers: 67 logger.finalize("failed") ---> 68 trainer._teardown() 69 # teardown might access the stage so we reset it after 70 trainer.state.stage = None

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\trainer\trainer.py:1013, in Trainer._teardown(self) 1011 # loop should never be None here but it can because we don't know the trainer stage with ddp_spawn 1012 if loop is not None: -> 1013 loop.teardown() 1014 self._logger_connector.teardown() 1015 self._signal_connector.teardown()

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\loops\fit_loop.py:411, in _FitLoop.teardown(self) 409 def teardown(self) -> None: 410 if self._data_fetcher is not None: --> 411 self._data_fetcher.teardown() 412 self._data_fetcher = None 413 self.epoch_loop.teardown()

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\loops\fetchers.py:79, in _DataFetcher.teardown(self) 78 def teardown(self) -> None: ---> 79 self.reset() 80 if self._combined_loader is not None: 81 self._combined_loader.reset()

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\loops\fetchers.py:141, in _PrefetchDataFetcher.reset(self) 139 @override 140 def reset(self) -> None: --> 141 super().reset() 142 self.batches = []

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\loops\fetchers.py:75, in _DataFetcher.reset(self) 73 # teardown calls reset(), and if it happens early, combined_loader can still be None 74 if self._combined_loader is not None: ---> 75 self.length = sized_len(self.combined_loader) 76 self.done = self.length == 0

File ~\AppData\Roaming\Python\Python311\site-packages\lightning_fabric\utilities\data.py:51, in sized_len(dataloader) 48 """Try to get the length of an object, return None otherwise.""" 49 try: 50 # try getting the length ---> 51 length = len(dataloader) # type: ignore [arg-type] 52 except (TypeError, NotImplementedError): 53 length = None

File ~\AppData\Roaming\Python\Python311\site-packages\pytorch_lightning\utilities\combined_loader.py:358, in CombinedLoader.__len__(self) 356 """Compute the number of batches.""" 357 if self._iterator is None: --> 358 raise RuntimeError("Please call iter(combined_loader) first.") 359 return len(self._iterator)

RuntimeError: Please call iter(combined_loader) first.

I cannot understand what is causing this error. Any help would be appreciated. Thank you.

FrenchKrab commented 7 months ago

I'm not entirely sure where it comes from, but you can try changing save_top_k=1 to save_top_k=0 or save_top_k=-1, in order to save no/all checkpoints without consulting the metric (which isn't available in your case); maybe save_last=False is what you're looking for. If that doesn't work, you can change the monitored metric (monitor and mode) to something that is logged at training time, like train/loss. That might help?
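For illustration, a minimal sketch of a checkpoint setup that does not consult any validation metric, assuming plain pytorch_lightning; the "loss/train" name in the commented alternative is an assumption and should be checked against whatever the task actually logs (e.g. via trainer.logged_metrics):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint, RichProgressBar

# save a checkpoint every epoch without monitoring any metric
# (monitor=None with save_top_k=-1 keeps all of them; save_top_k=0 would keep none)
checkpoint = ModelCheckpoint(
    monitor=None,
    save_top_k=-1,
    every_n_epochs=1,
    save_last=True,  # also keep last.ckpt for easy resuming
    filename="{epoch}",
)

# alternative (assumption): monitor a metric logged at training time instead
# checkpoint = ModelCheckpoint(monitor="loss/train", mode="min", save_top_k=1)

callbacks = [RichProgressBar(), checkpoint]
trainer = Trainer(
    accelerator="gpu",
    callbacks=callbacks,
    max_epochs=20,
    gradient_clip_val=0.5,
    num_sanity_val_steps=0,  # no validation set, so skip the sanity check
)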

Ashh-Z commented 7 months ago

I'm not entirely sure where it comes from, but you can try changing save_top_k=1 to save_top_k=0 or save_top_k=-1, in order to save no/all checkpoints without consulting the metric (which isn't available in your case); maybe save_last=False is what you're looking for. If that doesn't work, you can change the monitored metric (monitor and mode) to something that is logged at training time, like train/loss. That might help?

Hi, I got around this error by setting the parameter num_workers = 0 for the segmentation task:

segmentation_model.task = Segmentation(protocol, num_workers=0, duration=5.0, max_speakers_per_chunk=5, max_speakers_per_frame=3)

But now I am facing a different error:

ValueError: requested chunk [642406.912774s, 642411.912774s] (frames #10278510604 to #10278590604) lies outside of M028 file bounds [0., 1536.575000s] (24585200 frames).

On different runs, I get this error on different files.

Is this due to a problem with the segmentation model? The requested chunks on all these erroneous files are far beyond the actual length of the audio file; the audio files themselves are mostly around 30 minutes long.

A similar issue was also reported here, but in that case the requested chunks were near the end of the audio file, which is not the case with my files.

Ashh-Z commented 7 months ago

Got this resolved; it was an issue with how I generated the UEM files. Thank you.
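Since the root cause was malformed UEM files, a quick sanity check can catch this kind of mismatch early. A hedged sketch, assuming the standard "uri channel start end" UEM line format, the directory names from the finetune.yml shown above, and that soundfile is installed:

from pathlib import Path
import soundfile as sf

# hypothetical paths, taken from the database.yml shown earlier
AUDIO_DIR = Path("dev_true/AUDIO_supervised/SD")
UEM_DIR = Path("dev_uems")

for uem_path in UEM_DIR.glob("*.uem"):
    uri = uem_path.stem
    duration = sf.info(str(AUDIO_DIR / f"{uri}.wav")).duration
    for line in uem_path.read_text().splitlines():
        if not line.strip():
            continue
        _, _, start, end = line.split()
        start, end = float(start), float(end)
        # flag annotated regions that fall outside the audio file bounds
        if not (0.0 <= start < end <= duration + 1e-3):
            print(f"{uri}: annotated region [{start:.3f}s, {end:.3f}s] "
                  f"exceeds file duration {duration:.3f}s")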