a-r-r-o-w / cogvideox-factory

Memory optimized finetuning scripts for CogVideoX using TorchAO and DeepSpeed
Apache License 2.0

AttributeError: Can't pickle local object 'VideoDataset.__init__.<locals>.<lambda>' #23

Open Nojahhh opened 1 day ago

Nojahhh commented 1 day ago

System Info

CUDA 12.1, RTX 4090, Python 3.11, Windows, torch==2.4.0+cu121, transformers==4.44.2, diffusers==0.31.0.dev0, accelerate==1.0.0

Information

Reproduction

I'm not sure what I'm doing wrong here. I have prepared a dataset of 52 mp4 videos, each 720x480, 8 fps, and 49 frames. All prompts are in prompts.txt and all video paths are in videos.txt, and both load without problems. I have not done any tensor preparation.

The issue comes at the beginning of training. It is just about to start its first steps when I get an error about a local object that can't be pickled.

Traceback (most recent call last):
  File "C:\Users\melli\Documents\VIKI\viki.ai\cogvideox-factory\training\cogvideox_image_to_video_lora.py", line 956, in <module>
    main(args)
  File "C:\Users\melli\Documents\VIKI\viki.ai\cogvideox-factory\training\cogvideox_image_to_video_lora.py", line 634, in main
    for step, batch in enumerate(train_dataloader):
  File "C:\Python311\Lib\site-packages\accelerate\data_loader.py", line 547, in __iter__
    dataloader_iter = self.base_dataloader.__iter__()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 440, in __iter__
    return self._get_iterator()
           ^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 388, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1038, in __init__
    w.start()
  File "C:\Python311\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 94, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Python311\Lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'VideoDataset.__init__.<locals>.<lambda>'

Expected behavior

Training runs as described in the instructions.

Nojahhh commented 1 day ago

Found the issue.

Updated transformers from 4.44.2 to 4.45.2 (there is a typo in requirements.txt showing transformers 0.45.2) and needed to upgrade torchao to 0.5.0:

RuntimeError: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback): cannot import name 'quantize_' from 'torchao.quantization' (C:\Python311\Lib\site-packages\torchao\quantization\__init__.py)

Can't install torchao 0.5.0 on Windows (they introduced quantize_ in torchao 0.4.0):

ERROR: Could not find a version that satisfies the requirement torchao==0.5.0 (from versions: 0.0.1, 0.0.3, 0.1)

From what I read in another thread, torchao 0.5.0 is not supported on Windows yet because triton lacks Windows support.

So as of now this training script is only suitable for Linux.
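For anyone reproducing this, it may help to report the versions that are actually installed rather than the ones listed in requirements.txt. The snippet below is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the package list is simply the set of names mentioned in this thread.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names taken from the version list and error messages in this thread.
    for name in ("torch", "transformers", "diffusers", "accelerate", "torchao"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

This only reads package metadata, so it should work even when importing a package (for example transformers with an old torchao) fails.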

a-r-r-o-w commented 1 day ago

Thanks for the PR correcting the versions! I heard from a user on Reddit that they were able to get it to run with WSL (I can't verify because I don't have a Windows device, unfortunately). I'm not sure about the status of triton on Windows (but I think there are community efforts to make it work). I think you should be able to train in under 24 GB even without torchao/triton as long as you precompute latents and prompt embeddings.

sayakpaul commented 17 hours ago

I think this is safe to close now?

Nojahhh commented 2 hours ago

I'm sorry. It seems I was wrong to close this. The original issue is still there: the local object cannot be pickled.

Based on some research online, this is related to Windows. Link

a-r-r-o-w commented 1 hour ago

Based on that discussion, it seems like lambda functions don't play well with pickle. I think I know how this might be fixed, but since I don't have Windows, I will need your help in trying out things so we might have to do a bit of back and forth on this. I hope you don't mind.

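Could you try replacing the Python lambda that rescales the data from 0-255 to 0.0-1.0, which is currently passed to the torchvision Lambda transform? Instead of a lambda, could you define a named function at module level (a function nested inside `__init__` would still be a local object and fail to pickle in the same way) and pass that to the transform?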

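from torchvision import transforms

# Defined at module level (not inside VideoDataset.__init__) so pickle can look it up by name.
def normalize(x):
    return x / 255.0

...
transforms.Lambda(normalize)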
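The traceback above goes through popen_spawn_win32.py and reduction.dump, which suggests the DataLoader worker processes are started by pickling the objects they need, including the dataset. The following standard-library sketch illustrates why the lambda fails and a module-level function does not. `Holder` and `can_pickle` are hypothetical names used only for illustration; `Holder` is a stand-in for a dataset that stores a transform and is not the repository's VideoDataset.

```python
import pickle


def normalize(x):
    """Module-level function: pickle stores a reference to its qualified name."""
    return x / 255.0


class Holder:
    """Hypothetical stand-in for a dataset that stores a transform callable."""

    def __init__(self, use_lambda):
        if use_lambda:
            # Created inside __init__, so its qualified name contains "<locals>"
            # and pickle cannot look it up by name.
            self.transform = lambda x: x / 255.0
        else:
            self.transform = normalize


def can_pickle(obj):
    """Return True if pickle.dumps succeeds on obj, False if pickling fails."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


if __name__ == "__main__":
    print("lambda transform picklable:", can_pickle(Holder(use_lambda=True)))
    print("module-level transform picklable:", can_pickle(Holder(use_lambda=False)))
```

If the named function still cannot be used for some reason, setting the DataLoader's num_workers to 0 avoids starting worker processes at all, at the cost of loading data in the main process.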