PKU-YuanGroup / Open-Sora-Plan

This project aims to reproduce Sora (OpenAI's T2V model); we hope the open source community will contribute to this project.
MIT License

missing image data for training #312

Open quantumiracle opened 3 months ago

quantumiracle commented 3 months ago

Hi,

When launching the t2v training, it also requires specifying an image data path, as here. However, in the HuggingFace dataset repo there is no image-text dataset, which leads to an error when launching training:

FileNotFoundError: [Errno 2] No such file or directory: '/dxyl_data02/anno_jsons/human_images_162094.json'

How to fix this?

quantumiracle commented 3 months ago

Another error I got from t2v training is:

/opensora/dataset/t2v_datasets.py", line 76, in get_video
    frame_idx = self.vid_cap_list[idx]['frame_idx']
KeyError: 'frame_idx'

where frame_idx does not exist in the json file.

LinB203 commented 3 months ago

Hi,

When launching the t2v training, it also requires specifying an image data path, as here. However, in the HuggingFace dataset repo there is no image-text dataset, which leads to an error when launching training:

FileNotFoundError: [Errno 2] No such file or directory: '/dxyl_data02/anno_jsons/human_images_162094.json'

How to fix this?

https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0/blob/main/anno_jsons/human_images_162094.json
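For reference, one way to fetch just that annotation file is something like the following (a minimal sketch; local_dir is only an example and should match whatever path your training script expects):

from huggingface_hub import hf_hub_download

# Download only the image-text annotation file from the dataset repo;
# with local_dir=".", it lands at ./anno_jsons/human_images_162094.json
hf_hub_download(
    repo_id="LanguageBind/Open-Sora-Plan-v1.1.0",
    repo_type="dataset",
    filename="anno_jsons/human_images_162094.json",
    local_dir=".",
)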

LinB203 commented 3 months ago

Another error I got from t2v training is:

/opensora/dataset/t2v_datasets.py", line 76, in get_video
    frame_idx = self.vid_cap_list[idx]['frame_idx']
KeyError: 'frame_idx'

where frame_idx does not exist in the json file.

Are you using the v1.1 code? The v1.1 code should use the annotations from here.

quantumiracle commented 3 months ago

Thanks for the quick reply.

It seems I'm using the v1.0 dataset.

quantumiracle commented 3 months ago

Hi,

when I'm trying to download the v1.1 dataset with:

from huggingface_hub import snapshot_download

data_dir = "./Open-Sora-Plan-v1.1.0_dataset"  # example local path for the dataset
snapshot_download(repo_id="LanguageBind/Open-Sora-Plan-v1.1.0", repo_type="dataset", local_dir=data_dir)

I got error:

...
Fetching 117685 files:  12%|████████████▉                                                                                               | 14039/117685 [14:29<1:46:57, 16.15it/s]
4173980_resize1080p.mp4: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.55M/4.55M [00:00<00:00, 247MB/s]
4173972_resize1080p.mp4: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.49M/5.49M [00:00<00:00, 40.7MB/s]
4173976_resize1080p.mp4: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.94M/8.94M [00:00<00:00, 31.9MB/s]
4173975_resize1080p.mp4: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10.7M/10.7M [00:00<00:00, 45.5MB/s]
4173977_resize1080p.mp4: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.3M/12.3M [00:00<00:00, 66.7MB/s]
4173973_resize1080p.mp4: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26.1M/26.1M [00:00<00:00, 83.2MB/s]
4173981_resize1080p.mp4: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18.8M/18.8M [00:00<00:00, 126MB/s]
4173982_resize1080p.mp4: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 24.6M/24.6M [00:00<00:00, 298MB/s]
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
    raise HfHubHTTPError(message, response=response) from e
huggingface_hub.utils._errors.HfHubHTTPError:

403 Forbidden: None.
Cannot access content at: https://cdn-lfs-us-1.huggingface.co/repos/d1/a4/d1a47faaa1475f32c7e503cebcd6029bdf94c4a148ceb23e2f5e052d50d3f02a/dc4d652445209b5ad6ad292bc6755cc067abf187c739ee2bf8e8b75b3b2a9d90?response-content-disposition=inline%3B+filename*%3DUTF-8%27%274173971_resize1080p.mp4%3B+filename%3D%224173971_resize1080p.mp4%22%3B&response-content-type=video%2Fmp4&Expires=1718648001&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxODY0ODAwMX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zL2QxL2E0L2QxYTQ3ZmFhYTE0NzVmMzJjN2U1MDNjZWJjZDYwMjliZGY5NGM0YTE0OGNlYjIzZTJmNWUwNTJkNTBkM2YwMmEvZGM0ZDY1MjQ0NTIwOWI1YWQ2YWQyOTJiYzY3NTVjYzA2N2FiZjE4N2M3MzllZTJiZjhlOGI3NWIzYjJhOWQ5MD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=dnYc6UEjLcqCarh~mWlzSybLf505FdK8ClHTvIKnnY4Pc2nMLnsp5fxAUSLz3u24xOSQoykAxOG2h2kKCgMG-yKe4bUGRkLrLNJwn75Xl1C5L2iza3-wE6LlnDAre6Ju81QWolv1Wy6fIK0OHWJVMhIHquUqKyMHiaOXl7CLktLQg0POb-wga8HB9HFLDdsUm~1a2uH2mSAOcdAQz9teTMOJ4HCIOfwuuPaJiYK0g0NPeiddWMP4U~8R3cgghVLzq67YFrmmdcpT6Rv-K1F4LE4nLIo9LwQmATHzbI2y1Xgmzs9wFN4U7aGJ6Hq7avfaFplKLOK7nvV-enaJ-t0EOA__&Key-Pair-Id=K2FPYV99P2N66Q.
If you are trying to create or update content, make sure you have a token with the `write` role.

LinB203 commented 3 months ago

It seems like a network error? By the way, the full Pexels dataset has not been uploaded completely yet.

quantumiracle commented 3 months ago

Hi,

I think this is an access issue rather than a network problem, since it reports:

If you are trying to create or update content, make sure you have a token with the `write` role.

I tried both snapshot_download and git clone directly, and both give this error. Any idea why this happens?

LinB203 commented 3 months ago

I checked the program that was uploading data in the background and it was interrupted; there may be some unknown error. I'm trying to fix it. Maybe we should upload zip files instead of individual video files.

quantumiracle commented 3 months ago

Yes, compressed tar.gz files would be good.

It may also be good to host each dataset at a different URL and provide a download script. Downloading the entire dataset and getting interrupted in the middle wastes a lot of time.
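In the meantime, restricting snapshot_download to a sub-folder with allow_patterns avoids re-fetching everything after an interruption (a sketch; the pattern and local path below are only examples and depend on the repo's folder names):

from huggingface_hub import snapshot_download

# Fetch only one part of the dataset repo (here, the annotation jsons);
# re-running the same call skips files that were already downloaded.
snapshot_download(
    repo_id="LanguageBind/Open-Sora-Plan-v1.1.0",
    repo_type="dataset",
    local_dir="./Open-Sora-Plan-v1.1.0_dataset",
    allow_patterns=["anno_jsons/*"],
)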

physercoe commented 3 months ago

Hi,

I think this is an access issue rather than a network problem, since it reports:

If you are trying to create or update content, make sure you have a token with the `write` role.

I tried both snapshot_download and git clone directly, and both give this error. Any idea why this happens?

I met the same issue. Please provide compressed tar.gz files instead of a lot of separate small files.

quantumiracle commented 3 months ago

@LinB203 Hi, when will the dataset be ready? I could help with curating the data if you need.

LinB203 commented 3 months ago

Hi all, due to an exception while uploading the Pexels data, we decided to package it and upload it again. This process will take about a week.

quantumiracle commented 3 months ago

The pixabay_v2 dataset also seems to be incomplete. For example, when using video_pixabay_513f_51483.json, it reports:

Error with File not found: /LanguageBind/Open-Sora-Plan-v1.1.0_dataset/pixabay_v2/137509-766715277.mp4

LinB203 commented 3 months ago

The pixabay_v2 dataset also seems to be incomplete. For example, when using video_pixabay_513f_51483.json, it reports:

Error with File not found: /LanguageBind/Open-Sora-Plan-v1.1.0_dataset/pixabay_v2/137509-766715277.mp4

We have packaged the whole pixabay data into 50 .tar.gz files. The 50 .tar.gz files should be unpacked and all the videos moved into one folder.
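For example, something like this sketch unpacks them into a single directory (assuming the archives were downloaded into a pixabay_v2_tar folder; the folder names are only examples):

import tarfile
from pathlib import Path

src = Path("pixabay_v2_tar")   # folder holding the downloaded .tar.gz files (example name)
dst = Path("pixabay_v2")       # single folder that the annotation jsons point to (example name)
dst.mkdir(exist_ok=True)

for archive in sorted(src.glob("*.tar.gz")):
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".mp4"):
                member.name = Path(member.name).name  # flatten any sub-directories
                tar.extract(member, path=dst)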

quantumiracle commented 3 months ago

The pixabay_v2 dataset also seems to be incomplete. For example, when using video_pixabay_513f_51483.json, it reports:

Error with File not found: /LanguageBind/Open-Sora-Plan-v1.1.0_dataset/pixabay_v2/137509-766715277.mp4

We have packaged the whole pixabay data into 50 .tar.gz files. The 50 .tar.gz files should be unpacked and all the videos moved into one folder.

Yes, I actually did that and received the above error.

LinB203 commented 3 months ago

The pixabay_v2 dataset also seems to be incomplete. For example, when using video_pixabay_513f_51483.json, it reports:

Error with File not found: /LanguageBind/Open-Sora-Plan-v1.1.0_dataset/pixabay_v2/137509-766715277.mp4

We have packaged the whole pixabay data into 50 .tar.gz files. The 50 .tar.gz files should be unpacked and all the videos moved into one folder.

Yes, I actually did that and received the above error.

Could you tell me how many Pixabay videos you downloaded? I will check the number.
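A quick way to count them (assuming all videos were moved into one pixabay_v2 folder, as above) is:

from pathlib import Path

# Count the extracted Pixabay videos in the flattened folder
print(len(list(Path("pixabay_v2").glob("*.mp4"))))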

LinB203 commented 3 months ago

The pixabay_v2 dataset also seems to be incomplete. For example, when using video_pixabay_513f_51483.json, it reports:

Error with File not found: /LanguageBind/Open-Sora-Plan-v1.1.0_dataset/pixabay_v2/137509-766715277.mp4

We have packaged the whole pixabay data into 50 .tar.gz files. The 50 .tar.gz files should be unpacked and all the videos moved into one folder.

Yes, I actually did that and received the above error.

We found that the HF repo was missing one .tar.gz file and we have just uploaded it. Please check: https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0/blob/main/pixabay_v2_tar/folder_16.tar.gz

roundchuan commented 3 months ago

Hi, When launching the t2v training, it also requires specifying an image data path, as here. However, in the HuggingFace dataset repo there is no image-text dataset, which leads to an error when launching training:

FileNotFoundError: [Errno 2] No such file or directory: '/dxyl_data02/anno_jsons/human_images_162094.json'

How to fix this?

https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0/blob/main/anno_jsons/human_images_162094.json

I can't find any other image datasets except "Human_images". Can you provide "SAM-11M" and "Anytext-3M"?

quantumiracle commented 3 months ago

The pixabay_v2 dataset also seems to be incomplete. For example, when using video_pixabay_513f_51483.json, it reports:

Error with File not found: /LanguageBind/Open-Sora-Plan-v1.1.0_dataset/pixabay_v2/137509-766715277.mp4

We have packaged the whole pixabay data into 50 .tar.gz files. The 50 .tar.gz files should be unpacked and all the videos moved into one folder.

Yes, I actually did that and received the above error.

We found that the HF repo was missing one .tar.gz file and we have just uploaded it. Please check: https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0/blob/main/pixabay_v2_tar/folder_16.tar.gz

I found that the error is actually caused by the names of the mp4 files in the json, where _resize1080p should be appended. This should fix the json:

jq 'map(.path |= sub(".mp4"; "_resize1080p.mp4"))' video_pixabay_65f_601513.json > updated_video_pixabay_65f_601513.json
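An equivalent fix without jq, for reference (a sketch assuming the json is a flat list of entries with a "path" field, as in the annotation files):

import json

with open("video_pixabay_65f_601513.json") as f:
    items = json.load(f)

# Append the _resize1080p suffix that the uploaded file names actually carry
for item in items:
    item["path"] = item["path"].replace(".mp4", "_resize1080p.mp4")

with open("updated_video_pixabay_65f_601513.json", "w") as f:
    json.dump(items, f)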

LinB203 commented 3 months ago

Hi all, the dataset has already been uploaded. For information on the dataset format and download instructions, please refer to this.

quantumiracle commented 2 months ago

@LinB203 Hi, I think these uploaded tar files in pexels are broken: 5000-8, 5000-39, 5000-46, 5000-53, 5000-57, 5000-64, 5000-69, 5000-92

I get errors after running cat *tar.part* > *.tar and extracting the result:

tar: Skipping to next header
tar: Archive contains ‘>\251\366\277{\235\315\f’ where numeric mode_t value expected
tar: Archive contains ‘\275'أ\245\324\037\273MP\226Q’ where numeric time_t value expected
tar: Archive contains ‘\256\200<4]\261\231\363’ where numeric uid_t value expected
...
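To narrow down which reassembled archives are unreadable, a minimal check like this can help (assuming the concatenated .tar files sit in the current directory):

import tarfile
from pathlib import Path

# Try to list each reassembled archive; corrupted ones raise a TarError
for archive in sorted(Path(".").glob("*.tar")):
    try:
        with tarfile.open(archive) as tar:
            tar.getmembers()
        print(archive, "ok")
    except tarfile.TarError as err:
        print(archive, "broken:", err)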

Could you check those files and reupload the correct ones?

Thanks