baaivision / Emu

Emu Series: Generative Multimodal Models from BAAI
https://baaivision.github.io/emu2/
Apache License 2.0

Too Many Requests when Downloading YT-SB-1B #72

Open LengSicong opened 8 months ago

LengSicong commented 8 months ago

Thanks for your work! I am trying to use video2dataset to download YT-Temporal-1B, but it reports "too many requests" while downloading. Could you give me some advice on how to fix this problem?

SlotherCui commented 8 months ago

If you are using the official video2dataset script to download raw videos, YouTube may restrict your request frequency, resulting in "too many requests" errors. To address this, you can consider techniques such as setting up IP proxies to get around the rate limits. However, when constructing YT-SB-1B, we only made requests to the interface that serves storyboard images. Fortunately, this interface did not restrict the number of requests (at least not during our crawling process).

LengSicong commented 8 months ago

Hi, thanks for your prompt reply. How can I send requests only to the interface that serves storyboard images? The official instructions given here use video2dataset to download storyboard images.

SlotherCui commented 8 months ago

We use thumbframes_dl.

LengSicong commented 8 months ago

Do the storyboard images downloaded through thumbframes_dl contain timestamp information? It may be needed to construct the interleaved video-text data in the next step.

SlotherCui commented 8 months ago

You can refer to this code. The time intervals between storyboard images are continuous and fixed, so the timestamps can be inferred.
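To illustrate how timestamps might be inferred from a fixed interval, here is a minimal sketch. It is not the code referred to above; the function name `storyboard_timestamps` and its parameters (`num_frames`, `interval_s`, `start_s`) are hypothetical, and it assumes that frames are ordered in time and that each frame covers exactly one interval.

```python
from typing import List, Tuple


def storyboard_timestamps(
    num_frames: int, interval_s: float, start_s: float = 0.0
) -> List[Tuple[float, float]]:
    """Return the (start, end) time in seconds covered by each storyboard frame.

    Assumption (not confirmed by the thread): frame i covers the half-open
    window [start_s + i * interval_s, start_s + (i + 1) * interval_s).
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return [
        (start_s + i * interval_s, start_s + (i + 1) * interval_s)
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    # Three frames sampled every 2 seconds.
    for index, (begin, end) in enumerate(storyboard_timestamps(3, 2.0)):
        print(index, begin, end)
```

With windows like these, a subtitle could be matched to the frame whose window contains the subtitle's start time. The actual interval has to come from the storyboard metadata for each video.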

clownrat6 commented 8 months ago

Hello, I ran into the same problem ("HTTPError: 429 Client Error: Too Many Requests for url: xxx") when downloading subtitles. Do you have any advice?

SlotherCui commented 8 months ago

Certainly. The most widely used and effective solution is to set up IP proxies, although this requires purchasing an IP proxy service. Another approach is to lengthen the interval between requests; lowering the request frequency may help alleviate the issue.
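As a rough sketch of the "lengthen the interval between requests" idea, the snippet below retries a call with exponentially growing waits whenever it hits a rate limit. Everything here is an assumption for illustration: `TooManyRequests`, `backoff_delays`, and `call_with_backoff` are hypothetical names, not part of video2dataset or thumbframes_dl, and the default delays are arbitrary.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TooManyRequests(Exception):
    """Hypothetical marker raised by the caller's fetch function on HTTP 429."""


def backoff_delays(base_s: float, factor: float, max_s: float, retries: int) -> List[float]:
    """Waits of base_s, base_s * factor, ... seconds, each capped at max_s."""
    return [min(max_s, base_s * factor ** i) for i in range(retries)]


def call_with_backoff(
    fetch: Callable[[], T],
    retries: int = 5,
    base_s: float = 1.0,
    factor: float = 2.0,
    max_s: float = 60.0,
    jitter_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on TooManyRequests, wait longer each time and try again."""
    for delay in backoff_delays(base_s, factor, max_s, retries):
        try:
            return fetch()
        except TooManyRequests:
            # Optional random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0.0, jitter_s))
    # Final attempt: if this one fails too, the exception propagates to the caller.
    return fetch()
```

Here `fetch` would wrap whatever download call is in use and raise `TooManyRequests` when the response status is 429. If a proxy service is available, the same wrapper could pass it through, for example via the `proxies` argument of `requests.get`.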