THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

Question about the "Explicit Uniform Sampling" in the report #91

Closed · StarCycle closed 2 months ago

StarCycle commented 2 months ago

Hello @zRzRzRzRzRzRzR ,

As you mentioned in Section 3.3 of the report, letting each process sample a value between 1 and T is not uniform enough. (A process corresponds to a GPU in data parallelism.)

Can you explain it in detail? Why would multi-process random value sampling be a problem here? Could you please share some links about this issue, if other developers have run into the same problem?

Best wishes, StarCycle

tengjiayan20 commented 2 months ago

Just imagine a simple situation: when you randomly sample four numbers in 0-1000, it is possible that they are four similar numbers, like 10, 20, 30, 40. But if you randomly sample one number in each of 0-250, 250-500, 500-750, and 750-1000, you will definitely get four numbers with a more uniform distribution.

Our ablation study in Section 3.3 of the paper, together with Figure 3(d), verifies this.
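
To make the contrast concrete, here is a minimal sketch (illustrative only, not the repository's code) of the two sampling schemes for four data-parallel processes:

```python
import random

T = 1000          # total diffusion timesteps
world_size = 4    # number of data-parallel processes (GPUs)

# Naive: every process samples independently over the whole range.
# Nothing prevents the four draws from clustering together.
naive = [random.randrange(T) for _ in range(world_size)]

# Stratified ("explicit uniform"): process k samples only from its
# own sub-interval [k*T/world_size, (k+1)*T/world_size).
stratified = [
    random.randrange(k * T // world_size, (k + 1) * T // world_size)
    for k in range(world_size)
]

print("naive:     ", sorted(naive))       # can cluster, e.g. all below 100
print("stratified:", sorted(stratified))  # always one draw per quartile
```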

StarCycle commented 2 months ago

Hi @tengjiayan20,

Thanks for your response! But I am still curious:

> when you randomly sample four numbers in 0-1000, it is possible that they are four similar numbers, like 10, 20, 30, 40

Why would this happen? Is it a problem specific to your cluster, or is it a common problem (if so, has anyone asked about it on Stack Overflow or elsewhere)?

For data parallelism, you could let the main process sample multiple seeds and send them to the other processes. Would you still get "similar numbers" then?
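
Concretely, I mean something like this (a hypothetical sketch, not any existing code):

```python
import random

T = 1000
world_size = 4

# Main process draws one seed per worker and sends it to that worker.
main_rng = random.Random(0)
seeds = [main_rng.randrange(2**31) for _ in range(world_size)]

# Each worker gets its own random stream, but still draws i.i.d. over [0, T).
draws = [random.Random(seed).randrange(T) for seed in seeds]
print(sorted(draws))
```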

tengjiayan20 commented 2 months ago

Here, "similar numbers" only refers to an extreme case that occurs with low probability. What I mean is that sampling uniformly within each interval is equivalent to adding an additional prior condition (i.e., segmentation into sub-intervals) to uniform sampling over 0-1000. This operation makes the sampled timesteps more uniform across processes, and since the value of the diffusion training loss is sensitive to the timestep, doing this stabilizes training.
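
As a rough sketch, per-interval timestep sampling on each data-parallel rank could look like the following (assuming torch.distributed is already initialized; the actual CogVideoX training code may differ):

```python
import torch
import torch.distributed as dist

def sample_timesteps(batch_size: int, T: int = 1000) -> torch.Tensor:
    """Explicit uniform sampling: each data-parallel rank draws timesteps
    only from its own sub-interval of [0, T), so the union of all ranks'
    draws covers the range more evenly than independent sampling would."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    lo = rank * T // world_size
    hi = (rank + 1) * T // world_size
    return torch.randint(lo, hi, (batch_size,))
```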

StarCycle commented 2 months ago

I see. Thanks for this answer.