dvmazur / mixtral-offloading

Run Mixtral-8x7B models in Colab or consumer desktops
MIT License

Hard to benchmark the operation in the repo #39

Status: Open · mynotwo opened this issue 3 months ago

mynotwo commented 3 months ago

Hi, thanks for your work! I recently wanted to benchmark the latency of each step in this repo, and I found that even using torch.cuda.synchronize() and time.time(), I cannot measure the actual data copy time.

For example, I believe the data copy happens in these two lines:

    device_expert_buffer.storage.copy_(self.offloaded_storages[info_to_load.index], non_blocking=True)
    offloaded_storage_buffer.copy_(self.main_modules[info_to_evict.index].storage, non_blocking=True)

And time.time() gives me about 1e-5 s, which I believe is far faster than the real data transfer latency. I think the reason might be that multiple processes/threads are involved, which would lead to a wrong latency measurement. Could you help me solve this problem?

Many thanks!
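For illustration, here is a minimal standalone sketch of this pitfall (the buffers below are stand-ins, not the repo's actual expert storages): timing a non_blocking copy with time.time() alone only measures how long it takes to enqueue the copy on the CUDA stream, while synchronizing before and after, or using CUDA events, captures the real transfer time.

    import time
    import torch

    assert torch.cuda.is_available(), "this measurement needs a CUDA device"

    # 64 MiB stand-in buffers; pinned host memory mimics the repo's offloaded storages.
    cpu_buf = torch.empty(1 << 26, dtype=torch.uint8, pin_memory=True)
    gpu_buf = torch.empty(1 << 26, dtype=torch.uint8, device="cuda")

    # Naive timing: reads ~1e-5 s, because copy_ is merely *enqueued* on the stream.
    t0 = time.time()
    gpu_buf.copy_(cpu_buf, non_blocking=True)
    print(f"enqueue only: {time.time() - t0:.2e} s")

    # Correct wall-clock timing: synchronize both before and after the copy.
    torch.cuda.synchronize()
    t0 = time.time()
    gpu_buf.copy_(cpu_buf, non_blocking=True)
    torch.cuda.synchronize()  # block until the transfer has actually finished
    print(f"with sync:    {time.time() - t0:.2e} s")

    # Alternative: CUDA events time the copy on the GPU side.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    gpu_buf.copy_(cpu_buf, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    print(f"cuda events:  {start.elapsed_time(end) / 1e3:.2e} s")  # elapsed_time is in ms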

dvmazur commented 3 months ago

Hi! In this case the .copy_ operation is non-blocking, meaning it doesn't wait for the underlying copy to finish, but lets the Python thread proceed as soon as the operation is submitted to the CUDA stream. You might want to look into PyTorch's profiler. I recommend exporting your traces to JSON and viewing them with Perfetto or chrome://tracing.
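As a rough sketch of that profiling workflow, again with stand-in buffers rather than the repo's expert cache: the host-to-device and device-to-host copies appear as Memcpy events on the CUDA timeline, and the exported trace.json can be opened in Perfetto (https://ui.perfetto.dev) or chrome://tracing.

    import torch
    from torch.profiler import profile, ProfilerActivity

    cpu_buf = torch.empty(1 << 26, dtype=torch.uint8, pin_memory=True)
    gpu_buf = torch.empty(1 << 26, dtype=torch.uint8, device="cuda")

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        gpu_buf.copy_(cpu_buf, non_blocking=True)  # host -> device, like loading an expert
        cpu_buf.copy_(gpu_buf, non_blocking=True)  # device -> host, like evicting one
        torch.cuda.synchronize()

    # Console summary; the Memcpy rows carry the real transfer times.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

    # Export for Perfetto or chrome://tracing.
    prof.export_chrome_trace("trace.json")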