dvmazur / mixtral-offloading

Run Mixtral-8x7B models in Colab or consumer desktops
MIT License

Hard to benchmark the operation in the repo #39

Status: Open · mynotwo opened this issue 3 months ago

mynotwo commented 3 months ago

Hi, thanks for your work! I recently wanted to benchmark the latency of each step in this repo, and I found that even using torch.cuda.synchronize() and time.time(), I cannot measure the actual data copy time.

For example, I believe the data copy happens in these two lines:

    device_expert_buffer.storage.copy_(self.offloaded_storages[info_to_load.index], non_blocking=True)
    offloaded_storage_buffer.copy_(self.main_modules[info_to_evict.index].storage, non_blocking=True)

And time.time() gives me about 1e-5 s, which I believe is far faster than the real data transfer latency. I think the reason might be that multiple processes/threads are involved, which would lead to a wrong latency measurement. Could you help me solve this problem?

Many thanks!
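For illustration, here is a minimal standalone sketch of this pitfall (the buffers below are stand-ins, not the repo's actual expert storages): timing a non_blocking copy with time.time() alone only measures how long it takes to enqueue the copy on the CUDA stream, while synchronizing before and after, or using CUDA events, captures the real transfer time.

    import time
    import torch

    assert torch.cuda.is_available(), "this measurement needs a CUDA device"

    # 64 MiB stand-in buffers; pinned host memory mimics the repo's offloaded storages.
    cpu_buf = torch.empty(1 << 26, dtype=torch.uint8, pin_memory=True)
    gpu_buf = torch.empty(1 << 26, dtype=torch.uint8, device="cuda")

    # Naive timing: reads ~1e-5 s, because copy_ is merely *enqueued* on the stream.
    t0 = time.time()
    gpu_buf.copy_(cpu_buf, non_blocking=True)
    print(f"enqueue only: {time.time() - t0:.2e} s")

    # Correct wall-clock timing: synchronize both before and after the copy.
    torch.cuda.synchronize()
    t0 = time.time()
    gpu_buf.copy_(cpu_buf, non_blocking=True)
    torch.cuda.synchronize()  # block until the transfer has actually finished
    print(f"with sync:    {time.time() - t0:.2e} s")

    # Alternative: CUDA events time the copy on the GPU side.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    gpu_buf.copy_(cpu_buf, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    print(f"cuda events:  {start.elapsed_time(end) / 1e3:.2e} s")  # elapsed_time is in ms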

dvmazur commented 3 months ago

Hi! In this case the .copy_ operation is non-blocking, meaning it doesn't wait for the underlying copy to finish, but lets the Python thread proceed as soon as the operation is submitted to the CUDA stream. You might want to look into PyTorch's profiler. I recommend exporting your traces to JSON and viewing them with Perfetto or chrome://tracing.
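As a rough sketch of that profiling workflow, again with stand-in buffers rather than the repo's expert cache: the host-to-device and device-to-host copies appear as Memcpy events on the CUDA timeline, and the exported trace.json can be opened in Perfetto (https://ui.perfetto.dev) or chrome://tracing.

    import torch
    from torch.profiler import profile, ProfilerActivity

    cpu_buf = torch.empty(1 << 26, dtype=torch.uint8, pin_memory=True)
    gpu_buf = torch.empty(1 << 26, dtype=torch.uint8, device="cuda")

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        gpu_buf.copy_(cpu_buf, non_blocking=True)  # host -> device, like loading an expert
        cpu_buf.copy_(gpu_buf, non_blocking=True)  # device -> host, like evicting one
        torch.cuda.synchronize()

    # Console summary; the Memcpy rows carry the real transfer times.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

    # Export for Perfetto or chrome://tracing.
    prof.export_chrome_trace("trace.json")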