TimDettmers / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.
https://huggingface.co/docs/bitsandbytes/main/en/index
MIT License

Request for AdamW8bit support on CPU (would help TorchTune) #1226

Open sanchitintel opened 1 month ago

sanchitintel commented 1 month ago

Feature request

Port AdamW8bit support for CPU from the multi-backend-refactor branch to the main branch

Motivation

GPU machines from public cloud providers are usually expensive, while datacenter-grade CPUs are more readily available at lower prices. Toward the goal of making deep learning more accessible to developers and learners, the ability to fine-tune with AdamW8bit on CPU seems like a good milestone. TorchTune currently cannot support full fine-tuning on CPU with AdamW8bit because it relies on bitsandbytes' AdamW8bit optimizer.

#898 enabled AdamW8bit for CPU in the multi-backend-refactor branch, but the main branch doesn't have it.

It'd be great if we could enable AdamW8bit for CPU in the bitsandbytes main branch before TorchTune's next release (provided there is a bitsandbytes release before then), so that users who install TorchTune would automatically get a version of bitsandbytes that supports AdamW8bit on CPU.
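For context, here is a minimal sketch of what the requested support would let TorchTune users do. The toy model below is only illustrative; `bnb.optim.AdamW8bit` is the existing bitsandbytes optimizer class, which today assumes CUDA-backed parameters, and the request is for the same call to work when parameters live on the CPU:

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb

# Illustrative stand-in for a model being fine-tuned; note there is no .cuda()
# call, so all parameters stay on the CPU.
model = nn.Linear(1024, 1024)

# The optimizer TorchTune uses; with CPU support this instantiation and the
# step below would work without a GPU.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)

loss = model(torch.randn(2, 1024)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```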

Thanks!

Your contribution

@jianan-gu could port over his code from the multi-backend-refactor branch to the main branch.

cc @mingfeima @ashokei @TimDettmers

sanchitintel commented 1 month ago

#1220 will fix this issue.

matthewdouglas commented 1 month ago

> #1220 will fix this issue.

I don't recall seeing any optimizers implemented yet for CPU, but I may be mistaken.

A paged optimizer doesn't make sense to me for CPU, but I can understand the request for AdamW8bit.

sanchitintel commented 1 month ago

Thanks for pointing that out, @matthewdouglas! I've revised the description.

@jianan-gu @xia-weiwen, please clarify if you had added AdamW8bit implementation for CPU to bitsandbytes. If not, do you have plans to add it? Thanks!

Xia-Weiwen commented 1 month ago

@sanchitintel Yes, we are going to do it. cc. @jianan-gu @jiqing-feng

Titus-von-Koeller commented 1 month ago

@sanchitintel thanks for raising this. When is the next torchtune release foreseen?

Hmm, the problem is that the device abstraction / dispatcher situation is still not stable. Things will change fundamentally in the next 3 weeks. I'm not sure this can be done as a PR to main in isolation. @Xia-Weiwen could you sketch out a bit more how you think this would make sense?