b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License
1.02k stars 68 forks

feat: windows support #63

Closed DifferentialityDevelopment closed 1 month ago

DifferentialityDevelopment commented 1 month ago

Was able to get it working on Windows, still need to do some cleaning up and more thorough testing.

DifferentialityDevelopment commented 1 month ago

I do need to do some alterations as it doesn't currently build on Linux, but it should be minor changes. Though I've confirmed that Windows seems to work correctly at least.

b4rtaz commented 1 month ago

Nice job @DifferentialityDevelopment!

First thoughts:

DifferentialityDevelopment commented 1 month ago

Yeah, the changes in utils.cpp were messy; I haven't yet gotten around to working my way backwards so that it changes the least amount of code while still working. malloc just wouldn't work with the size of what needed to be allocated, and I wasn't having much luck with the Windows API functions for allocating large amounts of memory, so I ended up just using a simple vector approach, which worked like a charm.
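For reference, a minimal sketch of what I mean by the vector approach (the names here are illustrative, not the exact code in utils.cpp):

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: let std::vector own the large buffer instead of a raw
// malloc/VirtualAlloc call, and hand out its data() pointer as the buffer.
static std::vector<char> largeBuffer;

static char* allocateLargeBuffer(size_t bytes) {
    largeBuffer.resize(bytes); // throws std::bad_alloc on failure instead of returning NULL
    return largeBuffer.data();
}
```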

It's a first working draft; going forward I can modify it to remove the dependency on pthreads-win32.

DifferentialityDevelopment commented 1 month ago

Still need to do some more testing and refinement, but it does at least build again on both Linux and Windows, and I've also removed the pthreads-win32 dependency.
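The rough shape of the replacement is a small cross-platform wrapper around native threads, something like the sketch below (illustrative only; the names and signatures are assumptions, not the exact code in this PR):

```cpp
#ifdef _WIN32
#include <windows.h>
typedef HANDLE dlThread;

// Sketch: use the Win32 API directly instead of pthreads-win32. Note the
// worker signature differs per platform: DWORD WINAPI fn(LPVOID) here,
// void* fn(void*) in the pthread branch below.
static int threadCreate(dlThread* t, LPTHREAD_START_ROUTINE fn, void* arg) {
    *t = CreateThread(NULL, 0, fn, arg, 0, NULL);
    return (*t == NULL) ? -1 : 0;
}
static void threadJoin(dlThread t) {
    WaitForSingleObject(t, INFINITE);
    CloseHandle(t);
}
#else
#include <pthread.h>
typedef pthread_t dlThread;

static int threadCreate(dlThread* t, void* (*fn)(void*), void* arg) {
    return pthread_create(t, NULL, fn, arg);
}
static void threadJoin(dlThread t) {
    pthread_join(t, NULL);
}
#endif
```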

b4rtaz commented 1 month ago

Can we add Windows to .github/workflows/main.yml?

DifferentialityDevelopment commented 1 month ago

> Can we add Windows to .github/workflows/main.yml?

That would be great; then we'd know if a change breaks either platform :) I'll do some refactoring on the code tonight to clean it up a bit.

DifferentialityDevelopment commented 1 month ago

I think the main things left to refactor are the changes in transformers.cpp & utils.cpp

DifferentialityDevelopment commented 1 month ago

I refactored utils.cpp; gracefullyAllocateBuffer acts as a fallback for allocating the memory buffer, which also sorts out the weirdness that happens if you don't run distributed-llama as sudo.
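The idea is roughly the sketch below; this is only illustrative of the fallback behaviour, not the exact implementation in utils.cpp:

```cpp
#include <cstdio>
#include <sys/mman.h>
#include <vector>

// Sketch of the idea: try to get a locked buffer, and fall back to a plain
// allocation when mlock fails (e.g. when dllama is not run as sudo).
static std::vector<char> fallbackBuffer;

static void* gracefullyAllocateBufferSketch(size_t bytes) {
    void* ptr = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ptr != MAP_FAILED && mlock(ptr, bytes) == 0)
        return ptr;                 // locked buffer, the happy path
    if (ptr != MAP_FAILED)
        munmap(ptr, bytes);         // mlock failed (no sudo), release and fall back
    fprintf(stderr, "mlock failed, falling back to a regular allocation\n");
    fallbackBuffer.resize(bytes);
    return fallbackBuffer.data();
}
```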

Using this approach I was able to run dllama on Linux without sudo:

```
./dllama inference --model /mnt/d/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer /mnt/d/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --steps 64 --prompt "Hello world"
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 1
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
mmap succeeded. data = 0x7f8c7260a000 weights = 0x7f8c7260a060
🕒 ropeCache: 32768 kB
⏩ Loaded 6175568 kB
🔶 G 421 ms I 421 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 382 ms I 382 ms T 0 ms S 0 kB R 0 kB world
🔶 G 421 ms I 420 ms T 0 ms S 0 kB R 0 kB !
🔶 G 385 ms I 384 ms T 0 ms S 0 kB R 0 kB This
🔶 G 390 ms I 389 ms T 0 ms S 0 kB R 0 kB is
🔶 G 377 ms I 377 ms T 0 ms S 0 kB R 0 kB a
🔶 G 389 ms I 387 ms T 1 ms S 0 kB R 0 kB test
🔶 G 395 ms I 395 ms T 0 ms S 0 kB R 0 kB of
🔶 G 381 ms I 380 ms T 1 ms S 0 kB R 0 kB the
🔶 G 376 ms I 374 ms T 1 ms S 0 kB R 0 kB emergency
🔶 G 453 ms I 451 ms T 2 ms S 0 kB R 0 kB broadcast
🔶 G 421 ms I 420 ms T 1 ms S 0 kB R 0 kB system
🔶 G 423 ms I 421 ms T 1 ms S 0 kB R 0 kB .
```

DifferentialityDevelopment commented 1 month ago

I've updated the readme as well.

b4rtaz commented 1 month ago

Please let me know when I can review again (for example, main.yml is still not updated).

DifferentialityDevelopment commented 1 month ago

> Please let me know when I can review again (for example, main.yml is still not updated).

You're welcome to review again.

I don't have much experience with GitHub workflows, but I'll try to update main.yml to include a Windows build 👍

b4rtaz commented 1 month ago

I reverted .github/workflows/main.yml to the previous approach. The new approach didn't test all Linux CPUs.

DifferentialityDevelopment commented 1 month ago

@b4rtaz It seems you removed a bit of code in transformers.cpp that was necessary:

```cpp
#ifdef _WIN32
#define ftell(fp) _ftelli64(fp)
#define fseek(fp, offset, origin) _fseeki64(fp, offset, origin)
#endif
```

Without it I am unable to load the model files.

```
./dllama-api.exe --model D:\openchat-3.6-8b-20240522-distributed\dllama_model_openchat-3.6-8b-20240522_q40.m --tokenizer D:\openchat-3.6-8b-20240522-distributed\dllama_tokenizer_llama3.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --chat-template openchat3 --port 10111
💡 arch: llama
💡 hiddenAct: silu
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 8192
💡 nSlices: 1
💡 ropeTheta: 500000.0
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error determining model file size
```

If I add it back in, then it works.

b4rtaz commented 1 month ago

@DifferentialityDevelopment ah, sorry! But how is this able to compile now? If #define ftell(fp) _ftelli64(fp) overrides anything, I think it would be better to create our own function, like seekToEnd or similar (with #ifdef _WIN32 ... #else).

DifferentialityDevelopment commented 1 month ago

> @DifferentialityDevelopment ah, sorry! But how is this able to compile now? If #define ftell(fp) _ftelli64(fp) overrides anything, I think it would be better to create our own function, like seekToEnd or similar (with #ifdef _WIN32 ... #else).

It compiles fine because the arguments of both functions are the same. On Windows it's usually better to use _ftelli64 and _fseeki64 instead of the standard C functions, because the standard ones work with long, which is only 32 bits on Windows and can't represent offsets in files larger than 2 GB.

That said, if any other code happens to use the name ftell, it would be rewritten by the macro, which I guess is where you're coming from.

Maybe something like:

```cpp
#ifdef _WIN32
// Use the MSVC 64-bit offset functions so large model files work on Windows.
static inline long long fileGetLength(FILE *fp) { return _ftelli64(fp); }
static inline int fileSeekEnd(FILE *fp, long long offset, int origin) { return _fseeki64(fp, offset, origin); }
#else
static inline long long fileGetLength(FILE *fp) { return ftell(fp); }
static inline int fileSeekEnd(FILE *fp, long long offset, int origin) { return fseek(fp, offset, origin); }
#endif
```
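To show the intended usage (a sketch only; the real body of the loading code may differ), determining the model file size would then look roughly like this:

```cpp
#include <cstdio>
#include <stdexcept>

// Illustrative only: size the model file with the wrappers above.
static long long getModelFileSize(FILE *fp) {
    if (fileSeekEnd(fp, 0, SEEK_END) != 0)
        throw std::runtime_error("Error determining model file size");
    long long size = fileGetLength(fp);
    if (size < 0)
        throw std::runtime_error("Error determining model file size");
    fileSeekEnd(fp, 0, SEEK_SET); // rewind before reading the header
    return size;
}
```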

b4rtaz commented 1 month ago

Does this PR solve the problem?

DifferentialityDevelopment commented 1 month ago

> Does this PR solve the problem?

I'll give it a try, but it looks like it should work just fine 👍

DifferentialityDevelopment commented 1 month ago

> > Does this PR solve the problem?
>
> I'll give it a try, but it looks like it should work just fine 👍

Not sure what's going on exactly; I've just tried it out, but now I'm getting a "Cannot open file" error. I'll check whether I did something wrong.

DifferentialityDevelopment commented 1 month ago

@b4rtaz I figured it out: you forgot to add fclose at the end of loadSpecFromFile.

Other than that it works perfectly, thank you!
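In other words, something along these lines at the end of loadSpecFromFile (a sketch; the signature and surrounding code are assumptions, not the actual function):

```cpp
#include <cstdio>
#include <stdexcept>

struct TransformerSpec; // placeholder for the real spec type

// Hypothetical shape of the fix: close the handle once the spec has been read,
// otherwise the still-open handle can block reopening the model file later.
void loadSpecFromFile(TransformerSpec* spec, const char* path) { // illustrative signature
    FILE* fp = fopen(path, "rb");
    if (fp == NULL)
        throw std::runtime_error("Cannot open file");
    // ... read the spec fields from fp ...
    fclose(fp); // <- the missing call
}
```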

b4rtaz commented 1 month ago

@DifferentialityDevelopment thanks for the help, and sorry for the problem. Probably I need a Windows environment.