DifferentialityDevelopment closed this pull request 1 month ago.
I do need to make some alterations as it doesn't currently build on Linux, but they should be minor changes. I've confirmed that Windows at least seems to work correctly.
Nice job @DifferentialityDevelopment!
First thoughts:

pthreads-w32 dependency: this project uses just 2 functions from pthread.h. You can easily replace them by using CreateThread and WaitForSingleObject from windows.h. You can check how it's done in llama.cpp.

allocateAndLockMemory function: I didn't observe any significant improvement by using posix_memalign, maybe this could be removed. Maybe calloc would be enough? Or maybe here we need some tests.

Yeah, the changes in utils.cpp were messy; I haven't yet gotten around to working my way backwards so that it changes the least amount of code while still working. malloc just wouldn't work with the size of what needed to be allocated, and I wasn't having much luck with the Windows API functions for allocating large amounts of memory, so I ended up just using the simple vector approach, which worked like a charm.
It's a first working draft; I can modify it going forward to remove the dependency on pthreads-win32.
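For reference, a minimal sketch of what that replacement could look like, assuming the two pthread functions in use are pthread_create and pthread_join (illustrative only, not the actual change):

```cpp
// Illustrative sketch: a tiny portability layer so the worker threads can use
// windows.h on Windows and pthread.h elsewhere, removing the pthreads-w32 dependency.
#ifdef _WIN32
#include <windows.h>
typedef HANDLE ThreadHandle;
typedef DWORD (WINAPI *ThreadRoutine)(LPVOID);

static bool threadStart(ThreadHandle* handle, ThreadRoutine routine, void* arg) {
    *handle = CreateThread(NULL, 0, routine, arg, 0, NULL);
    return *handle != NULL;
}

static void threadJoin(ThreadHandle handle) {
    WaitForSingleObject(handle, INFINITE); // block until the worker finishes
    CloseHandle(handle);
}
#else
#include <pthread.h>
typedef pthread_t ThreadHandle;
typedef void* (*ThreadRoutine)(void*);

static bool threadStart(ThreadHandle* handle, ThreadRoutine routine, void* arg) {
    return pthread_create(handle, NULL, routine, arg) == 0;
}

static void threadJoin(ThreadHandle handle) {
    pthread_join(handle, NULL);
}
#endif
```

The only wrinkle is that the thread routine signature differs between the two APIs, so the worker function itself also needs a small #ifdef around its return type.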
Still need to do some more testing and refinement, but it does at least build again now on both Linux & Windows, and I've also removed the pthreads-win32 dependency.
Can we add Windows to .github/workflows/main.yml?
That would be great, then we'll know if a change breaks either platform :) I'll do some refactoring on the code tonight to clean it up a bit.
I think the main things left to refactor are the changes in transformers.cpp & utils.cpp
I refactored utils.cpp; gracefullyAllocateBuffer acts as a fallback for allocating the memory buffer, which also sorts out the weirdness that happens if you don't run distributed-llama as sudo.
I was able to run dllama without running it as sudo in Linux using this approach
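For context, a rough sketch of the fallback idea (illustrative only; the body below is an assumption, not the actual utils.cpp code): try a locked mapping first, and fall back to a plain allocation instead of aborting when locking fails:

```cpp
// Illustrative sketch, not the real gracefullyAllocateBuffer: attempt a locked
// mmap buffer first and degrade gracefully when the process lacks the
// privileges that running as sudo would provide.
#include <cstdio>
#include <cstdlib>
#include <stdexcept>
#include <sys/mman.h>

static void* gracefullyAllocateBuffer(size_t bytes) {
    void* buffer = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buffer != MAP_FAILED) {
        if (mlock(buffer, bytes) != 0) {
            // mlock commonly fails for non-root users (low RLIMIT_MEMLOCK);
            // keep the mapping and simply continue without locking it.
            fprintf(stderr, "warning: could not lock %zu bytes, continuing unlocked\n", bytes);
        }
        return buffer;
    }
    // Fallback path: ordinary heap allocation (the caller would need to track
    // which path was taken in order to free the buffer correctly).
    buffer = malloc(bytes);
    if (buffer == NULL)
        throw std::runtime_error("Cannot allocate buffer");
    return buffer;
}
```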
./dllama inference --model /mnt/d/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer /mnt/d/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --steps 64 --prompt "Hello world"
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 1
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
mmap succeeded. data = 0x7f8c7260a000 weights = 0x7f8c7260a060
🕒 ropeCache: 32768 kB
⏩ Loaded 6175568 kB
🔶 G 421 ms I 421 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 382 ms I 382 ms T 0 ms S 0 kB R 0 kB world
🔶 G 421 ms I 420 ms T 0 ms S 0 kB R 0 kB !
🔶 G 385 ms I 384 ms T 0 ms S 0 kB R 0 kB This
🔶 G 390 ms I 389 ms T 0 ms S 0 kB R 0 kB is
🔶 G 377 ms I 377 ms T 0 ms S 0 kB R 0 kB a
🔶 G 389 ms I 387 ms T 1 ms S 0 kB R 0 kB test
🔶 G 395 ms I 395 ms T 0 ms S 0 kB R 0 kB of
🔶 G 381 ms I 380 ms T 1 ms S 0 kB R 0 kB the
🔶 G 376 ms I 374 ms T 1 ms S 0 kB R 0 kB emergency
🔶 G 453 ms I 451 ms T 2 ms S 0 kB R 0 kB broadcast
🔶 G 421 ms I 420 ms T 1 ms S 0 kB R 0 kB system
🔶 G 423 ms I 421 ms T 1 ms S 0 kB R 0 kB .
I've updated the readme as well.
Please let me know when I can review again (for example, main.yml is still not updated).
You're welcome to review again.
I don't have much experience with GitHub workflows, but I'll try to update main.yml to include a Windows build 👍
I reverted .github/workflows/main.yml to the previous approach. The new approach didn't test all Linux CPUs.
@b4rtaz It seems you removed a bit of code that was necessary in transformers.cpp
Without it I am unable to load the model files.
./dllama-api.exe --model D:\openchat-3.6-8b-20240522-distributed\dllama_model_openchat-3.6-8b-20240522_q40.m --tokenizer D:\openchat-3.6-8b-20240522-distributed\dllama_tokenizer_llama3.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --chat-template openchat3 --port 10111
💡 arch: llama
💡 hiddenAct: silu
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 8192
💡 nSlices: 1
💡 ropeTheta: 500000.0
terminate called after throwing an instance of 'std::runtime_error'
  what(): Error determining model file size
If I add it back in then it works.
@DifferentialityDevelopment ah, sorry! But how is this able to compile now? If #define ftell(fp) _ftelli64(fp) overrides anything, I think it would be better to create an own function, like seekToEnd or similar (with #ifdef _WIN32 ... #else).
It compiles fine because the arguments for both functions are the same. On Windows it's usually better to use _ftelli64 and _fseeki64 instead of the standard C functions anyway, since long is only 32 bits there and the model files are larger than 2 GB.
That said, any other use of ftell would also get rewritten by the macro, which I guess is where you're coming from.
Maybe something like:
#ifdef _WIN32
static inline long long fileGetLength(FILE *fp) { return _ftelli64(fp); }
static inline int fileSeekEnd(FILE *fp, long long offset, int origin) { return _fseeki64(fp, offset, origin); }
#else
static inline long long fileGetLength(FILE *fp) { return ftell(fp); }
static inline int fileSeekEnd(FILE *fp, long long offset, int origin) { return fseek(fp, (long) offset, origin); }
#endif
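For illustration, a hedged sketch of how such wrappers could be used when determining the model file size (getFileSize is a hypothetical helper built on the fileSeekEnd/fileGetLength functions above, not necessarily what transformers.cpp actually does):

```cpp
// Hedged usage sketch: determine the model file size with the 64-bit-safe
// wrappers, so files larger than 2 GB work on Windows as well.
#include <cstdio>
#include <stdexcept>

static long long getFileSize(FILE* fp) {
    if (fileSeekEnd(fp, 0, SEEK_END) != 0)
        throw std::runtime_error("Error determining model file size");
    long long size = fileGetLength(fp);
    if (size < 0)
        throw std::runtime_error("Error determining model file size");
    rewind(fp); // return to the beginning before the header is parsed
    return size;
}
```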
Does this PR solve the problem?
I'll give it a try, but it looks like it should work just fine 👍
Not sure what's going on exactly now; I've just tried it out, but now I'm getting a "Cannot open file" error. Will check if I did something wrong.
@b4rtaz I figured it out, you forgot to add an fclose at the end of loadSpecFromFile.
Other than that it works perfectly, thank you!
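For clarity, a hypothetical outline of the shape of that fix (the real loadSpecFromFile signature and body differ; the point is only the missing fclose before returning):

```cpp
// Hypothetical outline only: the FILE* opened at the top of loadSpecFromFile
// must be closed before returning; the missing fclose is what led to the
// "Cannot open file" error reported above.
#include <cstdio>
#include <stdexcept>

void loadSpecFromFile(const char* path) {
    FILE* fp = fopen(path, "rb");
    if (fp == NULL)
        throw std::runtime_error("Cannot open file");
    // ... read the header fields and weight layout here ...
    fclose(fp); // the call that was missing
}
```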
@DifferentialityDevelopment thanks for the help and sorry for the problem. Probably I need some Windows environment.
Was able to get it working on Windows, still need to do some cleaning up and more thorough testing.