EleutherAI / elk

Keeping language models honest by directly eliciting knowledge encoded in their activations.
MIT License

WIP: Implement FSDP, drop usage of GeneratorBuilder, DIY caching #221

Closed thejaminator closed 1 year ago

thejaminator commented 1 year ago

Unfortunately this is a pretty big PR. I had to change a number of things to get FSDP working.

You can try it out with:

elk elicit huggyllama/llama-{7b,13b,30b,65b} imdb --fsdp_enabled --num_gpus {2-8}
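For context, FSDP (PyTorch's FullyShardedDataParallel) shards a single copy of the model's parameters across the GPUs instead of replicating the whole model on each one. A minimal sketch of the idea, not the actual code in this PR; the model name and process-group setup here are just illustrative:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

def load_sharded(rank: int, world_size: int) -> FSDP:
    # One process per GPU; assumes MASTER_ADDR/MASTER_PORT are set
    # (e.g. by torchrun or mp.spawn boilerplate).
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = AutoModelForCausalLM.from_pretrained(
        "huggyllama/llama-7b", torch_dtype=torch.float16
    )
    # Each rank holds roughly 1/world_size of the parameters; full
    # layers are gathered on the fly during the forward pass.
    return FSDP(model, device_id=rank)
```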

Issues

Figuring out the memory required

The --min_gpu_mem flag can be passed. It indicates the memory required for the whole model.

--min_gpu_mem {memory_required_for_whole_model}
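As a rough rule of thumb for picking the value: fp16 weights take about 2 bytes per parameter, so llama-7b needs on the order of 7e9 × 2 ≈ 14 GB for the weights alone. Assuming the flag takes a byte count (an assumption on my part, check the CLI help), a run would look like:

elk elicit huggyllama/llama-7b imdb --fsdp_enabled --num_gpus 2 --min_gpu_mem 14000000000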

mkl

You may encounter an error like this (see the related Github issue):

Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.

To fix it, run this before running elicit. I'm still figuring out why this happens; it's supposed to be fixed in the latest mkl package, but it isn't for me.

export MKL_THREADING_LAYER=GNU
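Alternatively, you can set the variable for a single invocation instead of exporting it for the whole shell:

MKL_THREADING_LAYER=GNU elk elicit huggyllama/llama-7b imdb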

too many open files


Sometimes it'll complain about too many open files. Increase the ulimit:

ulimit -n 4048
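You can check the current limit with ulimit -n (no argument). The raised limit only applies to the current shell session, so rerun it in each new shell.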

❤️ QA instructions

Check out this branch (refactor-datasets-usage) and run elicit with huggyllama llama-7b with the variations below. For each of the runs, check that the eval.csv files are roughly the same (see the comparison sketch after the commands), and let me know if it crashes. Note that we are disabling the cache for extraction here; otherwise subsequent elicit runs won't actually run the extraction with llama, they will just reuse the cached results.

With fsdp. This shards the model across the devices.

elk elicit huggyllama/llama-7b imdb --fsdp_enabled --num_gpus 2 --disable_cache

Without fsdp, but multi-GPU. This duplicates the model on each device.

elk elicit huggyllama/llama-7b imdb --num_gpus 2 --disable_cache
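To check that the two runs produced roughly the same eval.csv, a hypothetical helper along these lines works (the paths, column handling, and tolerance are placeholders, not elk's actual layout):

```python
import numpy as np
import pandas as pd

def csvs_roughly_equal(path_a: str, path_b: str, atol: float = 0.02) -> bool:
    a, b = pd.read_csv(path_a), pd.read_csv(path_b)
    num_a, num_b = a.select_dtypes("number"), b.select_dtypes("number")
    # Compare only the numeric columns, elementwise, within atol.
    return num_a.shape == num_b.shape and np.allclose(num_a, num_b, atol=atol)

print(csvs_roughly_equal("fsdp_run/eval.csv", "multigpu_run/eval.csv"))
```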

Now compare this to the main branch. Does llama-7b run significantly slower?

If the above works without crashing, and you are feeling ambitious, you can merge the latest changes into this branch and fix the conflicts. That may be confusing, though.