Unfortunately this PR is pretty big. I had to change a few things to get FSDP working.
We now call an InferenceServer which serves the model predictions. It works both for FSDP-style and one-model-per-GPU-style serving.
We no longer depend on huggingface's dataset builder. Previously we used it for multiprocessing, with one model per process; now our InferenceServer manages that, and you can't use multiprocessing to call the InferenceServer from separate processes.
Instead we just collect the results manually and create our dataset ourselves.
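Roughly, that looks like the sketch below. The names here are illustrative, not the actual code: `run_inference` stands in for a call to our InferenceServer.

```python
from datasets import Dataset

def run_inference(text: str) -> list[float]:
    # Stand-in for the InferenceServer call; the real API may differ.
    return [0.0, 1.0]

examples = ["first prompt", "second prompt"]
rows = {"text": [], "prediction": []}
for text in examples:
    rows["text"].append(text)
    rows["prediction"].append(run_inference(text))

# Build the dataset ourselves in one shot, no builder needed.
ds = Dataset.from_dict(rows)
print(ds)
```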
Because of that, we need to roll our own cache.
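A minimal sketch of what a hand-rolled cache like that can look like (the cache location and keying here are assumptions, not the repo's actual layout):

```python
import hashlib
from pathlib import Path
from typing import Callable

from datasets import Dataset, load_from_disk

CACHE_DIR = Path.home() / ".cache" / "extraction"  # assumed location

def load_or_build(config: str, build: Callable[[], Dataset]) -> Dataset:
    # Key the cache on a hash of everything that affects the outputs
    # (model name, dataset, prompt template, ...).
    key = hashlib.md5(config.encode()).hexdigest()
    path = CACHE_DIR / key
    if path.exists():
        return load_from_disk(str(path))  # cache hit: skip extraction
    ds = build()                          # cache miss: actually run the model
    ds.save_to_disk(str(path))
    return ds
```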
And to keep the InferenceServer's workers fully utilized, we call it from multiple threads. The InferenceServer is designed to be thread-safe, so hopefully it works.
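The fan-out looks roughly like this (a sketch; `predict` again stands in for the thread-safe InferenceServer call):

```python
from concurrent.futures import ThreadPoolExecutor

def predict(text: str) -> list[float]:
    # Stand-in for the thread-safe InferenceServer call.
    return [0.0, 1.0]

examples = [f"prompt {i}" for i in range(100)]

# Fan requests out over several threads so the server's GPU workers
# never sit idle waiting for the next input.
with ThreadPoolExecutor(max_workers=8) as pool:
    predictions = list(pool.map(predict, examples))
```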
Issues

Figuring out the memory required

The --min_gpu_mem flag can be passed; it indicates the memory required for the whole model.
--min_gpu_mem {memory_required_for_whole_model}
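As a rough back-of-the-envelope (my numbers, not from the code): huggyllama 7b in fp16 is about 2 bytes per parameter, so:

```python
# Rough fp16 estimate: 2 bytes per parameter, activations not included.
n_params = 7_000_000_000   # huggyllama 7b
bytes_per_param = 2        # fp16
print(n_params * bytes_per_param)  # 14_000_000_000 -> about 14 GB
```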
mkl
You may encounter an error like this (Github issue):
Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
To fix it, run this before launching. I'm still figuring out why this happens; it's supposed to be fixed in the latest mkl package, but it isn't for me.
export MKL_THREADING_LAYER=GNU
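Alternatively (an untested sketch, per the error message's own suggestion), you can set the variable from Python, as long as it happens before numpy is imported:

```python
import os

# Must happen before numpy (or anything else that loads MKL) is imported.
os.environ["MKL_THREADING_LAYER"] = "GNU"

import numpy as np  # noqa: E402
```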
too many open files
Sometimes it'll complain about too many open files. Increase the ulimit:
ulimit -n 4048
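If you'd rather raise the limit from inside the process (a sketch, not something the repo does itself), the stdlib resource module can bump the soft limit up to the hard limit:

```python
import resource

# Raise the soft limit on open file descriptors up to the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
```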
❤️ QA instructions
Check out this branch: refactor-datasets-usage
Run elicit with huggyllama 7b with these variations:
- With FSDP. This shards the model across the devices.
- Without FSDP, but multi-GPU. This duplicates the model on each device.
For each of the runs, check that the eval.csv files are roughly the same (see the comparison sketch below), and let me know if it crashes.
Note that we are disabling the extraction cache here. Otherwise subsequent elicit runs won't actually run extraction with llama; they'll just reuse the cached results.
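A quick way to sanity-check "roughly the same" (a sketch; the file paths are hypothetical, and it assumes the rows of the two files line up):

```python
import numpy as np
import pandas as pd

# Compare two eval.csv files on their shared numeric columns.
a = pd.read_csv("run_fsdp/eval.csv")
b = pd.read_csv("run_multigpu/eval.csv")

numeric = a.select_dtypes("number").columns.intersection(
    b.select_dtypes("number").columns
)
for col in numeric:
    close = np.allclose(a[col], b[col], atol=1e-2)
    print(f"{col}: {'ok' if close else 'DIFFERS'}")
```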
Now compare this to the main branch. Is llama-7b significantly slower?
If the above works without crashing, and you're feeling ambitious, you can merge the latest changes into this branch and fix the conflicts. It may be confusing, though.