eole-nlp / eole

Open language modeling toolkit based on PyTorch
https://eole-nlp.github.io/eole
MIT License

Inference server, lots of related changes #42

Closed by francoishernandez 2 weeks ago

francoishernandez commented 3 months ago

This is a very first draft of a simple FastAPI-based inference server. It's not much yet, but it will be a first base to iterate on.

Key concepts/changes

Some short-term TODOs

Some nice-to-haves
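For concreteness, here is a minimal sketch of the kind of FastAPI server described above; the endpoint name, request fields and loader are illustrative placeholders, not the actual eole API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
_loaded = {}  # model id -> predictor, kept in memory between requests


class InferenceRequest(BaseModel):
    model: str          # identifier or path of the model to use
    inputs: list[str]   # texts to run inference on


def load_predictor(model_id: str):
    """Hypothetical loader; in eole this would build the real predictor."""
    class DummyPredictor:
        def predict(self, texts):
            return [f"prediction for: {t}" for t in texts]
    return DummyPredictor()


@app.post("/infer")
def infer(request: InferenceRequest):
    # Load the requested model on first use, then reuse it for later requests.
    if request.model not in _loaded:
        _loaded[request.model] = load_predictor(request.model)
    outputs = _loaded[request.model].predict(request.inputs)
    return {"predictions": outputs}
```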

francoishernandez commented 2 weeks ago

238ab22 -> mapped_tokens are retrieved from HF's added_tokens (special_tokens_map.json)
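For illustration, this is roughly how such tokens can be read back from a HF checkpoint directory; the exact structure eole builds for mapped_tokens is not shown in this thread, so the helper below is only an assumption:

```python
import json
from pathlib import Path


def read_special_tokens(hf_model_dir: str) -> dict:
    """Read the special tokens declared in a HF checkpoint's special_tokens_map.json
    (illustrative helper, not eole's actual conversion code)."""
    path = Path(hf_model_dir) / "special_tokens_map.json"
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    tokens = {}
    for name, tok in raw.items():
        if isinstance(tok, dict):        # e.g. {"content": "<|eot_id|>", ...}
            tokens[name] = tok["content"]
        elif isinstance(tok, str):       # plain string form
            tokens[name] = tok
        # "additional_special_tokens" (a list) is ignored in this sketch
    return tokens


# e.g. {"bos_token": "<|begin_of_text|>", "eos_token": "<|eot_id|>"}
```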

TODO:

francoishernandez commented 2 weeks ago

We can probably merge this. The server itself works. It needs some improvements (GPU/memory model management, error handling, etc.), but all of that can be added iteratively. This PR also fixes a few annoying things, such as the unnecessary "gpu" inference flag, and moves towards better support of llama-style placeholder tokens and chat templates. (Note: the eos_token patch in convert_HF is quite fishy, but #45 should make it better.) Bumping to 0.0.2/0.1.0 after merging might not hurt, for clarity. (Maybe 0.0.2 first, then 0.1.0 after finalizing #45.)
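On the chat-template side, this is the kind of llama-style prompt rendering involved; a quick illustration with the HF tokenizer (the checkpoint name is only an example, and this is not the eole code path):

```python
from transformers import AutoTokenizer

# Any HF checkpoint that ships a chat template works the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate 'bonjour' into English."},
]

# Renders the conversation with the model's own placeholder tokens,
# e.g. <|begin_of_text|>, <|start_header_id|>, <|eot_id|> for Llama 3.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```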

francoishernandez commented 2 weeks ago

d2fd18f aligns the behaviour of converted and trained models: the transforms_configs of a trained model are adapted to facilitate loading of the corresponding artifacts. E.g. when a model is trained with "long/path/to/subwords.bpe", that file is saved into the model's directory as "subwords.bpe" and the transform config in config.json is updated to "${MODEL_PATH}/subwords.bpe", allowing transparent loading when predicting (or later finetuning). This is quite a nice step towards simplifying the whole config/command management from a user's point of view, as we can now run inference via a simple command line, even with complex transforms.
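A small sketch of the idea, assuming the artifact is simply copied next to the checkpoint and the config keeps a model-relative reference (not the actual eole implementation):

```python
import shutil
from pathlib import Path


def relocate_transform_artifact(artifact_path: str, model_dir: str) -> str:
    """Copy a transform artifact (e.g. a BPE model) into the model directory and
    return the rewritten path to store in config.json."""
    artifact = Path(artifact_path)
    Path(model_dir).mkdir(parents=True, exist_ok=True)
    shutil.copy(artifact, Path(model_dir) / artifact.name)
    # The saved config keeps a model-relative reference, resolved at load time.
    return f"${{MODEL_PATH}}/{artifact.name}"


# relocate_transform_artifact("long/path/to/subwords.bpe", "my_model")
# -> "${MODEL_PATH}/subwords.bpe"
```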

francoishernandez commented 2 weeks ago

fe8e8d7 -> when the infer entrypoint is called with a new model, any model that is already loaded is unloaded before loading the new one, to prevent potential conflicts. More specific logic (multi-model, multi-GPU, memory limits, etc.) can be implemented later depending on use cases.
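A minimal sketch of that "unload before load" policy, assuming a single in-memory slot; the names are placeholders, not the actual server internals:

```python
import gc

import torch


class SingleModelSlot:
    """Keep at most one model loaded; swap it out when a different one is requested."""

    def __init__(self):
        self.model_id = None
        self.predictor = None

    def get(self, model_id: str, loader):
        if self.model_id != model_id:
            # Drop the previously loaded model and free GPU memory
            # before loading the new one.
            self.predictor = None
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
            self.predictor = loader(model_id)   # loader is supplied by the caller
            self.model_id = model_id
        return self.predictor
```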