eole-nlp / eole

Open language modeling toolkit based on PyTorch
https://eole-nlp.github.io/eole
MIT License

Inference server, lots of related changes #42

Closed by francoishernandez 2 weeks ago

francoishernandez commented 3 months ago

This is a very first draft of a simple FastAPI-based inference server. It's not much yet, but it will be a first base to iterate on.

Key concepts/changes

Some short-term TODOs

Some nice-to-haves
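For concreteness, here is a minimal sketch of the kind of FastAPI server described above; the endpoint name, request fields and loader are illustrative placeholders, not the actual eole API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
_loaded = {}  # model id -> predictor, kept in memory between requests


class InferenceRequest(BaseModel):
    model: str          # identifier or path of the model to use
    inputs: list[str]   # texts to run inference on


def load_predictor(model_id: str):
    """Hypothetical loader; in eole this would build the real predictor."""
    class DummyPredictor:
        def predict(self, texts):
            return [f"prediction for: {t}" for t in texts]
    return DummyPredictor()


@app.post("/infer")
def infer(request: InferenceRequest):
    # Load the requested model on first use, then reuse it for later requests.
    if request.model not in _loaded:
        _loaded[request.model] = load_predictor(request.model)
    outputs = _loaded[request.model].predict(request.inputs)
    return {"predictions": outputs}
```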

francoishernandez commented 2 weeks ago

238ab22 -> mapped_tokens are retrieved from HF's added_tokens (special_tokens_map.json)
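For illustration, this is roughly how such tokens can be read back from a HF checkpoint directory; the exact structure eole builds for mapped_tokens is not shown in this thread, so the helper below is only an assumption:

```python
import json
from pathlib import Path


def read_special_tokens(hf_model_dir: str) -> dict:
    """Read the special tokens declared in a HF checkpoint's special_tokens_map.json
    (illustrative helper, not eole's actual conversion code)."""
    path = Path(hf_model_dir) / "special_tokens_map.json"
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    tokens = {}
    for name, tok in raw.items():
        if isinstance(tok, dict):        # e.g. {"content": "<|eot_id|>", ...}
            tokens[name] = tok["content"]
        elif isinstance(tok, str):       # plain string form
            tokens[name] = tok
        # "additional_special_tokens" (a list) is ignored in this sketch
    return tokens


# e.g. {"bos_token": "<|begin_of_text|>", "eos_token": "<|eot_id|>"}
```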

TODO:

francoishernandez commented 2 weeks ago

We can probably merge this. The server itself works. It needs some improvements (GPU/memory model management, error handling, etc.), but all of that can be added iteratively. This PR also fixes a few annoying things, such as the unnecessary "gpu" inference flag, and moves towards better support of llama-style placeholder tokens and chat templates. (Note: the eos_token patch in convert_HF is quite fishy, but #45 should make it better.) Bumping to 0.0.2/0.1.0 after merging might not hurt, for clarity. (Maybe 0.0.2 first, then 0.1.0 after finalizing #45.)
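On the chat-template side, this is the kind of llama-style prompt rendering involved; a quick illustration with the HF tokenizer (the checkpoint name is only an example, and this is not the eole code path):

```python
from transformers import AutoTokenizer

# Any HF checkpoint that ships a chat template works the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate 'bonjour' into English."},
]

# Renders the conversation with the model's own placeholder tokens,
# e.g. <|begin_of_text|>, <|start_header_id|>, <|eot_id|> for Llama 3.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```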

francoishernandez commented 2 weeks ago

d2fd18f aligns the behaviour of converted and trained models: the transforms_configs of a trained model are adapted to facilitate loading of the corresponding artifacts. E.g. when a model is trained with "long/path/to/subwords.bpe", that file is saved into the model's directory as "subwords.bpe" and the transform config in config.json is updated to "${MODEL_PATH}/subwords.bpe", allowing transparent loading when predicting (or later finetuning). This is quite a nice step towards simplifying the whole config/command management from a user's point of view, as we can now run inference via a simple command line, even with complex transforms.
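A small sketch of the idea, assuming the artifact is simply copied next to the checkpoint and the config keeps a model-relative reference (not the actual eole implementation):

```python
import shutil
from pathlib import Path


def relocate_transform_artifact(artifact_path: str, model_dir: str) -> str:
    """Copy a transform artifact (e.g. a BPE model) into the model directory and
    return the rewritten path to store in config.json."""
    artifact = Path(artifact_path)
    Path(model_dir).mkdir(parents=True, exist_ok=True)
    shutil.copy(artifact, Path(model_dir) / artifact.name)
    # The saved config keeps a model-relative reference, resolved at load time.
    return f"${{MODEL_PATH}}/{artifact.name}"


# relocate_transform_artifact("long/path/to/subwords.bpe", "my_model")
# -> "${MODEL_PATH}/subwords.bpe"
```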

francoishernandez commented 2 weeks ago

fe8e8d7 -> when the infer entrypoint is called with a new model, any model that is already loaded is unloaded before loading the new one, to prevent potential conflicts. More specific logic (multi-model, multi-GPU, memory limits, etc.) can be implemented later depending on use cases.
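A minimal sketch of that "unload before load" policy, assuming a single in-memory slot; the names are placeholders, not the actual server internals:

```python
import gc

import torch


class SingleModelSlot:
    """Keep at most one model loaded; swap it out when a different one is requested."""

    def __init__(self):
        self.model_id = None
        self.predictor = None

    def get(self, model_id: str, loader):
        if self.model_id != model_id:
            # Drop the previously loaded model and free GPU memory
            # before loading the new one.
            self.predictor = None
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
            self.predictor = loader(model_id)   # loader is supplied by the caller
            self.model_id = model_id
        return self.predictor
```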