Closed francoishernandez closed 2 weeks ago
238ab22 -> `mapped_tokens` are retrieved from HF's added_tokens (`special_tokens_map.json`).

TODO: merge `inference.json` into the main `config.json` to avoid multiplying files.

We can probably merge this. The server itself works. It needs some improvements (GPU/memory model management, error handling, etc.), but all of that can be added iteratively. This PR also fixes a few annoying things, such as the unnecessary "gpu" inference flag, and moves towards better support of llama-style placeholder tokens and chat templates. (Note: the `eos_token` patch in `convert_HF` is quite fishy, but #45 should make it better.) Bumping to 0.0.2/0.1.0 after merging might not hurt, for clarity. (Maybe 0.0.2 first, then 0.1.0 after finalizing #45.)
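For context, `special_tokens_map.json` maps roles like `eos_token` to token strings (or to dicts with a `content` field, plus an optional `additional_special_tokens` list). A rough sketch of how mapped tokens could be collected from it (illustrative only, not the actual `convert_HF` code):

```python
import json

def read_mapped_tokens(special_tokens_map_path):
    """Collect special token strings from an HF special_tokens_map.json file.

    Values may be plain strings, dicts with a "content" field, or lists
    (for "additional_special_tokens").
    """
    with open(special_tokens_map_path) as f:
        mapping = json.load(f)
    tokens = []
    for value in mapping.values():
        if isinstance(value, dict):
            tokens.append(value["content"])
        elif isinstance(value, list):
            tokens.extend(v["content"] if isinstance(v, dict) else v for v in value)
        else:
            tokens.append(value)
    return tokens
```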
d2fd18f aligns the behaviour of converted and trained models: the `transforms_configs` of a trained model are adapted to facilitate loading of the corresponding artifacts. E.g. when training a model with "long/path/to/subwords.bpe", this file will be saved to the model's directory as "subwords.bpe", and the transform config in `config.json` will be updated to "${MODEL_PATH}/subwords.bpe", allowing transparent loading when predicting (or later finetuning).

This is quite a nice step towards simplifying the whole config/command management from a user point of view, as we can now run inference via a simple command line, even with complex transforms.
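A minimal sketch of the artifact-localization idea described above (the `localize_transform_artifacts` helper is hypothetical, not the actual implementation):

```python
import os
import shutil

def localize_transform_artifacts(transforms_configs, model_dir):
    """Copy transform artifacts (e.g. BPE models) into the model directory
    and rewrite their config paths relative to ${MODEL_PATH}."""
    for transform, config in transforms_configs.items():
        for key, value in list(config.items()):
            # Only rewrite entries that point at an existing artifact file.
            if isinstance(value, str) and os.path.isfile(value):
                basename = os.path.basename(value)
                shutil.copy(value, os.path.join(model_dir, basename))
                config[key] = "${MODEL_PATH}/" + basename
    return transforms_configs
```

At predict time, `${MODEL_PATH}` would then be substituted with the actual model directory, so the artifact resolves wherever the model is moved.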
fe8e8d7 -> when calling the `infer` entrypoint on a new model, unload any model that is already loaded before loading the new one, to prevent potential conflicts. More specific logic (multi-model, multi-GPU, memory limits, etc.) can be implemented later depending on use cases.
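The unload-before-load behaviour can be sketched roughly as follows (`ModelRegistry` and its loader callback are illustrative names, not the server's actual API):

```python
class ModelRegistry:
    """Holds at most one loaded model; swapping models frees the old one first."""

    def __init__(self):
        self.loaded_id = None
        self.model = None

    def get(self, model_id, loader):
        if self.loaded_id != model_id:
            # Drop whatever is currently loaded to avoid GPU/memory conflicts
            # before loading the requested model.
            self.model = None
            self.loaded_id = None
            self.model = loader(model_id)
            self.loaded_id = model_id
        return self.model
```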
This is a very first draft of a simple FastAPI-based inference server. It is not much yet, but it will be a first base to iterate on.
Key concepts/changes

- `transforms` and `transforms_configs` are saved in an `inference.json` config file within the model directory, for transparent loading + tentative adaptation of `convert_HF` to grab everything transparently;
- new `DecodingConfig`; some settings renamed (`random_sampling_topk/p` -> `top_k/p`, `random_sampling_temp` -> `temperature`) and homogenized across the code;
- removed the `gpu` flag in PredictConfig, a duplicate of `world_size`/`gpu_ranks` (might still be improved though).

Some short-term TODOs

Some nice-to-haves
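The decoding-option renames listed above could be handled for legacy configs with a small shim like this (a sketch: the mapping comes from this PR, the helper itself is hypothetical):

```python
# Old option names -> new DecodingConfig names, per this PR.
LEGACY_DECODING_KEYS = {
    "random_sampling_topk": "top_k",
    "random_sampling_topp": "top_p",
    "random_sampling_temp": "temperature",
}

def upgrade_decoding_config(config):
    """Return a copy of `config` with legacy decoding keys renamed;
    unknown keys are passed through unchanged."""
    return {LEGACY_DECODING_KEYS.get(key, key): value for key, value in config.items()}
```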