@greenw0lf OK, will check this out.
@greenw0lf I refactored the code a bit here: `w_model` and `model_base_dir` are now imported from the cfg and passed into the functions. This way the functions can be tested more easily.

I won't mind merging this sooner, but the error handling would be good to add before that. Unit tests I leave up to you, but since the functions are now quite isolated they should be easy to test.
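To make the testability point concrete, here is a minimal sketch of that kind of decoupling. The function name, cfg attribute names and the pytest test are illustrative only, not the PR's actual code:

```python
# Illustrative sketch: the function takes `w_model` and `model_base_dir` as
# plain arguments instead of reading the cfg module itself, so it can be
# unit-tested without any config setup. Names here are assumptions.
import os


def model_is_available_locally(w_model: str, model_base_dir: str) -> bool:
    """True if the model referenced by w_model is already on disk."""
    return os.path.isdir(os.path.join(model_base_dir, w_model))


# Only the call site touches the cfg (attribute names are assumptions):
# model_is_available_locally(cfg.W_MODEL, cfg.MODEL_BASE_DIR)


def test_model_is_available_locally(tmp_path):
    """With the cfg decoupled, a pytest unit test needs no config at all."""
    (tmp_path / "large-v2").mkdir()
    assert model_is_available_locally("large-v2", str(tmp_path))
    assert not model_is_available_locally("tiny", str(tmp_path))
```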
@greenw0lf oh yeah, could you also extend the main function call with a way to just download the model? This way we can also reuse the same docker image to just download the model into a shared volume (and after that start up one or more whisper services).
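One way this could look, just to make the idea concrete. The `--download-only` flag name and the shape of `main()` are assumptions, not the worker's actual CLI:

```python
# Hypothetical sketch of a download-only mode for the worker's entry point.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="whisper ASR worker")
    parser.add_argument(
        "--download-only",
        action="store_true",
        help="Only download the model into the shared /model volume, then exit",
    )
    args = parser.parse_args()

    # Resolve/download the model first in either case, e.g. via the
    # get_model_location function proposed further down in this thread:
    # model_location = get_model_location(os.environ["W_MODEL"], "/model")

    if args.download_only:
        # The model is now in the shared volume; exit so that one or more
        # whisper services can be started against that volume afterwards.
        return

    # ... start the whisper service here ...


if __name__ == "__main__":
    main()
```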
So, what I propose when it comes to determining the model used is the following:

- `check_model_availability` now becomes `get_model_location` (see the sketch below this list)
- It returns a `str` that contains either the path to the model (if `W_MODEL` is an HTTP/S3 URI) or `W_MODEL` itself (if `W_MODEL` is a pretrained model version, such as `large-v2` or `tiny`)
- If `W_MODEL` is a URI, it will attempt to download it (it is expected to be all zipped up in a `.tar.gz` file) and save it in the `/model` folder, under a folder with the same name as the zip/tar file that was downloaded. For example, for `whisper_custom.tar.gz`, the files will be extracted under `/model/whisper_custom.tar/`
- The provenance should record both `W_MODEL` and the model that was actually used, otherwise provenance would report wrong info

Let me know if something is missing or isn't explained properly.
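To make the proposal concrete, here is a rough sketch of what `get_model_location` could look like under the assumptions above (plain `urllib` + `tarfile`, HTTP(S) only; S3 handling, checksums and fuller error handling are left out). This is a sketch, not the actual implementation:

```python
import os
import tarfile
import urllib.request


def get_model_location(w_model: str, model_base_dir: str = "/model") -> str:
    """Return a local model path (for URIs) or the pretrained version name.

    The caller should record both W_MODEL and the returned location so that
    the provenance reports the model that was actually used.
    """
    if not w_model.startswith(("http://", "https://", "s3://")):
        # Pretrained model version, e.g. "large-v2" or "tiny": return as-is.
        return w_model

    if w_model.startswith("s3://"):
        raise NotImplementedError("S3 download would need boto3; omitted in this sketch")

    archive_name = os.path.basename(w_model)  # e.g. whisper_custom.tar.gz
    # Extract under a folder named after the archive, e.g. /model/whisper_custom.tar/
    extract_dir = os.path.join(model_base_dir, archive_name.removesuffix(".gz"))

    if os.path.isdir(extract_dir):
        # Model was already downloaded and extracted earlier.
        return extract_dir

    os.makedirs(model_base_dir, exist_ok=True)
    archive_path = os.path.join(model_base_dir, archive_name)
    try:
        urllib.request.urlretrieve(w_model, archive_path)
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(extract_dir)
    except Exception as err:
        raise RuntimeError(f"Could not fetch model from {w_model}") from err
    finally:
        if os.path.exists(archive_path):
            os.remove(archive_path)

    return extract_dir
```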