ycouble opened this issue 4 months ago (Open)
@samuelrince Maybe we can discuss this together with @inimaz
Yes, good point, and it's the same question for the Python package itself: how can we sync updates of models.csv without needing to release a new version, while also keeping it local-first?
If you have thought of a process or an architecture, we can work on that for both libs.
It heavily depends on how often models.csv is going to be updated.
Another alternative to the submodule: we could generate the models.csv file per release, i.e. fetch it in the CI, save it to a file, and include it in the package at build time.
That's good if we can trigger a new release for each release of the Python lib.
The drawback of the release approach is that users will have to update their dependencies to get the newest models, vs. just a code update if it's dynamic.
I like the idea of having the model_repository in its own git repository with its own release cycle @inimaz.
For each release of the model_rep we generate a "database file" that contains all the models and hypotheses we used. This file can be stored in the Git repo and accessed through the GitHub API (all free of charge at the beginning).
Each time a new release of a client (ecologits or ecologits.js) happens, we inject the latest version of model_rep in CI. Plus, we can add a mechanism directly in the clients to check (max once a day?) if a new version of model_rep is available (with an opt-out flag).
Thus, we can always have the latest version of model_rep without an update of the clients.
This can also help us increase transparency on our hypotheses: e.g. if we change the parameters for gpt-4 over time, users can clearly see when we made the update and why.
If we go for that solution, I would consider updating the format as well to make it more flexible and probably generate a JSON file.
Example format (to challenge):
[
  {
    "type": "model",
    "provider": "openai",
    "name": "gpt-3.5-turbo-01-0125",
    "architecture": {
      "type": "dense",
      "parameters": {
        "min": 20,
        "max": 70
      }
    },
    "warnings": [
      "model_architecture_not_released"
    ],
    "sources": [
      "https://platform.openai.com/docs/models/gpt-3-5-turbo",
      "https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit"
    ]
  },
  {
    "type": "alias",
    "provider": "openai",
    "name": "gpt-3.5-turbo",
    "alias": "gpt-3.5-turbo-01-0125"
  },
  {
    "type": "model",
    "provider": "openai",
    "name": "gpt-4-turbo-2024-04-09",
    "architecture": {
      "type": "moe",
      "parameters": {
        "total": 880,
        "active": {
          "min": 110,
          "max": 440
        }
      }
    },
    "warnings": [
      "model_architecture_not_released",
      "model_architecture_multimodal"
    ],
    "sources": [
      "https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4",
      "https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit"
    ]
  },
  {
    "type": "model",
    "provider": "mistralai",
    "name": "open-mistral-7b",
    "architecture": {
      "type": "dense",
      "parameters": 7.3
    },
    "warnings": null,
    "sources": [
      "https://docs.mistral.ai/models/#sizes"
    ]
  }
]
Sorry to just barge into this conversation, I have a few pointers that might hopefully be useful.
> I like the idea of having the model_repository in its own git repository with its own release cycle @inimaz.
If we're expecting the database file to update often (e.g. more than once a week), then yes, moving it into a separate repository will help decouple the release cycles of ecologits and ecologits.js from the database file.
> This file can be stored in the Git repo and accessed through the GitHub API (all free of charge at the beginning).
Probably won't need the GitHub API to pull the file - you can pull the raw version of the file from the repo directly like this: models.csv (no cost or login credentials required for this 😌).
> If we go for that solution, I would consider updating the format as well to make it more flexible and probably generate a JSON file.
That'll be awesome because it may make supporting dynamic fields based on the model type easier (4th point in description here).
> Probably won't need the GitHub API to pull the file - you can pull the raw version of the file from the repo directly like this: models.csv (no cost or login credentials required for this 😌).
The idea is to have some kind of API where we can check whether the file has changed before downloading a new version. Like compare the hash of local vs remote and decide if we need to update or not.
> Like compare the hash of local vs remote and decide if we need to update or not.
This is a perfect use case for etag. GitHub sends the etag header in the raw file response, which contains a hash denoting the current file version.
File version check logic might look something like this:
1. Store the current etag hash value locally.
2. Send a HEAD request to GitHub to only get the file response headers without the actual file (sample curl below).
3. Check if the etag header hash values from steps 1 & 2 match; if they match then do nothing, we've got the latest file already.
4. If they don't match, send a GET request to the same file URL and then also update our local etag hash value like in step 1.
Sample curl:
curl -X HEAD -I https://raw.githubusercontent.com/genai-impact/ecologits/main/ecologits/data/models.csv
In the sample curl output you'll see that the current models.csv has the etag currently set to 41a68510227fa2c99cf9d7f6635abd16f4a672e2719ba95eca1b70de5496caf9.
More info on the etag header in the MDN docs here.
There is a tradeoff between fetching a remote file and having a desynchronized copy locally.
Solutions: