genai-impact / ecologits.js


Improve the dependency on the models.csv file from ecologits #4

Open ycouble opened 4 months ago

ycouble commented 4 months ago

There is a tradeoff between fetching a remote file and having a desynchronized local copy.

Solutions:

ycouble commented 4 months ago

@samuelrince Maybe we can discuss this together with @inimaz

samuelrince commented 4 months ago

Yes, good point, and it's the same question in the Python package itself. How can we sync updates of models.csv without needing to release a new version, while also keeping it local-first?

If you have thought of a process or an architecture, we can work on that for both libs.

inimaz commented 4 months ago

It heavily depends on how often models.csv is going to be updated.

Another alternative to the submodule: we could generate the models.csv file per release, i.e. fetch it in the CI, save it to a file, and include it in the package at build time.
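
For illustration, such a build step could be as small as this (a sketch only; the script name and output path are made up, and it assumes Node 18+ where fetch is global):

// scripts/fetch-models.ts (hypothetical): vendor models.csv into the
// package at build time, before publishing.
import { writeFile } from "node:fs/promises";

const MODELS_URL =
  "https://raw.githubusercontent.com/genai-impact/ecologits/main/ecologits/data/models.csv";

async function main(): Promise<void> {
  const res = await fetch(MODELS_URL);
  if (!res.ok) {
    throw new Error(`Failed to fetch models.csv: ${res.status} ${res.statusText}`);
  }
  const csv = await res.text();
  // Written next to the sources so it ships inside the published package
  // (the destination path is an assumption).
  await writeFile("src/data/models.csv", csv, "utf-8");
  console.log(`Vendored models.csv (${csv.length} bytes)`);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});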

ycouble commented 4 months ago

That works if we can trigger a new release for each release of the Python lib.

The drawback of the release approach is that users will have to update their dependencies to get the newest models, versus getting them automatically if the lookup is dynamic.


samuelrince commented 4 months ago

I like the idea of having the model_repository in its own git repository with its own release cycle @inimaz.

For each release of the model_rep we generate a "database file" that contains all the models and hypotheses we used. This file can be stored in the Git repo and accessed through the GitHub API (all free of charge at the beginning).

Each time a new release of a client (ecologits or ecologits.js) happens, we inject the latest version of model_rep in CI. Plus, we can add a mechanism in the clients to check (max once a day?) if a new version of model_rep is available (with an opt-out flag available).

Thus, we can always have the latest version of model_rep without an update of the clients.
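
A rough sketch of what that in-client check could look like (all names here are hypothetical, and it assumes Node 18+ where fetch is global):

// Hypothetical client-side refresh: checks for a newer model_rep at most
// once a day, and never when the user opts out.
interface EcoLogitsOptions {
  disableUpdateCheck?: boolean; // opt-out flag
}

const ONE_DAY_MS = 24 * 60 * 60 * 1000;
const MODEL_REP_URL =
  "https://raw.githubusercontent.com/genai-impact/ecologits/main/ecologits/data/models.csv";

let lastCheck = 0; // in-memory for simplicity; could be persisted on disk

async function maybeRefreshModels(opts: EcoLogitsOptions = {}): Promise<void> {
  if (opts.disableUpdateCheck) return; // user opted out: stay local-first
  if (Date.now() - lastCheck < ONE_DAY_MS) return; // max once a day
  lastCheck = Date.now();
  try {
    const res = await fetch(MODEL_REP_URL);
    if (res.ok) loadModels(await res.text()); // swap in the fresh data
  } catch {
    // network error: silently keep the bundled copy
  }
}

declare function loadModels(csv: string): void; // assumed loader in the client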

This can also help us increase transparency on our hypotheses. E.g. if we change the parameters for gpt-4 over time, users can clearly see when we made the update and why.

If we go for that solution, I would consider updating the format as well to make it more flexible and probably generate a JSON file.

Example format (to challenge):

[
    {
        "type": "model",
        "provider": "openai",
        "name": "gpt-3.5-turbo-01-0125",
        "architecture": {
            "type": "dense",
            "parameters": {
                "min": 20,
                "max": 70
            }
        },
        "warnings": [
            "model_achitecture_not_released"
        ],
        "sources": [
            "https://platform.openai.com/docs/models/gpt-3-5-turbo",
            "https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit"
        ]
    },
    {
        "type": "alias",
        "provider": "openai",
        "name": "gpt-3.5-turbo",
        "alias": "gpt-3.5-turbo-01-0125"
    },
    {
        "type": "model",
        "provider": "openai",
        "name": "gpt-4-turbo-2024-04-09",
        "architecture": {
            "type": "moe",
            "parameters": {
                "total": 880,
                "active": {
                    "min": 110,
                    "max": 440
                }
            }
        },
        "warnings": [
            "model_achitecture_not_released",
            "model_achitecture_multimodal"
        ],
        "sources": [
            "https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4",
            "https://docs.google.com/spreadsheets/d/1O5KVQW1Hx5ZAkcg8AIRjbQLQzx2wVaLl0SqUu-ir9Fs/edit"
        ]
    },
    {
        "type": "model",
        "provider": "mistralai",
        "name": "open-mistral-7b",
        "architecture": {
            "type": "dense",
            "parameters": 7.3
        },
        "warnings": null,
        "sources": [
            "https://docs.mistral.ai/models/#sizes "
        ]
    }
]
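
For what it's worth, the shape above maps cleanly onto a discriminated union keyed on the "type" field. A TypeScript sketch of matching types, derived only from the example entries, so purely illustrative:

// Types mirroring the proposed JSON format; "type" discriminates
// between concrete models and aliases.
type ParameterCount = number | { min: number; max: number };

interface DenseArchitecture {
  type: "dense";
  parameters: ParameterCount;
}

interface MoeArchitecture {
  type: "moe";
  parameters: { total: number; active: ParameterCount };
}

interface ModelEntry {
  type: "model";
  provider: string;
  name: string;
  architecture: DenseArchitecture | MoeArchitecture;
  warnings: string[] | null;
  sources: string[];
}

interface AliasEntry {
  type: "alias";
  provider: string;
  name: string;
  alias: string; // name of the model entry it resolves to
}

type RepositoryEntry = ModelEntry | AliasEntry;
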
omkar-foss commented 2 months ago

Sorry to just barge into this conversation; I have a few pointers that might hopefully be useful.

I like the idea of having the model_repository in its own git repository with its own release cycle @inimaz.

If we're expecting the database file to update often (e.g. more than once a week), then yes, moving it into a separate repository will help decouple the release cycles of ecologits and ecologits.js from the database file.

This file can be stored in the Git repo and accessed through GitHub API. (all free of charge at the beginning)

Probably won't need the GitHub API to pull the file - you can pull the raw version of the file from the repo directly (e.g. https://raw.githubusercontent.com/genai-impact/ecologits/main/ecologits/data/models.csv - no cost or login credentials required for this 😌).

If we go for that solution, I would consider updating the format as well to make it more flexible and probably generate a JSON file.

That'll be awesome because it may make supporting dynamic fields based on the model type easier (4th point in description here).

samuelrince commented 2 months ago

Probably won't need GitHub API to pull the file - you can pull the raw version of the file from the repo directly like this: models.csv (no cost or login credentials required for this 😌).

The idea is to have some kind of API where we can check if the file has changed or not before downloading a new version. Like comparing the hash of local vs remote and deciding if we need to update or not.

omkar-foss commented 2 months ago

Like compare the hash of local vs remote and decide if we need to update or not.

This is a perfect use case for etag. GitHub sends the etag header in the raw file response, which contains a hash denoting the current file version.

File version check logic might look something like this:

  1. On first load, we download the file and save its etag hash value.
  2. We make a HEAD request to GitHub to only get the file response headers without the actual file (sample curl below).
  3. Then we check if the etag header values from steps 1 & 2 match; if they do, we do nothing, since we already have the latest file.
  4. If they don't match, it indicates a new file version: we download it with a GET request to the same file URL and update our local etag hash value as in step 1.

Sample curl:

curl -I https://raw.githubusercontent.com/genai-impact/ecologits/main/ecologits/data/models.csv

In the sample curl output you'll see that the current models.csv has its etag set to 41a68510227fa2c99cf9d7f6635abd16f4a672e2719ba95eca1b70de5496caf9.
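
Translated to the clients, the whole check could look roughly like this (a sketch only; persisting the local etag between runs is left out):

// Sketch of the etag-based version check: HEAD first, full GET only when
// the remote etag differs from the one stored at the previous download.
const FILE_URL =
  "https://raw.githubusercontent.com/genai-impact/ecologits/main/ecologits/data/models.csv";

type CheckResult =
  | { changed: false }
  | { changed: true; etag: string; body: string };

async function fetchIfChanged(localEtag: string | null): Promise<CheckResult> {
  // Step 2: HEAD request returns the headers only, not the file itself.
  const head = await fetch(FILE_URL, { method: "HEAD" });
  const remoteEtag = head.headers.get("etag");

  // Step 3: etags match, so we already have the latest file.
  if (remoteEtag !== null && remoteEtag === localEtag) {
    return { changed: false };
  }

  // Step 4: etags differ (or we have no local etag yet): download the file
  // and hand the new etag back to the caller to store (step 1).
  const res = await fetch(FILE_URL);
  if (!res.ok) throw new Error(`Download failed: ${res.status}`);
  return { changed: true, etag: res.headers.get("etag") ?? "", body: await res.text() };
}

As a variation, a conditional GET with an If-None-Match: <etag> request header would collapse the HEAD + GET pair into a single request, since the server replies 304 Not Modified when the file is unchanged (assuming GitHub's raw endpoint honours it).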

More info on the ETag header in the MDN docs here.