ersilia-os / ersilia

The Ersilia Model Hub, a repository of AI/ML models for infectious and neglected disease research.
https://ersilia.io
GNU General Public License v3.0
189 stars 124 forks source link

[🐕 Batch]: Sanitize AirTable in a cron job #768

Open miquelduranfrigola opened 11 months ago

miquelduranfrigola commented 11 months ago

We have started to work on the following class to sanitize AirTable periodically:

https://github.com/ersilia-os/ersilia/blob/master/ersilia/db/hubdata/sanitize.py

This needs to be populated with checks. A related cron job will be:

https://github.com/ersilia-os/ersilia/blob/master/.github/workflows/sanitize-hub.yml

miquelduranfrigola commented 8 months ago

Hello @carcablop - I think you discussed with @GemmaTuron about this?

GemmaTuron commented 8 months ago

Hi @miquelduranfrigola and @carcablop

What we discussed is that the first step will be to create a simple GitHub Action that will periodically convert the Airtable backend into a .csv file - easier to check and manipulate. That is not an urgent task, so we will tackle it when possible

GemmaTuron commented 7 months ago

Update: @carcablop is working on that feature, thanks! 🎉

miquelduranfrigola commented 6 months ago

Hi! May I ask about the current status of this?

miquelduranfrigola commented 6 months ago

Quick update. I have created an "scan models" in the ersilia-stats repository that may be (partially) related to this issue. In brief, this action looks for models that have a repository in the ersilia-os organization profile but, nonetheless, they are not registered in AirTable. For each of these models, an issue is opened with a "maintenance" label. For now, I've put this action in the ersilia-stats repository, but feel free to move it to the ersilia repo if you think it makes more sense.

miquelduranfrigola commented 5 months ago

Hi @carcablop and @GemmaTuron ! Did we make any progress on this front? Let me know if there is anything I can do to help.

miquelduranfrigola commented 5 months ago

Hi @miquelduranfrigola and @carcablop

What we discussed is that the first step will be to create a simple GitHub Action that will periodically convert the Airtable backend into a .csv file - easier to check and manipulate. That is not an urgent task, so we will tackle it when possible

Specifically, I would start with this very simple suggestion by @GemmaTuron

carcablop commented 5 months ago

Hello @miquelduranfrigola. Apologies for not having responded sooner. I have made a little progress on this, I created the YAML file for the workflow and a script that operates converting the backend to a csv file, although I have not tested it. Please allow me to update my changes tomorrow, and we can review that. I will manage it in the ersilia repository. Do you want the csv file to be saved in a specific location? Thank you so much.

miquelduranfrigola commented 5 months ago

Thanks @carcablop

I would save the CSV file in this location: https://github.com/ersilia-os/ersilia/tree/master/ersilia/hub/content , perhaps in a subfolder called data? So: ersilia/hub/content/data/models.csv, for example?

carcablop commented 5 months ago

Thanks @miquelduranfrigola Some changes I made: yml file script I'm not sure if the airtable base ID is correct.

miquelduranfrigola commented 5 months ago

Hello @carcablop , I have been looking into the current models.tsv file being generated and the format is not ideal. It seems AirTable cannot be naturally exported to a tabular format, mostly because of the presence of inner lists.

On a second thought, I feel it would be much better to store it in JSON format, which is way more flexible. So, having a models.json instead of a models.tsv. What do you all think, @DhanshreeA @GemmaTuron @carcablop ?

miquelduranfrigola commented 5 months ago

Hello @DhanshreeA @carcablop and @GemmaTuron

I wonder if, in addition to the cron trigger and the workflow manual dispatch, we could use a webhook to trigger the event: https://airtable.com/developers/web/guides/webhooks-api

This would be ideal since it would allow us to get rid of the AirTable interface with the Ersilia CLI (and the read-only key now exposed in: https://github.com/ersilia-os/ersilia/blob/master/config/read_only_keys.json). In practice, users would only interact with the models.json file.

carcablop commented 5 months ago

Hello @carcablop , I have been looking into the current models.tsv file being generated and the format is not ideal. It seems AirTable cannot be naturally exported to a tabular format, mostly because of the presence of inner lists.

On a second thought, I feel it would be much better to store it in JSON format, which is way more flexible. So, having a models.json instead of a models.tsv. What do you all think, @DhanshreeA @GemmaTuron @carcablop ?

Hi @miquelduranfrigola Yes, you are right about this. Additionally, not all the column keys are being obtained either. I agree with using a JSON format, which is easier to manipulate and transform the data. I can make this change, I have already tried it like this before.

carcablop commented 5 months ago

I made the changes, if you agree to save it in a JSON file, this will work.

GemmaTuron commented 5 months ago

Hello @carcablop @miquelduranfrigola

Many models are missing their related airtable entry - since this is the only recent work we have been doing on airtable I suspect there is a problem with the pipeline - I cannot explain it otherwise. I've noted for example eos9taz, and eos4qda missing. Or is the change in the read only key that has allowed some unwanted access @miquelduranfrigola ? This is critical

GemmaTuron commented 5 months ago

ok, Airtable is playing funny. It now shows all the models again (promise, they were not there) but throws back the error we have on the key deprecation issue: ('401 Client Error: Unauthorized for url: https://api.airtable.com/v0/appgxpCzCDNyGjWc8/Models?pageSize=100&maxRecords=100000', '{\'type\': \'API_KEYS_ARE_DEPRECATED\', \'message\': "Airtable API keys are deprecated and can no longer be used. To continue using Airtable\'s API, migrate to personal access tokens or OAuth for integrations: https://support.airtable.com/docs/airtable-api-key-deprecation-notice."}') I do not understand this seemingly random bug

miquelduranfrigola commented 5 months ago

I have found a solution to this through personal access tokens, @GemmaTuron - I hope it works now.

miquelduranfrigola commented 5 months ago

Hello @carcablop @miquelduranfrigola

Many models are missing their related airtable entry - since this is the only recent work we have been doing on airtable I suspect there is a problem with the pipeline - I cannot explain it otherwise. I've noted for example eos9taz, and eos4qda missing. Or is the change in the read only key that has allowed some unwanted access @miquelduranfrigola ? This is critical

Hey @GemmaTuron thanks. I am at a loss here. In which file they are missing? In the .tsv or the .json file?

GemmaTuron commented 5 months ago

they were not on airtable, promise, in any view or without any filter.

miquelduranfrigola commented 5 months ago

OK they are there now.

DhanshreeA commented 5 months ago

Quick update. I have created an "scan models" in the ersilia-stats repository that may be (partially) related to this issue. In brief, this action looks for models that have a repository in the ersilia-os organization profile but, nonetheless, they are not registered in AirTable. For each of these models, an issue is opened with a "maintenance" label. For now, I've put this action in the ersilia-stats repository, but feel free to move it to the ersilia repo if you think it makes more sense.

Oh I believe some of these model repositories also correspond to dud repositories created as a result of unsuccessful approve command dispatch.

miquelduranfrigola commented 5 months ago

Hi all, since we are doing this within a GitHub Actions, I am thinking that we should be storing the models.json file in AWS S3 instead of GitHub. We have the credentials to write to S3 already as a GitHub secret, so this should not be a big issue. The file would be publicly accessible and equally easy to maintain in S3, I think.

@DhanshreeA do you agree? Any strong reason not to do so?

miquelduranfrigola commented 5 months ago

As for the "interface" to the json file, we could simply have it here: https://github.com/ersilia-os/ersilia/blob/master/ersilia/db/hubdata/interfaces.py

At the moment, we have an AirtableInterface class.

We could write something similar:

class JsonModelsInterface(...):
    def __init__(...):
        ...

    def _read_json_file(...):
        ...

    def items(...):
        for item in data:
            yield item

    def all_items(...):
        return 

We could then use this class as a drop-in replacement for AirtableInterface

miquelduranfrigola commented 4 months ago

Hi @carcablop,

To keep us all on the same page, could you please summarize the next steps based on Friday's meeting?

Thankyou!

carcablop commented 4 months ago

Hi @miquelduranfrigola @DhanshreeA and @GemmaTuron . In summary, to contextualize, starting from what is currently incorporated in ersilia, we have an action that allows us to download Airtable data every night and save it in a tsv file. This task was changed due to the complexity of managing the data in tsv, therefore the following was proposed:

In the last meeting, we discussed some of the problems that we would have when trying to eliminate the dependency of Airtable with Ersilia, and that is that starting from the fact that we have a cron that is updated every night and additionally, data is also uploaded to Airtable, for example when We load a model, we are going to present a synchronization problem in the information since in the automation of the workflow that model will not be found due to the lack of immediate update or synchronization at the same time of update and upload to Airtable and S3. In short, we would be having an information synchronization problem. That is why the following is proposed:

  1. Using webhooks to detect Airtable changes.
  2. Possibility of modifying some scripts where the metadata is updated to Airtable. Possibly consider modifying this script "update_metadata_to_airtable.py" that it inserts metadata at the same time as uploading the information to S3.

For now, the solution of modifying scripts is easier compared to using webhooks.

For now, we will continue working on the following:

miquelduranfrigola commented 4 months ago

Thanks @carcablop , this is very useful. Agree that webhooks is too complicated. Let's not do it.

DhanshreeA commented 4 months ago

Sounds good @carcablop , let me know if I can assist in any way.

GemmaTuron commented 3 months ago

Hi @DhanshreeA @carcablop @miquelduranfrigola

Should we take this up again? The tsv file is still there and gets updated every day making the git pulls and pushes unnecessarily long

carcablop commented 3 months ago

Hi @DhanshreeA @carcablop @miquelduranfrigola

Should we take this up again? The tsv file is still there and gets updated every day making the git pulls and pushes unnecessarily long

Hello @GemmaTuron Apologize for the git pulls and pushes unnecessarily for a long time. With the last PR https://github.com/ersilia-os/ersilia/pull/1102, the models will not be saved in a .tsv file, the models should be uploaded to S3 as a .json file without the need to push changes to the repository. @DhanshreeA @miquelduranfrigola I would like to validate if the data of the models was saved in S3 or if the one that already exists was updated since I cannot validate it because I do not have access. In the last action run yesterday https://github.com/ersilia-os/ersilia/actions/runs/8698881741/job/23856552720 it did not throw any issues and file 'models.json' was uploaded. Thanks you

miquelduranfrigola commented 3 months ago

Thanks @carcablop - @DhanshreeA can you take care of this? Thanks!