datagouv / hydra

Async metadata crawler for data.gouv.fr

hydra: Manage large resources exceptions differently #114

Open geoffreyaldebert opened 1 month ago

geoffreyaldebert commented 1 month ago

Today, we manage resource_id exceptions through a config file.

I propose managing them from a Postgres table instead.

These kinds of tables are heavy, so we want to optimize future queries against them, which means adding indexes. We can record the indexes in the table tables_index in hydra-hydra-csv, with a new column indexes that is empty for classical resources and holds a list of column names for heavy files.
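A minimal sketch of how the indexes list could be turned into Postgres statements once a heavy file has been loaded. The helper names and the index naming scheme are assumptions for illustration, not existing hydra code; an empty or missing list (a classical resource) produces no statements.

```python
from typing import List, Optional, Sequence


def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_index_statements(
    table_name: str, indexes: Optional[Sequence[str]]
) -> List[str]:
    """Build one CREATE INDEX statement per column listed in `indexes`.

    `table_name` is the table holding the parsed resource (hypothetical name);
    `indexes` is the value of the proposed `indexes` column in tables_index.
    """
    if not indexes:
        # Classical resource: nothing to index.
        return []
    statements = []
    for column in indexes:
        # Hypothetical naming scheme: <table>_<column>_idx
        index_name = f"{table_name}_{column}_idx"
        statements.append(
            f"CREATE INDEX IF NOT EXISTS {quote_ident(index_name)} "
            f"ON {quote_ident(table_name)} ({quote_ident(column)})"
        )
    return statements


if __name__ == "__main__":
    for statement in build_index_statements("my_table", ["COL1", "COL32"]):
        print(statement)
```

Quoting the identifiers matters here because CSV column names are often uppercase or contain spaces, which Postgres would otherwise fold to lowercase or reject.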

It would be very nice to be able to add an exception through a POST request to hydra, for instance:

POST https://crawler.data.gouv.fr/api/add-exception
{
    "resource_id" : "XXXX",
    "indexes": ["COL1", "COL32", "COL43"]
}
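
On the server side, the body of such a request would need to be checked before anything is written to the table. A possible sketch of that validation, using only the standard library; the function name and the error messages are assumptions, and the route handler that would call it is left out.

```python
import json
from typing import List, Tuple


def parse_exception_payload(raw_body: str) -> Tuple[str, List[str]]:
    """Validate the JSON body of the proposed add-exception request.

    Returns (resource_id, indexes). Raises ValueError on malformed input
    (json.JSONDecodeError is itself a subclass of ValueError).
    """
    payload = json.loads(raw_body)
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")

    resource_id = payload.get("resource_id")
    if not isinstance(resource_id, str) or not resource_id:
        raise ValueError("resource_id must be a non-empty string")

    # Assumption: a missing "indexes" key means no index is requested.
    indexes = payload.get("indexes", [])
    if not isinstance(indexes, list) or not all(
        isinstance(column, str) for column in indexes
    ):
        raise ValueError("indexes must be a list of column names")

    return resource_id, indexes


if __name__ == "__main__":
    body = '{"resource_id": "XXXX", "indexes": ["COL1", "COL32", "COL43"]}'
    print(parse_exception_payload(body))
```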

bolinocroustibat commented 3 weeks ago

Note: this should replace config.LARGE_RESOURCES_EXCEPTIONS

bolinocroustibat commented 2 weeks ago

PR: https://github.com/datagouv/hydra/pull/148

bolinocroustibat commented 2 weeks ago

Question: would it be more performant to check whether the resource id is in the table with a SQL query (using a SQL function), instead of fetching all the exceptions and doing the comparison in Python?
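
To make the two options concrete, here is a small sketch. It uses the standard-library sqlite3 module as a stand-in for Postgres so that it runs anywhere, and the table name resources_exceptions is an assumption. The first function fetches every id and compares in Python; the second asks the database with EXISTS and only receives a boolean back.

```python
import sqlite3

# In-memory stand-in for the real Postgres database (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources_exceptions (resource_id TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO resources_exceptions (resource_id) VALUES (?)",
    [("aaa",), ("bbb",)],
)


def is_exception_python(conn: sqlite3.Connection, resource_id: str) -> bool:
    """Fetch all exception ids, then do the membership test in Python."""
    rows = conn.execute("SELECT resource_id FROM resources_exceptions").fetchall()
    return resource_id in {row[0] for row in rows}


def is_exception_sql(conn: sqlite3.Connection, resource_id: str) -> bool:
    """Let the database answer: a single row holding 0 or 1 comes back."""
    row = conn.execute(
        "SELECT EXISTS("
        "SELECT 1 FROM resources_exceptions WHERE resource_id = ?"
        ")",
        (resource_id,),
    ).fetchone()
    return bool(row[0])


if __name__ == "__main__":
    print(is_exception_python(conn, "aaa"), is_exception_sql(conn, "aaa"))
    print(is_exception_python(conn, "zzz"), is_exception_sql(conn, "zzz"))
```

With a primary key or index on resource_id, the EXISTS form can be answered by an index lookup and transfers a single value, whereas the Python form transfers every row on each check. For a table with only a handful of exceptions the difference is probably small, so it would be worth measuring on real data.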