aiondemand / AIOD-rest-api

Services for the core of AIoD: Authentication and the metadata catalogue with REST API.
https://api.aiod.eu
MIT License

Multiple HuggingFace identifiers can point to the same dataset #385

Open PGijsbers opened 7 hours ago

PGijsbers commented 7 hours ago

E.g. both links below resolve to the NaiveDev version of the dataset:

I think this is because the user changed their name (aha-org no longer exists). In AIoD, however, metadata for each identifier is registered independently, so they are counted as two datasets. I don't think this is what we want, and we should update the connector to account for it. Logging this here with the intention of discussing it on Thursday, Nov 7.
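To make the problem concrete, here is a minimal sketch of how registered identifiers could be grouped by the identifier they currently resolve to. It is not the connector's actual code: `resolve`, `fake_resolve`, the redirect table, and the dataset name `example-dataset` are all hypothetical stand-ins, and a real version would need some lookup against HuggingFace to learn where an old identifier points.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_canonical_id(
    identifiers: Iterable[str],
    resolve: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Group registered identifiers by the identifier they currently resolve to."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for identifier in identifiers:
        groups[resolve(identifier)].append(identifier)
    return dict(groups)


def find_aliases(groups: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only canonical identifiers that are reached by more than one registration."""
    return {canonical: ids for canonical, ids in groups.items() if len(ids) > 1}


# Hypothetical redirect table standing in for a real lookup (assumption, not real data).
_REDIRECTS = {"aha-org/example-dataset": "NaiveDev/example-dataset"}


def fake_resolve(identifier: str) -> str:
    """Return the current identifier for a possibly renamed one (illustrative only)."""
    return _REDIRECTS.get(identifier, identifier)


registered = [
    "aha-org/example-dataset",
    "NaiveDev/example-dataset",
    "other-user/unrelated",
]
aliases = find_aliases(group_by_canonical_id(registered, fake_resolve))
print(aliases)
```

With a grouping like this, the connector could register one asset per canonical identifier and treat the remaining identifiers as aliases, though whether that is the right policy is exactly what needs discussing.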

PGijsbers commented 7 hours ago

Using the publication date as a proxy, the number of duplicate registrations appears to be relatively low:

+--------------------------------+
| count(distinct date_published) |
+--------------------------------+
|                         237955 |
+--------------------------------+
1 row in set (2.12 sec)

mysql> select count(*) from dataset where platform="huggingface";
+----------+
| count(*) |
+----------+
|   250897 |
+----------+
1 row in set (0.24 sec)

We should also check whether other assumptions are incorrect, e.g. can a user also change the name of the dataset? We should likewise challenge the assumption that the publication date remains static throughout the lifetime of a dataset. If we assume it does, though, the difference between the two counts (250897 - 237955 = 12942) puts an upper bound of ~13k on erroneous double indexing, regardless of the source of the duplication.
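The arithmetic behind that bound, and a way to list which publication dates are shared, can be sketched as follows. This is only an illustration under the same assumption (publication date is static and set for every row); the function names and the sample dates are made up, and only the two counts come from the queries above.

```python
from collections import Counter
from typing import Dict, Iterable


def duplicate_upper_bound(total_rows: int, distinct_dates: int) -> int:
    """Rows beyond one per distinct publication date: the most that can be duplicates."""
    return total_rows - distinct_dates


def shared_publication_dates(dates: Iterable[str]) -> Dict[str, int]:
    """Return publication dates that occur more than once, with their row counts."""
    counts = Counter(dates)
    return {date: n for date, n in counts.items() if n > 1}


# Counts taken from the two queries in this thread.
bound = duplicate_upper_bound(total_rows=250897, distinct_dates=237955)
print(bound)

# Made-up sample dates (assumption) to show how candidate groups would be found.
sample_dates = [
    "2023-01-05T10:00:00",
    "2023-01-05T10:00:00",
    "2023-02-11T08:30:00",
]
shared = shared_publication_dates(sample_dates)
print(shared)
```

Rows that share a date are only candidates: two unrelated datasets can be published at the same moment, which is why the figure is an upper bound rather than an estimate.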