Open PGijsbers opened 7 hours ago
The number of duplicate registrations may be relatively low, using the publication date as a proxy:
+--------------------------------+
| count(distinct date_published) |
+--------------------------------+
|                         237955 |
+--------------------------------+
1 row in set (2.12 sec)
mysql> select count(*) from dataset where platform="huggingface";
+----------+
| count(*) |
+----------+
|   250897 |
+----------+
1 row in set (0.24 sec)
We should also check whether other assumptions may be incorrect, e.g. can a user also change the name of the dataset? We should likewise challenge the assumption that the publication date remains static throughout the lifetime of the dataset. If that assumption does hold, though, it puts an upper bound on erroneous double indexing of ~13k (250897 - 237955 = 12942), regardless of the source of the duplication.
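To make the bound concrete, here is a minimal Python sketch of the arithmetic, plus a helper that lists candidate duplicates by grouping on publication date. It assumes the publication date is static and that real datasets rarely share one. The function names and the `(identifier, date_published)` record shape are hypothetical and not taken from the AIoD code base.

```python
from collections import defaultdict


def duplicate_upper_bound(total_rows: int, distinct_dates: int) -> int:
    """Upper bound on duplicate rows, assuming duplicates share a publication date."""
    return total_rows - distinct_dates


def candidate_duplicates(records):
    """Group (identifier, date_published) pairs and keep dates seen more than once.

    The record shape is an assumption for illustration only.
    """
    by_date = defaultdict(list)
    for identifier, date_published in records:
        by_date[date_published].append(identifier)
    return {date: ids for date, ids in by_date.items() if len(ids) > 1}


# Counts copied from the mysql output above.
print(duplicate_upper_bound(250_897, 237_955))  # 12942
```

Entries returned by `candidate_duplicates` are only candidates: two unrelated datasets can share a publication date, so each group would still need a check against the source platform.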
E.g. both links below resolve to the NaiveDev version of the dataset:
I think this is because the user changed their name (aha-org doesn't exist anymore). Yet in AIoD, metadata for each is registered independently, and so they are counted as two datasets. I feel like this is not what we want, and we should update the connector to reflect that. Logging this here, with the intention to discuss this Thursday Nov 7.
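As input for that discussion, one possible direction is to group registered identifiers by the identifier the platform currently resolves them to. The Python sketch below only illustrates the grouping step. `resolve` is a hypothetical callable standing in for whatever lookup the connector would use, and the dataset name `example-dataset` is made up; only the `aha-org` and `NaiveDev` names come from the case above.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_canonical_id(
    registered_ids: Iterable[str],
    resolve: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Map each canonical identifier to all registered identifiers that resolve to it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for registered_id in registered_ids:
        groups[resolve(registered_id)].append(registered_id)
    return dict(groups)


# Hypothetical stand-in for a real lookup: the old name redirects to the new one.
redirects = {"aha-org/example-dataset": "NaiveDev/example-dataset"}


def fake_resolve(identifier: str) -> str:
    return redirects.get(identifier, identifier)


groups = group_by_canonical_id(
    ["aha-org/example-dataset", "NaiveDev/example-dataset"], fake_resolve
)
# Any canonical id with more than one registered id is a double registration.
duplicates = {cid: ids for cid, ids in groups.items() if len(ids) > 1}
print(duplicates)
```

Passing the resolver in as a parameter keeps the grouping logic testable without network access. How the connector would actually obtain the canonical identifier is left open here.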