clld / glottolog3

glottolog2 re-implemented as CLLD app
MIT License
110 stars 27 forks source link

Tighten uniqueness of refprovider join-table? #91

Closed xflr6 closed 1 year ago

xflr6 commented 6 years ago

Not sure whether we intend to include the id column (which is always unique) within the uniqueness here: https://github.com/clld/glottolog3/blob/c0ed353dacfb62cd42bf779c0874878a023dfeed/glottolog3/models.py#L126-L130 Looks like this in the table (use the id column as another scrtrickle?):

SELECT ref_pk, provider_pk, id
FROM refprovider AS rp
WHERE EXISTS (SELECT 1 FROM refprovider
  WHERE ref_pk = rp.ref_pk AND provider_pk = rp.provider_pk AND id != rp.id)
ORDER BY ref_pk, provider_pk, id
ref_pk |provider_pk |id                         |
-------|------------|---------------------------|
50     |31          |silpng:silpng1991-2000:462 |
50     |31          |silpng:silpng1991-2000:463 |
187    |31          |silpng:silpng1956-1990:805 |
187    |31          |silpng:silpng1956-1990:806 |
187    |31          |silpng:silpng1956-1990:807 |
187    |31          |silpng:silpng1956-1990:808 |
187    |31          |silpng:silpng1956-1990:809 |
187    |31          |silpng:silpng1956-1990:810 |
187    |31          |silpng:silpng1956-1990:811 |
187    |31          |silpng:silpng1956-1990:812 |

A small number of refs gets many provider links this way:

SELECT nprov, count(*)
FROM (
  SELECT count(*) AS nprov
  FROM refprovider
  GROUP BY ref_pk
) AS stats
GROUP BY nprov
ORDER BY nprov
nprov |count  |
------|-------|
1     |272485 |
2     |34406  |
3     |8556   |
4     |3213   |
5     |1079   |
6     |397    |
7     |174    |
8     |67     |
9     |31     |
10    |29     |
11    |16     |
12    |8      |
13    |17     |
14    |10     |
15    |7      |
16    |8      |
17    |6      |
18    |10     |
19    |3      |
20    |7      |
21    |1      |
22    |2      |
23    |2      |
24    |5      |
27    |2      |
29    |1      |
30    |3      |
31    |1      |
33    |1      |
36    |1      |
40    |1      |
43    |1      |
44    |1      |
48    |1      |
49    |2      |
58    |1      |
60    |1      |
72    |1      |
93    |1      |
175   |1      |
xrotwang commented 2 years ago

The core problem seems to be incomplete bibliographical metadata, which tricks the merging algo into thinking a lot of refs are the same, e.g.

@book{silpng1991-2000:462,
    author = {___ ___ and trans. ___ and trans.},
    title = {I-unöý öýgöbin [preschool primer 3,4]},
    pages = {42},
    year = {1994},
    glottolog_ref_id = {158690},
    lgcode = {Bunama [bdd]},
    macro_area = {Papua}
}
@book{silpng1991-2000:463,
    author = {___ ___ and trans. ___ and trans.},
    title = {I-unöý öýgöbin [preschool primer 5]},
    pages = {32},
    year = {1994},
    glottolog_ref_id = {158690},
    lgcode = {Bunama [bdd]},
    macro_area = {Papua}
}

Now if we want to be able to resolve all local bibkeys (even multiple ones for the same provider and the same merged ref), we have to keep the unique constraint as loose as it is, it seems.

xrotwang commented 2 years ago

AFAICT, at least the web app does the right thing here: https://glottolog.org/resource/reference/id/158690