gbv / cocoda-mappings

concordances, mappings and conversion scripts to create JSKOS mappings
https://coli-conc.gbv.de/concordances/

Include total size of domains #3

Open nichtich opened 7 years ago

nichtich commented 7 years ago

E.g. P2428 has the domain human, so add the total number of humans for comparison.

jneubert commented 7 years ago

RePEc Short-ID is a very nice and convincing case for domains. I've done a bit of research on this and came across several obstacles for other cases:

  1. Often, you do not have a single class (P31 Q5 for humans), but need to include instances of all subclasses. I did that for geographic location, and it took more than an hour on a well-powered custom endpoint, so I'd see no chance to do that on the public Wikidata endpoint with its 60-second timeout. Since SQID has computed all these statistics, the numbers could perhaps be taken from there, but I've found no API doc for that.
  2. Sometimes, the domain in Wikidata can only be determined negatively. E.g., the domain of abstract concepts, like "financial crisis", could be computed very approximately as "everything which is not a human nor a geographic location nor an organization nor a chemical substance nor a creative work nor a taxon nor ...". The messiness of Wikidata's class hierarchy makes this even more unpleasant.
  3. Sometimes, in order to make a meaningful statement, it would be much better to relate to some subdomain. E.g. the number of RePEc authors in Wikidata is a very small fraction of all humans represented there (and that will not change much, even if the mapping is improved tremendously). It would be more meaningful to relate that to an estimated number of economists in Wikidata (which cannot be computed strictly, because not all economists have an occupation property).
  4. On the other side of the equation: Sometimes, a KOS is not about one, but about several domains (e.g. VIAF or GND, with persons, institutions, and more, or STW, which has a geographic part as well as parts about abstract concepts). Additionally, these distinctions may not be really clear-cut on the KOS side.
jneubert commented 7 years ago

Thus, the domain is meaningful in two different ways:

a) as a formal restriction (e.g., you want to focus on GND persons only, which means that we do not have P227 as such, but P227 for instances of human; this relates to case 4 above). Since there are many variations of this kind of restriction, depending on the property/KOS, I suppose it should be considered "out of scope" here, or implemented as an extension for a few selected and manually configured use cases.

b) as a general background for comparisons (e.g., in Venn diagrams) in order to give an idea of "how much is covered". Perhaps it would make sense to have a separate data structure for "basic sets" (with title, size and description), and another data structure "base2prop" to relate these "basic sets" to properties or intersections of properties. This could be extended via pull requests by anybody who is interested. The "size" could be just an arbitrary estimate, or it could optionally be expressed as a SPARQL query which computes the estimate as a Wikidata query result (e.g. twice the number of humans with occupation "economist", rounded to full thousands), in order to keep up with the growth of Wikidata. Unfortunately, due to a), all of this would leave out a lot of interesting use cases ...
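The two data structures proposed in b) could be sketched roughly as follows. This is only an illustration, not an agreed format: the field names, the `scaled_estimate` helper, the dictionary keys, and the use of wd:Q188094 for "economist" are assumptions; P2428 (RePEc Short-ID), P106 (occupation) and Q5 (human) are the Wikidata identifiers referred to in this thread.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class BasicSet:
    """A background set used to show "how much is covered"."""
    title: str
    description: str
    size: Optional[int] = None        # fixed, hand-made estimate (optional)
    size_query: Optional[str] = None  # SPARQL query yielding a raw count (optional)


def scaled_estimate(raw_count: int, factor: float = 2.0, round_to: int = 1000) -> int:
    """Scale a raw count and round it, e.g. "twice the count, to full thousands"."""
    return int(round(raw_count * factor / round_to)) * round_to


# Hypothetical registry; entries would be added via pull requests.
basic_sets: Dict[str, BasicSet] = {
    "economists": BasicSet(
        title="Economists in Wikidata (estimate)",
        description="Humans with occupation 'economist', scaled up because "
                    "not every economist has an occupation statement.",
        # Assumption: wd:Q188094 is the item for the occupation "economist".
        size_query="SELECT (COUNT(?x) AS ?c) "
                   "{ ?x wdt:P31 wd:Q5 ; wdt:P106 wd:Q188094 }",
    ),
}

# "base2prop": each basic set relates to properties or intersections of
# properties; a tuple with several property IDs stands for an intersection.
base2prop: Dict[str, List[Tuple[str, ...]]] = {
    "economists": [("P2428",)],  # RePEc Short-ID
}
```

With such a registry, a raw query result of, say, 12345 would be turned into a displayed size via `scaled_estimate(12345)`, while sets without a query would simply carry a fixed `size`.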

nichtich commented 7 years ago

I just reread my original statement: the "total number of humans" is 7.5 billion, which is not the number to compare with. To answer:

1) Total numbers can be counted with a query such as SELECT (COUNT(?x) as ?c) { ?x wdt:P31/wdt:P279* wd:Q2221906 } without running into the timeout.
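For running such a count from a script, a minimal sketch could look like the following. The function names and the User-Agent string are made up for illustration; the endpoint URL is the public Wikidata Query Service, and the result parsing assumes the standard SPARQL JSON results layout.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata endpoint


def class_count_query(class_qid: str) -> str:
    """Count instances of a class, including instances of all its subclasses."""
    return (
        "SELECT (COUNT(?x) AS ?c) "
        f"{{ ?x wdt:P31/wdt:P279* wd:{class_qid} }}"
    )


def parse_count(results: dict) -> int:
    """Extract ?c from a SPARQL JSON result document."""
    return int(results["results"]["bindings"][0]["c"]["value"])


def run_count(class_qid: str, endpoint: str = WDQS_ENDPOINT, timeout: int = 60) -> int:
    """Send the count query to the endpoint (needs network access)."""
    params = urllib.parse.urlencode(
        {"query": class_count_query(class_qid), "format": "json"}
    )
    request = urllib.request.Request(
        endpoint + "?" + params,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "domain-size-sketch/0.1 (example script)",  # placeholder
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_count(json.load(response))
```

Calling `run_count("Q2221906")` would then return the current number of geographic locations; whether other, larger class trees also stay under the timeout would have to be tried case by case.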

2) Yes, so let's start with easy cases

3) Same as 2); such numbers are better provided as an educated guess.

4a) and 4b) relate to indirect mappings only. For direct mappings (Wikidata-to-KOS) there are always two numbers.

I'd start with the size of the full KOS and with mapping candidates in Wikidata expressible as a SPARQL query, because both can be queried from Wikidata. See https://www.wikidata.org/wiki/Q51044 and property quantity (P1114) for an example.
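Reading the KOS size from the quantity (P1114) statement of its Wikidata item might be sketched like this. The helper names are hypothetical, and the sketch assumes the value arrives as a plain decimal literal in a standard SPARQL JSON result; it only builds the query and parses a result, sending it to the endpoint is left out.

```python
from decimal import Decimal
from typing import Optional


def kos_size_query(kos_qid: str) -> str:
    """Ask for the quantity (P1114) stated on the KOS item."""
    return f"SELECT ?size {{ wd:{kos_qid} wdt:P1114 ?size }}"


def parse_quantity(results: dict) -> Optional[int]:
    """Return the stated quantity, or None if the item has no P1114 statement."""
    bindings = results["results"]["bindings"]
    if not bindings:
        return None
    return int(Decimal(bindings[0]["size"]["value"]))


# Query for the item linked above.
print(kos_size_query("Q51044"))
```

Items without a quantity statement yield an empty result, which is why `parse_quantity` returns `None` in that case instead of failing.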

jneubert commented 7 years ago

Completely agree with your action plan. Wow, the geographic location query took less than 10 seconds. Amazing that Blazegraph is so much more optimized (perhaps using some cached statistics) than Fuseki. Re. 4 (partial KOS is only relevant for indirect mappings): I suppose you misunderstood my intent. I think there is value in comparing the amount of, e.g., gndo:DifferentiatedPersons to wd:Q5 instances, or of gndo:CorporateBody to Q43229 instances. But that cannot be attached to an item in WD as a quantity, an approach which is elegant and brings a huge advantage over custom config files. So let's start with the easy cases.