Wikidata / soweego

Link Wikidata items to large catalogs
https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
GNU General Public License v3.0

Find or propose a Wikidata property for confidence scores #220

Open marfox opened 5 years ago

marfox commented 5 years ago

For probabilistic output, it would be optimal for the bot to add a qualifier with a float value, representing the confidence score of a given statement.
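
As a rough illustration of what such a qualifier could look like, the sketch below builds a quantity-valued qualifier snak in the Wikibase JSON layout. The property ID `P0000` is a placeholder, since no confidence property exists yet, and `confidence_qualifier` is a name made up for this example, not part of soweego.

```python
import json

# Placeholder ID: no confidence score property exists yet.
CONFIDENCE_PID = "P0000"


def confidence_qualifier(score: float, pid: str = CONFIDENCE_PID) -> dict:
    """Build a qualifier snak holding a unitless quantity."""
    return {
        "snaktype": "value",
        "property": pid,
        "datatype": "quantity",
        "datavalue": {
            "type": "quantity",
            # Wikibase serializes amounts as signed decimal strings;
            # the unit "1" stands for "no unit".
            "value": {"amount": f"{score:+.4f}", "unit": "1"},
        },
    }


if __name__ == "__main__":
    print(json.dumps(confidence_qualifier(0.8731), indent=2))
```

Rounding to four decimals is an arbitrary choice here; the precision actually worth publishing depends on the classifier.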

fracorco commented 5 years ago

To assess the possibility of reusing an existing qualifier property, I went through all of them using this SPARQL query:

select *
where {
  ?p wdt:P31 wd:Q15720608;
     rdfs:label ?l;
     schema:description ?d.
  filter (lang(?l) = "en")
  filter (lang(?d) = "en")
  optional { ?p wikibase:propertyType ?t }
}
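
For anyone who wants to rerun this outside the query service web page, here is a minimal sketch of how the request URL could be built and how the JSON response could be flattened into rows. It assumes the public Wikidata Query Service endpoint and the standard SPARQL JSON results layout; the helper names and the `SAMPLE` response are hand-written for illustration and are not real query output.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_request_url(query: str, endpoint: str = WDQS_ENDPOINT) -> str:
    """Return a GET URL asking the endpoint for JSON results."""
    return f"{endpoint}?{urlencode({'query': query, 'format': 'json'})}"


def rows_from_bindings(result: dict) -> list:
    """Flatten SPARQL JSON results into one plain dict per row."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


# Hand-written sample in the SPARQL JSON results layout (not real output).
SAMPLE = {
    "head": {"vars": ["p", "l", "t"]},
    "results": {
        "bindings": [
            {
                "p": {"type": "uri", "value": "http://www.wikidata.org/entity/P1107"},
                "l": {"type": "literal", "xml:lang": "en", "value": "proportion"},
                "t": {"type": "uri", "value": "http://wikiba.se/ontology#Quantity"},
            }
        ]
    },
}

if __name__ == "__main__":
    print(build_request_url("select * where { ?s ?p ?o } limit 1"))
    print(rows_from_bindings(SAMPLE))
```

Optional variables such as `?t` are simply absent from a binding when unbound, so rows may have different keys.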

Relevant properties are:

| id | label | description | type |
| --- | --- | --- | --- |
| wd:P1107 | proportion | to be used as a qualifier, value must be between 0 and 1 | wikibase:Quantity |
| wd:P4271 | rating | qualifier to indicate a score given by the referenced source indicating the quality or completeness of the statement | wikibase:WikibaseItem |
| wd:P1480 | sourcing circumstances | qualification of the truth or accuracy of a source: circa (Q5727902), near (Q21818619), presumably (Q18122778), etc. | wikibase:WikibaseItem |
| wd:P2571 | uncertainty corresponds to | number of standard deviations (sigma) expressing the confidence level of a value | wikibase:WikibaseItem |

For these qualifiers, I extracted the number of triples, the properties they are most frequently used with, and the values they most frequently take, using the following query (replace the three occurrences of P1107 with each of the properties above in turn):

select ?triples ?properties ?values {
  { select (count(*) as ?triples) { ?s pq:P1107 ?o } }
  { select (group_concat(?v; separator="; ") as ?properties) {
      {
        select ?p ?l (count(*) as ?n) {
          ?e ?p ?s . ?s pq:P1107 ?o .
          optional { ?pe wikibase:claim ?p ; rdfs:label ?l filter(lang(?l) = "en") }
        }
        group by ?p ?l order by desc(?n) limit 10
      }
      bind (concat(strafter(str(?p), "http://www.wikidata.org/prop/"),
            " (", ?l, " - ", str(?n), ")") as ?v)
    }
  }
  { select (group_concat(?v; separator="; ") as ?values) {
      {
        select ?o ?l (count(*) as ?n) {
          ?s pq:P1107 ?o .
          optional { ?o rdfs:label ?l filter(lang(?l) = "en") }
        }
        group by ?o ?l order by desc(?n) limit 10
      }
      bind (concat(coalesce(?l, str(?o)), " (", str(?n), ")") as ?v)
    }
  }
}
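
The manual replacement step can be scripted. Below is a minimal sketch, assuming the query is held as a string: `instantiate` and `SKELETON` are hypothetical names, and the skeleton is an abbreviated stand-in that keeps only the three patterns mentioning the qualifier. The occurrence check guards against a property ID being left behind in one of the subqueries.

```python
def instantiate(query: str, pid: str, placeholder: str = "P1107",
                expected: int = 3) -> str:
    """Swap the placeholder qualifier for `pid`, checking the occurrence count."""
    found = query.count(f"pq:{placeholder}")
    if found != expected:
        raise ValueError(
            f"expected {expected} occurrences of pq:{placeholder}, found {found}"
        )
    return query.replace(f"pq:{placeholder}", f"pq:{pid}")


# Abbreviated stand-in for the full query (illustration only).
SKELETON = """\
{ select (count(*) as ?triples) { ?s pq:P1107 ?o } }
{ select ?p (count(*) as ?n) { ?e ?p ?s . ?s pq:P1107 ?o } group by ?p }
{ select ?o (count(*) as ?n) { ?s pq:P1107 ?o } group by ?o }
"""

QUALIFIERS = ["P1107", "P4271", "P1480", "P2571"]

if __name__ == "__main__":
    for pid in QUALIFIERS:
        print(instantiate(SKELETON, pid))
```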

| qualifier | triples | properties | values |
| --- | --- | --- | --- |
| pq:P1107 | 9448 | P1344 (participant of - 1); P444 (review score - 3); P3357 (negative diagnostic predictor - 6); P3358 (positive prognostic predictor - 76); P3356 (positive diagnostic predictor - 95); P3359 (negative prognostic predictor - 244); P3355 (negative therapeutic predictor - 619); P3354 (positive therapeutic predictor - 1016) | 0.8 (56); 0.3 (57); 0.9 (65); 0.4 (67); 100 (68); 0.2 (83); 0.25 (88); 0.1 (97); 0.5 (253); 1 (3916) |
| pq:P4271 | 2060 | P1344 (participant of - 1); P444 (review score - 3); P3357 (negative diagnostic predictor - 6); P3358 (positive prognostic predictor - 76); P3356 (positive diagnostic predictor - 95); P3359 (negative prognostic predictor - 244); P3355 (negative therapeutic predictor - 619); P3354 (positive therapeutic predictor - 1016) | UEFA stadium categories (1); D (1); Charity Navigator four-star rating (2); CIViC 1-star trust rating (96); CIViC 5-star trust rating (104); CIViC 4-star trust rating (469); CIViC 2-star trust rating (497); CIViC 3-star trust rating (890) |
| pq:P1480 | 63354 | P19 (place of birth - 329); P1014 (AAT ID - 343); P2044 (elevation above sea level - 486); P170 (creator - 585); P276 (location - 662); P2031 (work period (start) - 705); P31 (instance of - 837); P570 (date of death - 4762); P569 (date of birth - 11954); P571 (inception - 37547) | attribution (62); unspecified calendar (186); fiscal year (259); possibly (452); possibly approximate value (503); hierarchical link is not direct (550); disputed (758); near (820); presumably (3124); circa (55854) |
| pq:P2571 | 11233 | P2374 (natural abundance - 1); P2201 (electric dipole moment - 1); P1855 (Wikidata property example - 1); P577 (publication date - 2); P2102 (boiling point - 8); P2101 (melting point - 9); P2114 (half-life - 905); P2160 (mass excess - 3435); P2067 (mass - 3435); P2154 (binding energy - 3436) | Long Term Evolution (1); 2 sigma (1); 8 (1); 1 (1); 5 (2); expanded uncertainty (15); standard deviation (11212) |
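
Since each cell lists only the ten most frequent entries, it can help to check how much of a qualifier's usage those entries account for. The sketch below does this for the pq:P1480 row, with the cell text and triple count copied from the results above; `top_counts` is a hypothetical helper written for this comment.

```python
import re


def top_counts(cell: str) -> list:
    """Extract the trailing count from each 'Pxx (label - n)' entry."""
    counts = []
    for entry in cell.split("; "):
        # Anchor on the final "- n)" so labels with parentheses still parse.
        match = re.search(r"- (\d+)\)\s*$", entry)
        if match:
            counts.append(int(match.group(1)))
    return counts


# Copied from the pq:P1480 row of the results.
P1480_TRIPLES = 63354
P1480_PROPERTIES = (
    "P19 (place of birth - 329); P1014 (AAT ID - 343); "
    "P2044 (elevation above sea level - 486); P170 (creator - 585); "
    "P276 (location - 662); P2031 (work period (start) - 705); "
    "P31 (instance of - 837); P570 (date of death - 4762); "
    "P569 (date of birth - 11954); P571 (inception - 37547)"
)

if __name__ == "__main__":
    covered = sum(top_counts(P1480_PROPERTIES))
    print(covered, f"{covered / P1480_TRIPLES:.1%}")
```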

Based on the tables above, it seems that:

Summing up, none of the properties above seems reusable as is. We could probably propose a variant of one of them, most likely wd:P4271 or wd:P1480.

marfox commented 5 years ago

Thanks a lot @fracorco for the thorough analysis, much appreciated. My understanding is that none of the existing properties you listed fits our use case.

I suggest we go for a property proposal. Suggestions for the label of the new property are welcome.

Remper commented 5 years ago

label: confidence score
description: a score interpretable as a probability estimate (from 0 to 1) given by the referenced source indicating the quality of the statement.
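
If the 0 to 1 range ends up in the description, the bot could enforce it before writing anything. A minimal sketch, with `validate_confidence` as a made-up helper name; the stray value 100 among the pq:P1107 values in the results above suggests that percentages do slip in when nothing checks the range.

```python
import math


def validate_confidence(score) -> float:
    """Return the score as a float, or raise if it is not a probability."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise TypeError(f"confidence must be a number, got {type(score).__name__}")
    if math.isnan(score) or not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must lie in [0, 1], got {score}")
    return float(score)


if __name__ == "__main__":
    for candidate in (0.5, 1, 100):
        try:
            print(candidate, "->", validate_confidence(candidate))
        except ValueError as error:
            print(candidate, "rejected:", error)
```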