Open marfox opened 5 years ago
To assess the possibility of reusing an existing qualifier property, I went through all of them using this SPARQL query:

    select *
    where {
      ?p wdt:P31 wd:Q15720608 ;
         rdfs:label ?l ;
         schema:description ?d .
      filter (lang(?l) = "en")
      filter (lang(?d) = "en")
      optional { ?p wikibase:propertyType ?t }
    }

Relevant properties are:
id | label | description | type |
---|---|---|---|
wd:P1107 | proportion | to be used as a qualifier, value must be between 0 and 1 | wikibase:Quantity |
wd:P4271 | rating | qualifier to indicate a score given by the referenced source indicating the quality or completeness of the statement | wikibase:WikibaseItem |
wd:P1480 | sourcing circumstances | qualification of the truth or accuracy of a source: circa (Q5727902), near (Q21818619), presumably (Q18122778), etc. | wikibase:WikibaseItem |
wd:P2571 | uncertainty corresponds to | number of standard deviations (sigma) expressing the confidence level of a value | wikibase:WikibaseItem |

For these qualifiers, I extracted the number of triples, the properties they are most frequently used with, and the values they most frequently take, using the following query (replace the three occurrences of P1107 with each of the properties above):

    select ?triples ?properties ?values {
      # total number of triples that use the qualifier
      { select (count(*) as ?triples) { ?s pq:P1107 ?o } }
      # up to ten properties whose statements most often carry the qualifier
      { select (group_concat(?v; separator="; ") as ?properties) {
          { select ?p ?l (count(*) as ?n) {
              ?e ?p ?s . ?s pq:P1107 ?o .
              optional { ?pe wikibase:claim ?p ; rdfs:label ?l filter(lang(?l) = "en") }
            }
            group by ?p ?l order by desc(?n) limit 10
          }
          bind (concat(strafter(str(?p), "http://www.wikidata.org/prop/"),
                       " (", ?l, " - ", str(?n), ")") as ?v)
        }
      }
      # up to ten most frequent qualifier values
      { select (group_concat(?v; separator="; ") as ?values) {
          { select ?o ?l (count(*) as ?n) {
              ?s pq:P1107 ?o .
              optional { ?o rdfs:label ?l filter(lang(?l) = "en") }
            }
            group by ?o ?l order by desc(?n) limit 10
          }
          bind (concat(coalesce(?l, str(?o)), " (", str(?n), ")") as ?v)
        }
      }
    }

qualifier | triples | properties | values |
---|---|---|---|
pq:P1107 | 9448 | P1344 (participant of - 1); P444 (review score - 3); P3357 (negative diagnostic predictor - 6); P3358 (positive prognostic predictor - 76); P3356 (positive diagnostic predictor - 95); P3359 (negative prognostic predictor - 244); P3355 (negative therapeutic predictor - 619); P3354 (positive therapeutic predictor - 1016) | 0.8 (56); 0.3 (57); 0.9 (65); 0.4 (67); 100 (68); 0.2 (83); 0.25 (88); 0.1 (97); 0.5 (253); 1 (3916) |
pq:P4271 | 2060 | P1344 (participant of - 1); P444 (review score - 3); P3357 (negative diagnostic predictor - 6); P3358 (positive prognostic predictor - 76); P3356 (positive diagnostic predictor - 95); P3359 (negative prognostic predictor - 244); P3355 (negative therapeutic predictor - 619); P3354 (positive therapeutic predictor - 1016) | UEFA stadium categories (1); D (1); Charity Navigator four-star rating (2); CIViC 1-star trust rating (96); CIViC 5-star trust rating (104); CIViC 4-star trust rating (469); CIViC 2-star trust rating (497); CIViC 3-star trust rating (890) |
pq:P1480 | 63354 | P19 (place of birth - 329); P1014 (AAT ID - 343); P2044 (elevation above sea level - 486); P170 (creator - 585); P276 (location - 662); P2031 (work period (start) - 705); P31 (instance of - 837); P570 (date of death - 4762); P569 (date of birth - 11954); P571 (inception - 37547) | attribution (62); unspecified calendar (186); fiscal year (259); possibly (452); possibly approximate value (503); hierarchical link is not direct (550); disputed (758); near (820); presumably (3124); circa (55854) |
pq:P2571 | 11233 | P2374 (natural abundance - 1); P2201 (electric dipole moment - 1); P1855 (Wikidata property example - 1); P577 (publication date - 2); P2102 (boiling point - 8); P2101 (melting point - 9); P2114 (half-life - 905); P2160 (mass excess - 3435); P2067 (mass - 3435); P2154 (binding energy - 3436) | Long Term Evolution (1); 2 sigma (1); 8 (1); 1 (1); 5 (2); expanded uncertainty (15); standard deviation (11212) |
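
The manual step of replacing P1107 with each qualifier could also be scripted. The following is only a sketch: the `__PID__` placeholder, the `build_usage_query` helper and the `QUALIFIERS` list are names made up for illustration, and the query body simply mirrors the usage query in this thread. It only builds the query strings; sending them to an endpoint is left out.

```python
import re

# Placeholder token (an assumption of this sketch) marking where the
# qualifier property ID is substituted into the usage query.
PLACEHOLDER = "__PID__"

USAGE_QUERY_TEMPLATE = """
select ?triples ?properties ?values {
  { select (count(*) as ?triples) { ?s pq:__PID__ ?o } }
  { select (group_concat(?v; separator="; ") as ?properties) {
      { select ?p ?l (count(*) as ?n) {
          ?e ?p ?s . ?s pq:__PID__ ?o .
          optional { ?pe wikibase:claim ?p ; rdfs:label ?l filter(lang(?l) = "en") }
        }
        group by ?p ?l order by desc(?n) limit 10
      }
      bind (concat(strafter(str(?p), "http://www.wikidata.org/prop/"),
                   " (", ?l, " - ", str(?n), ")") as ?v)
    }
  }
  { select (group_concat(?v; separator="; ") as ?values) {
      { select ?o ?l (count(*) as ?n) {
          ?s pq:__PID__ ?o .
          optional { ?o rdfs:label ?l filter(lang(?l) = "en") }
        }
        group by ?o ?l order by desc(?n) limit 10
      }
      bind (concat(coalesce(?l, str(?o)), " (", str(?n), ")") as ?v)
    }
  }
}
"""


def build_usage_query(pid: str) -> str:
    """Return the usage query with every placeholder replaced by `pid`."""
    # Accept only strings shaped like a property ID, e.g. "P1107".
    if not re.fullmatch(r"P[1-9][0-9]*", pid):
        raise ValueError(f"not a property ID: {pid!r}")
    return USAGE_QUERY_TEMPLATE.replace(PLACEHOLDER, pid)


# The four candidate qualifiers discussed in this thread.
QUALIFIERS = ["P1107", "P4271", "P1480", "P2571"]
queries = {pid: build_usage_query(pid) for pid in QUALIFIERS}
```

Generating all the queries from one template keeps the three occurrences in sync, so a row cannot accidentally mix two different qualifiers.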
Based on the tables above, it seems that:
- wd:P1107 'proportion' is the only property accepting a quantity value (and the 0-1 range would be perfect for us), but it is essentially used to express percentage of possession / composition.
- wd:P4271 'rating' is used almost exclusively with properties and 5-star rating values related to the CIViC database (a resource for Clinical Interpretation of Variants in Cancer).
- wd:P1480 'sourcing circumstances' is also defined as 'accuracy', 'reliability', 'confidence', 'precision', 'certainty', 'validity', 'qualitative valuation', all terms that closely match our needs. It is used with a variety of properties, i.e., it appears to be domain-general, which is good for us. However, it takes categorical values whose meaning is rather fuzzy.
- wd:P2571 'uncertainty corresponds to' has a very precise definition (number of standard deviations), but unintuitively it takes an Item value, and in almost all cases that value is the constant 'standard deviation'. Besides, it is applied to numerical / date properties, for which standard deviation makes sense.

Summing up, none of the properties above seems reusable as it is. We can probably propose a variation of one of them, especially wd:P4271 or wd:P1480.

Thanks a lot @fracor for the thorough analysis, much appreciated. My understanding is that none of the existing properties you listed fits our use case.
I suggest going for a property proposal. Any suggestions for the label of the new property are welcome.
- label: confidence score
- description: a score interpretable as a probability estimate (from 0 to 1) given by the referenced source indicating the quality of the statement.

For probabilistic output, it would be optimal for the bot to add a qualifier with a float value, representing the confidence score of a given statement.
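
To make the idea concrete, here is a rough sketch of how a bot might package such a score as a quantity qualifier. It assumes the Wikibase JSON layout for quantity values (a signed decimal string in `amount`, with `"1"` as the unit for dimensionless numbers). The `confidence_qualifier` helper is hypothetical, and `P00000` is a placeholder because the proposed property does not exist yet.

```python
from decimal import Decimal


def confidence_qualifier(score, property_id):
    """Build a quantity snak holding a confidence score in [0, 1].

    `property_id` must be supplied by the caller; no real property
    exists for this yet, so any ID used here is a placeholder.
    """
    value = Decimal(str(score))
    # Reject NaN/infinity first, then enforce the 0-1 range.
    if not value.is_finite() or not (Decimal(0) <= value <= Decimal(1)):
        raise ValueError(f"confidence score must be within [0, 1], got {score!r}")
    return {
        "snaktype": "value",
        "property": property_id,
        "datavalue": {
            "type": "quantity",
            # Signed, fixed-point decimal string; "1" means no unit.
            "value": {"amount": "+" + format(value, "f"), "unit": "1"},
        },
    }


# Hypothetical usage with a placeholder property ID.
snak = confidence_qualifier(0.85, "P00000")
```

Validating the range before building the snak keeps out-of-range values, such as the 100 that shows up among the 'proportion' values above, from reaching Wikidata.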