europarl / open-data-beta-testing

European Parliament Open Data - Call for beta testers
https://europarl.github.io/open-data-beta-testing

Duplicates in MEP membership data #6

Closed philippbroniecki closed 2 years ago

philippbroniecki commented 2 years ago

I am running the following query on all ttl files from the meps folder (data_v1_meps_0.ttl.txt - data_v1_meps_9.ttl.txt). The output dataset that I get is 195,922 observations but there are only 84,686 unique observations. Is this a mistake or should there be duplicates?

membership_query <- '
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX epvoc: <https://data.europarl.europa.eu/def/epvoc#>
SELECT ?membership_id ?time_period ?organization ?role ?membership_class
WHERE {
    ?s dc:identifier ?membership_id .
    ?s org:memberDuring ?time_period .
    ?s org:organization ?organization .
    ?s org:role ?role .
    ?s epvoc:membershipClassification ?membership_class .
}
'

tfrancart commented 2 years ago

> The output dataset that I get is 195,922 observations but there are only 84,686 unique observations

You might want to double check your data loading or named graph setup. Here is what I get in GraphDB 9.10:

PREFIX org: <http://www.w3.org/ns/org#>
SELECT (COUNT(?membership) AS ?nbMembership)
WHERE {
    ?membership a org:Membership .
}

Returns 103666
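
To follow up on the named graph suggestion, a per-graph count can show whether the same memberships were loaded into more than one graph. This is a minimal sketch in standard SPARQL 1.1. It assumes the files were loaded into named graphs. Triples that sit only in the default graph will not appear under GRAPH ?g in stores that keep the two separate.

```sparql
PREFIX org: <http://www.w3.org/ns/org#>
# Count org:Membership resources per named graph.
SELECT ?g (COUNT(DISTINCT ?membership) AS ?nbMembership)
WHERE {
    GRAPH ?g { ?membership a org:Membership . }
}
GROUP BY ?g
ORDER BY ?g
```

If the per-graph counts add up to noticeably more than the overall count, some memberships are probably present in several graphs, which is one possible source of repeated rows.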

Your query

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX epvoc: <https://data.europarl.europa.eu/def/epvoc#>
SELECT ?membership_id ?time_period ?organization ?role ?membership_class
WHERE {
    ?s dc:identifier ?membership_id .
    ?s org:memberDuring ?time_period .
    ?s org:organization ?organization .
    ?s org:role ?role .
    ?s epvoc:membershipClassification ?membership_class .
}

Returns 84,686 rows.

Wrapping the epvoc:membershipClassification pattern in OPTIONAL returns 103666 rows, which is consistent with the COUNT query. The required pattern was dropping every membership that has no classification.

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX epvoc: <https://data.europarl.europa.eu/def/epvoc#>
SELECT ?membership_id ?time_period ?organization ?role ?membership_class
WHERE {
    ?s dc:identifier ?membership_id .
    ?s org:memberDuring ?time_period .
    ?s org:organization ?organization .
    ?s org:role ?role .
    OPTIONAL { ?s epvoc:membershipClassification ?membership_class . }
}
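
To see how many memberships the required pattern leaves out, one option is to count those with no epvoc:membershipClassification at all. A minimal sketch using FILTER NOT EXISTS from SPARQL 1.1, assuming the same data is loaded (the variable name ?nbUnclassified is only illustrative):

```sparql
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX epvoc: <https://data.europarl.europa.eu/def/epvoc#>
# Count memberships that carry no classification triple.
SELECT (COUNT(DISTINCT ?s) AS ?nbUnclassified)
WHERE {
    ?s a org:Membership .
    FILTER NOT EXISTS { ?s epvoc:membershipClassification ?membership_class . }
}
```

If every membership has exactly one value for each of the other four properties, this count would be expected to equal the gap between the two results above (103666 - 84686 = 18980). That is an assumption and has not been verified here.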

I suspect the SVG diagram in the documentation at https://europarl.github.io/org-ep/ is wrong: it depicts epvoc:membershipClassification with cardinality [1..1], while the table below it (https://europarl.github.io/org-ep/#org:Membership) describes the property with cardinality [0..1], which is what the query results suggest.
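
The upper bound of that cardinality can also be checked against the data itself. A sketch, assuming standard SPARQL 1.1 aggregates: list any membership that has more than one classification value. An empty result would be consistent with an upper bound of 1, while the OPTIONAL result above already suggests the lower bound is 0.

```sparql
PREFIX org: <http://www.w3.org/ns/org#>
PREFIX epvoc: <https://data.europarl.europa.eu/def/epvoc#>
# Memberships with more than one distinct classification value.
SELECT ?s (COUNT(DISTINCT ?membership_class) AS ?nbClass)
WHERE {
    ?s a org:Membership ;
       epvoc:membershipClassification ?membership_class .
}
GROUP BY ?s
HAVING (COUNT(DISTINCT ?membership_class) > 1)
```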