dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

films conflated with their soundtracks #482

Open chrysn opened 8 years ago

chrysn commented 8 years ago

(i hope this is the right place to report this -- if not, please let me know whom to talk to).

due to the frequent presence of soundtrack infoboxes in film articles, the films' dbpedia resources get polluted with metadata from the soundtracks. examples:

it seems to me that the respective table / infobox extractors are doing roughly the right thing, but the extraction process can't tell when the wikipedia author is switching topics.

chile12 commented 8 years ago

Yes, this is a common problem in the German DBpedia for Person <-> ChartPlacement as well. You could have a look at the alternative type statements in the following datasets (for en), which are based on different algorithms for extracting types automatically (not based on infoboxes):

We are discussing how to deal with issues like this in the background. Watch out for upcoming changes regarding this.

m1ci commented 4 years ago

an update @chrysn: just checked the latest (2020.04.01) release https://databus.dbpedia.org/dbpedia/generic/infobox-properties/ and there is only one dbp:length triple:

<http://dbpedia.org/resource/Taxi_Driver> <http://dbpedia.org/property/length> "3693.0"^^<http://dbpedia.org/datatype/second> .

The value is also correct, 3693 seconds == 61m 33s
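The conversion can be checked with a couple of lines of Python. This is only a sketch; `seconds_to_min_sec` is a name made up for this illustration and is not part of the extraction framework.

```python
def seconds_to_min_sec(total_seconds: float) -> tuple:
    """Split a duration in seconds into whole minutes and leftover seconds."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return minutes, seconds


# 61 * 60 + 33 == 3693, so the extracted value corresponds to 61m 33s.
print(seconds_to_min_sec(3693.0))  # (61, 33)
```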

Since this info is from the generic extraction, I'm not sure we can come up with a reasonable general test. But we can write one for this case.
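A test for this specific case might look roughly like the sketch below. It assumes the extraction output is available as N-Triples text, and it uses a simplified regex rather than a real parser. `count_triples`, `SAMPLE` and the test function are hypothetical names for illustration; none of this reflects the framework's own test setup.

```python
import re

# Matches one simple N-Triples line: <subject> <predicate> object .
# This is a simplification for illustration, not a full N-Triples parser.
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.\s*$')

DBP_LENGTH = "http://dbpedia.org/property/length"
TAXI_DRIVER = "http://dbpedia.org/resource/Taxi_Driver"


def count_triples(ntriples: str, subject: str, predicate: str) -> int:
    """Count triples with the given subject and predicate (hypothetical helper)."""
    count = 0
    for line in ntriples.splitlines():
        match = TRIPLE.match(line.strip())
        if match and match.group(1) == subject and match.group(2) == predicate:
            count += 1
    return count


# The single triple quoted above, used here as stand-in extraction output.
SAMPLE = (
    '<http://dbpedia.org/resource/Taxi_Driver> '
    '<http://dbpedia.org/property/length> '
    '"3693.0"^^<http://dbpedia.org/datatype/second> .\n'
)


def test_single_length_triple() -> None:
    # A film resource should carry exactly one dbp:length value.
    assert count_triples(SAMPLE, TAXI_DRIVER, DBP_LENGTH) == 1


test_single_length_triple()
```

In a real test, `SAMPLE` would be replaced by the actual output for the article under test.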

chrysn commented 4 years ago

The time values are gone now because the WP article is not that detailed any more. The underlying problem still persists and manifests itself in the statements

<http://dbpedia.org/resource/Taxi_Driver> <http://dbpedia.org/property/name> "Taxi Driver"@en, "Taxi Driver: Original Soundtrack Recording"@en .

Both the info box about the film and the inside info box about the soundtrack are related to the topic of the article.
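The symptom described here, one resource carrying several dbp:name values, can be looked for mechanically. The following is a rough sketch, assuming the data is available as N-Triples with one statement per line (the comma shorthand above is Turtle, so the two values appear as two lines here). The regex only handles plain and language-tagged literals, and `subjects_with_multiple_values` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Simplified pattern for: <subject> <predicate> "literal"@lang .
LITERAL_TRIPLE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"(?:@[A-Za-z-]+)?\s*\.\s*$'
)

DBP_NAME = "http://dbpedia.org/property/name"


def subjects_with_multiple_values(ntriples: str, predicate: str) -> dict:
    """Return {subject: [values]} for subjects with more than one value."""
    values = defaultdict(list)
    for line in ntriples.splitlines():
        match = LITERAL_TRIPLE.match(line.strip())
        if match and match.group(2) == predicate:
            values[match.group(1)].append(match.group(3))
    return {s: v for s, v in values.items() if len(v) > 1}


NAMES = (
    '<http://dbpedia.org/resource/Taxi_Driver> '
    '<http://dbpedia.org/property/name> "Taxi Driver"@en .\n'
    '<http://dbpedia.org/resource/Taxi_Driver> '
    '<http://dbpedia.org/property/name> '
    '"Taxi Driver: Original Soundtrack Recording"@en .\n'
)

for subject, names in subjects_with_multiple_values(NAMES, DBP_NAME).items():
    print(subject, names)
```

Several names on one resource do not prove conflation (an article can legitimately have alternative titles), so a hit is a candidate to inspect, not a verdict.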

In manually created metadata (say, RDFa), that would be easy to solve by packing the "Music" chapter into its own named context, such that the above statements would be

<http://dbpedia.org/resource/Taxi_Driver> <http://dbpedia.org/property/name> "Taxi Driver"@en .

<http://dbpedia.org/resource/Taxi_Driver#Music> <http://dbpedia.org/property/name> "Taxi Driver: Original Soundtrack Recording"@en .

Obviously, with the infobox-based extraction from WP it's harder, and I know too little about the general mechanisms involved here to make a concrete usable suggestion.
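To make the idea of a per-section subject concrete, here is a minimal sketch. It assumes, purely for illustration, that each infobox arrives together with the heading of the section it sits under (`None` for the article's lead). That input shape and the names `scoped_subject` and `name_triples` are invented here and say nothing about how the extraction framework actually works.

```python
from typing import List, Optional, Tuple
from urllib.parse import quote

DBP_NAME = "http://dbpedia.org/property/name"


def scoped_subject(article_iri: str, section: Optional[str]) -> str:
    """Use the article IRI for the lead, and IRI#Section for anything else."""
    if section is None:
        return article_iri
    return f"{article_iri}#{quote(section.replace(' ', '_'))}"


def name_triples(article_iri: str,
                 infoboxes: List[Tuple[Optional[str], str]]) -> List[str]:
    """Emit one dbp:name statement per (section, name) pair."""
    lines = []
    for section, name in infoboxes:
        subject = scoped_subject(article_iri, section)
        # Escape backslashes and quotes so the literal stays well-formed.
        literal = name.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject}> <{DBP_NAME}> "{literal}"@en .')
    return lines


triples = name_triples(
    "http://dbpedia.org/resource/Taxi_Driver",
    [
        (None, "Taxi Driver"),  # film infobox in the lead
        ("Music", "Taxi Driver: Original Soundtrack Recording"),  # nested one
    ],
)
print("\n".join(triples))
```

With this input, the film's name stays on the article resource and the soundtrack's name moves to the `#Music` subject, matching the two statements written out above.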