dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

foaf:depiction IRIs extracted from <gallery> sometimes contain newline character #748

Open jmkeil opened 1 year ago

Issue validity

This issue first appeared in the monthly job run on 2023-04-10, so the problem has existed since 2023-03-10 at the earliest.

Error Description

Some object IRIs of foaf:depiction statements contain a newline character (U+000A, \n). I think this occurs in statements extracted from <gallery> when the caption of the previous image ends with a link. I do not think this is related to #562, because the error is caused not by a character in the IRI itself but by the context from which it was extracted.
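The hypothesis above can be stated as a small check. The following is only a sketch of the suspected pattern, not the extraction framework's parser: the function name `predicted_affected_images` is hypothetical, and it assumes one image per non-empty line of the gallery body and that "ends with a link" means the line ends with `]]`.

```python
from typing import List


def predicted_affected_images(gallery_body: str) -> List[int]:
    """Return 1-based image numbers whose preceding caption ends with a wiki link."""
    # Assumption: each non-empty line of the gallery body describes one image.
    lines = [line for line in gallery_body.splitlines() if line.strip()]
    affected = []
    for index in range(1, len(lines)):
        # Hypothesis from this report: the newline leaks into the next file
        # name when the previous line ends with a link, i.e. with "]]".
        if lines[index - 1].rstrip().endswith("]]"):
            affected.append(index + 1)
    return affected


sample = "A.jpg|[[Some link]]\nB.jpg|plain caption\nC.jpg|another caption"
print(predicted_affected_images(sample))
```

If the hypothesis holds, only the image that follows a link-terminated caption (here the second one) should be reported.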

Pinpointing the source of the error

Wrong triples RDF snippet

<http://dbpedia.org/resource/Neutron_Star_Interior_Composition_Explorer> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/\u000ANICER_graphic_labeled.png> .
<http://dbpedia.org/resource/Embassy_of_Italy,_London> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/\u000AEmbassy_of_Italy_in_London_4.jpg> .
<http://dbpedia.org/resource/Andriyan_Nikolayev> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/\u000AБронзовый_бюст_космонавта_Николаева_(Шоршелы).jpg> .
<http://dbpedia.org/resource/Space_Shuttle_Independence> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/\u000AHughes_satellite_booster_mockup_in_Shuttle_Independence_cargo_bay_(24652005131).jpg> .
<http://dbpedia.org/resource/Space_Shuttle_Independence> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/\u000AShuttle_Independence_and_NASA_905_at_Space_Center_Houston.jpg> .
<http://dbpedia.org/resource/Lunar_Reconnaissance_Orbiter> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/\u000ALRO_WAC_North_Pole_Mosaic_(PIA14024).jpg> .
<http://dbpedia.org/resource/Lunar_Reconnaissance_Orbiter> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/\u000ALRO_WAC_South_Pole_Mosaic.jpg> .
<http://dbpedia.org/resource/Lunar_Reconnaissance_Orbiter> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/\u000AMoon_Farside_LRO.jpg> .
<http://dbpedia.org/resource/IMAGE_(spacecraft)> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/\u000AEarth_plasmasphere_in_EUV.jpg> .
<http://dbpedia.org/resource/Skylab_3> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/\u000AOwen_Garriott_at_the_Apollo_Telescope_Mount_console.jpg> .
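
To find further affected statements in an N-Triples dump, a line-based scan along the following lines may be enough. This is a minimal sketch: the helper name `has_newline_in_depiction` is made up for illustration, and it assumes one triple per line with single spaces between subject, predicate, and object, as in the snippet above.

```python
import re

DEPICTION = "<http://xmlns.com/foaf/0.1/depiction>"
# A literal backslash followed by u000A (or u000a), as written in N-Triples.
ESCAPED_NEWLINE = re.compile(r"\\u000[aA]")


def has_newline_in_depiction(line: str) -> bool:
    """Return True if the line is a foaf:depiction triple with an escaped newline in its object."""
    # IRIs cannot contain spaces, so the first two spaces separate
    # subject, predicate, and the rest (object plus the closing " .").
    parts = line.strip().split(" ", 2)
    if len(parts) != 3 or parts[1] != DEPICTION:
        return False
    return ESCAPED_NEWLINE.search(parts[2]) is not None
```

Applying the function to each line of a dump file and collecting the subjects of the matching lines would give a list of affected resources.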

Expected / corrected RDF outcome snippet

<http://dbpedia.org/resource/Neutron_Star_Interior_Composition_Explorer> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/NICER_graphic_labeled.png> .
<http://dbpedia.org/resource/Embassy_of_Italy,_London> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/Embassy_of_Italy_in_London_4.jpg> .
<http://dbpedia.org/resource/Andriyan_Nikolayev> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/Бронзовый_бюст_космонавта_Николаева_(Шоршелы).jpg> .
<http://dbpedia.org/resource/Space_Shuttle_Independence> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/Hughes_satellite_booster_mockup_in_Shuttle_Independence_cargo_bay_(24652005131).jpg> .
<http://dbpedia.org/resource/Space_Shuttle_Independence> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/Shuttle_Independence_and_NASA_905_at_Space_Center_Houston.jpg> .
<http://dbpedia.org/resource/Lunar_Reconnaissance_Orbiter> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/LRO_WAC_North_Pole_Mosaic_(PIA14024).jpg> .
<http://dbpedia.org/resource/Lunar_Reconnaissance_Orbiter> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/LRO_WAC_South_Pole_Mosaic.jpg> .
<http://dbpedia.org/resource/Lunar_Reconnaissance_Orbiter> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/Moon_Farside_LRO.jpg> .
<http://dbpedia.org/resource/IMAGE_(spacecraft)> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/Earth_plasmasphere_in_EUV.jpg> .
<http://dbpedia.org/resource/Skylab_3> <http://xmlns.com/foaf/0.1/depiction> <http://commons.wikimedia.org/wiki/Special:FilePath/Owen_Garriott_at_the_Apollo_Telescope_Mount_console.jpg> .
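
Until the extractor is fixed, consumers could repair an existing dump in a post-processing step. The sketch below is a workaround under assumptions, not a fix in the extraction framework: `repair_depiction_triple` is a hypothetical helper, and it assumes the same one-triple-per-line layout as the snippets above.

```python
import re

DEPICTION = "<http://xmlns.com/foaf/0.1/depiction>"
# A literal backslash followed by u000A (or u000a), as written in N-Triples.
ESCAPED_NEWLINE = re.compile(r"\\u000[aA]")


def repair_depiction_triple(line: str) -> str:
    """Remove escaped newlines from the object of a foaf:depiction triple."""
    parts = line.split(" ", 2)
    if len(parts) != 3 or parts[1] != DEPICTION:
        # Leave every other statement untouched.
        return line
    cleaned_object = ESCAPED_NEWLINE.sub("", parts[2])
    return " ".join([parts[0], parts[1], cleaned_object])
```

Restricting the replacement to the object of foaf:depiction statements avoids touching literals in other statements that may legitimately contain an escaped newline.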

Example DBpedia resource URL(s)

http://dbpedia.org/resource/Neutron_Star_Interior_Composition_Explorer
http://dbpedia.org/resource/Embassy_of_Italy,_London
http://dbpedia.org/resource/Andriyan_Nikolayev
http://dbpedia.org/resource/Space_Shuttle_Independence
http://dbpedia.org/resource/Lunar_Reconnaissance_Orbiter
http://dbpedia.org/resource/IMAGE_(spacecraft)
http://dbpedia.org/resource/Skylab_3

Example Wikipedia source from http://dbpedia.org/resource/Lunar_Reconnaissance_Orbiter; errors occur for images 2, 3, and 4

<gallery caption="[[Moon|The Moon]]" heights="150px" mode="packed">
LRO WAC Nearside Mosaic.jpg |[[Near side of the Moon|Lunar near side]]
Moon Farside LRO.jpg |[[Far side of the Moon|Lunar far side]]
LRO WAC North Pole Mosaic (PIA14024).jpg|[[Lunar north pole]]
LRO WAC South Pole Mosaic.jpg|[[Lunar south pole]]
</gallery>

Example Wikipedia source from http://en.wikipedia.org/wiki/Neutron_Star_Interior_Composition_Explorer, an error occurs for image 4 only

<gallery mode="packed" heights="200">
KSC-20170603-PH AWG06 0008 (35119281635, cropped).jpg|Launch of CRS-11 with NICER aboard
Nicer-extraction-loop 0.gif|NICER extracted from Dragon's trunk at ISS
Neutron star Interior Composition Explorer (NICER) - 34718447596.jpg|Array of [[X-ray optics|X-ray lenses]]
NICER graphic labeled.png|Labeled diagram of NICER
</gallery>
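
For comparison, here is a sketch of a line-based reading of a gallery body that cannot carry a newline into the file name. It is not the framework's actual implementation (which is written in Scala); the function name `gallery_file_names` is hypothetical, and the sketch ignores optional `File:` prefixes and gallery attributes.

```python
from typing import List


def gallery_file_names(gallery_body: str) -> List[str]:
    """Extract underscore-separated file names from the body of a <gallery> element."""
    names = []
    for line in gallery_body.splitlines():
        # The file name is everything before the first pipe; the caption,
        # which may itself contain pipes inside links, comes after it.
        name = line.split("|", 1)[0].strip()
        if name:
            names.append(name.replace(" ", "_"))
    return names


body = """
LRO WAC Nearside Mosaic.jpg |[[Near side of the Moon|Lunar near side]]
Moon Farside LRO.jpg |[[Far side of the Moon|Lunar far side]]
"""
print(gallery_file_names(body))
```

Splitting on line boundaries first and trimming whitespace from each name means a caption that ends with a link has no effect on the following file name.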