dbpedia / dbpedia-docs

A tutorial about DBpedia and Linked Data in general
GNU General Public License v2.0

issue: error prone and slow conversion of geonames dump file #8

Open neradis opened 10 years ago

neradis commented 10 years ago

When I ran the download script for geonames, the rapper tool reported two errors:

rapper: Error - URI file:1 - Using an element 'Feature' without a namespace is forbidden.
rapper: Error -  - XML parser error: Opening and ending tag mismatch: gn:fs:long line 0 and    wgs84_pos:long

The process also seemed quite slow: after about an hour, less than 5% of the final triple count had been produced, on a server with good capacity and little load.

I used a combination of sed and perl to merge the numerous individual RDF/XML snippets into a single big document. This was much faster and produced no errors during the subsequent conversion to N-Triples with rapper:

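#!/bin/bash
# $1: extracted geonames dump, $2: merged RDF/XML output
sink="$2"
# Write one XML declaration and one rdf:RDF root element carrying all namespaces.
echo '<?xml version="1.0" encoding="UTF-8" standalone="no"?><rdf:RDF xmlns:cc="http://creativecommons.org/ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:gn="http://www.geonames.org/ontology#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#">' > "$sink"

# Print only every second line (-n suppresses the default output; 2~2 is GNU sed syntax),
# then strip the per-record XML declaration and rdf:RDF wrapper.
sed -n '2~2p' < "$1" | perl -pe 's|^<\?xml.+\><rdf:RDF.+?>(.*)</rdf:RDF>$|$1|' >> "$sink"

echo '</rdf:RDF>' >> "$sink"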

$1 - extracted geonames dump file (all-geonames-rdf.txt)
$2 - destination for the big RDF/XML file

perl is usually available by default on all Linux boxes, but in principle it could also be replaced with awk to reduce dependencies.
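
As a rough illustration of the awk idea, here is a minimal sketch that does the line selection and the wrapper stripping in a single awk call, with no sed or perl. It assumes the dump strictly alternates between a URL line and a one-line RDF/XML document, which is what the script above relies on too. The function name `merge_geonames` and the variable names are made up for this example. One difference from the perl version: lines that contain no `<rdf:RDF` tag are skipped rather than passed through unchanged. I have not run this against the full dump.

```shell
#!/bin/bash
# Sketch only: awk-based variant of the merge step (hypothetical helper name).
# Assumes every even-numbered line of the dump is one complete RDF/XML document.
merge_geonames() {
    local src="$1"    # extracted dump, e.g. all-geonames-rdf.txt
    local sink="$2"   # destination for the merged RDF/XML file

    # Single XML declaration and root element with the namespaces used in the dump.
    printf '%s\n' '<?xml version="1.0" encoding="UTF-8" standalone="no"?><rdf:RDF xmlns:cc="http://creativecommons.org/ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:gn="http://www.geonames.org/ontology#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#">' > "$sink"

    awk 'NR % 2 == 0 {
        start = index($0, "<rdf:RDF")       # position of the per-record root tag
        if (start == 0) next                # no RDF/XML on this line: skip it
        rest = substr($0, start)
        gt = index(rest, ">")               # end of the opening <rdf:RDF ...> tag
        if (gt == 0) next
        body = substr(rest, gt + 1)         # everything after the opening tag
        sub(/<\/rdf:RDF>[ \t\r]*$/, "", body)   # drop the closing tag
        print body
    }' "$src" >> "$sink"

    printf '%s\n' '</rdf:RDF>' >> "$sink"
}

# Run as a script only when both arguments are given.
if [ "$#" -eq 2 ]; then
    merge_geonames "$1" "$2"
fi
```

Saved under a name of your choice, it takes the same two arguments as the script above: the extracted dump file first, the destination file second. It uses only `index`, `substr` and `sub`, so it should not depend on GNU-specific awk or sed features.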