jimkont closed this issue 10 years ago
I think this means that I'll need to create a http://mappings.dbpedia.org/index.php/OntologyProperty:fileExtension?
I've written a basic FileTypeExtractor, using dbo:type as a stand-in, at @fdd17fc -- you can see the results at https://github.com/gaurav/commons-extraction/blob/master/commonswiki/20140227/commonswiki-20140227-file-info.ttl
Would the fileExtension property look something like this? http://mappings.dbpedia.org/index.php/User:Gaurav/OntologyProperty:fileExtension
I guess the domain will eventually change to "dbo:File" or "dbo:Media" or something.
Cool!! Very good:)
Yup, you should use http://mappings.dbpedia.org/index.php/User:Gaurav/OntologyProperty:fileExtension and we'll think of the domain shortly :)
we'll also have to take care of corner cases like:
For the latter we can create a dump from the full commons dump, run something like `bzcat dump_name | cut -d ' ' -f 3 | sort | uniq -c`, and then look at the results to check which kinds of extensions we have to skip (like max size, etc.)
Kontokostas Dimitris
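For anyone who would rather script the tally than run the shell pipeline, here is a minimal Python sketch of the same idea. It is not part of the extraction framework; the function name `count_field` and its parameters are made up for illustration, and it assumes the dump is a bzip2-compressed, line-oriented file with one triple per line.

```python
import bz2
from collections import Counter


def count_field(dump_path, delimiter=" ", field=3):
    """Rough equivalent of: bzcat dump_path | cut -d DELIM -f FIELD | sort | uniq -c

    `field` is 1-based, as in cut. Hypothetical helper, for illustration only.
    """
    counts = Counter()
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        for raw in dump:
            line = raw.rstrip("\n")
            if delimiter not in line:
                # Like cut without -s: lines lacking the delimiter pass through whole.
                value = line
            else:
                parts = line.split(delimiter)
                # Like cut: a missing field yields an empty string.
                value = parts[field - 1] if len(parts) >= field else ""
            counts[value] += 1
    return counts
```

Calling `count_field("dump_name").most_common()` would then list the values by frequency, which is roughly what `sort | uniq -c` followed by a numeric sort gives you.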
I've moved this to @91ca078, so it'll be ready to integrate once issue #6 is merged into master.
As of @3100243, FileTypeExtractor uses http://mappings.dbpedia.org/index.php/OntologyProperty:FileExtension (as 'http://dbpedia.org/ontology/fileExtension'). However, this property doesn't display its domain/range at the URL, unlike, say, http://dbpedia.org/ontology/abbreviation, which has a nearly identical description on the mappings wiki (http://mappings.dbpedia.org/index.php/OntologyProperty:Abbreviation). @jimkont: any idea why this is?
For future reference, this took 1 hr 42 mins on my laptop, which is probably slower than my work computer. Still, 0.2431 ms/page -- not bad! I'm now trying to compress the results for upload, if possible.
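As a quick sanity check on those numbers, the sketch below recomputes the elapsed time from the `# started` and `# completed` timestamps that show up in the results, and derives the page count implied by 0.2431 ms/page. It assumes the per-page figure is simply elapsed time divided by pages processed, which the thread does not state explicitly.

```python
from datetime import datetime

STAMP = "%Y-%m-%dT%H:%M:%SZ"

# Timestamps copied from the "# started" / "# completed" lines in the results.
started = datetime.strptime("2014-06-03T00:34:22Z", STAMP)
completed = datetime.strptime("2014-06-03T02:16:41Z", STAMP)

elapsed_seconds = (completed - started).total_seconds()

# Assumption: ms/page = elapsed time / pages processed.
ms_per_page = 0.2431
implied_pages = elapsed_seconds * 1000 / ms_per_page

print(f"elapsed: {elapsed_seconds:.0f} s, implied pages: {implied_pages:,.0f}")
```

Under that assumption the run covered on the order of 25 million pages, which is more than the number of fileExtension triples since not every page is a file.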
Here are the results! I had to use ">" as a delimiter -- I think there are some spaces in the subject somewhere -- but that seemed to do the trick!
18261715 "jpg"^^<http://www.w3.org/2001/XMLSchema#string
1307062 "png"^^<http://www.w3.org/2001/XMLSchema#string
833905 "svg"^^<http://www.w3.org/2001/XMLSchema#string
308434 "ogg"^^<http://www.w3.org/2001/XMLSchema#string
240651 "pdf"^^<http://www.w3.org/2001/XMLSchema#string
154452 "gif"^^<http://www.w3.org/2001/XMLSchema#string
145270 "jpeg"^^<http://www.w3.org/2001/XMLSchema#string
136164 "tif"^^<http://www.w3.org/2001/XMLSchema#string
33931 "ogv"^^<http://www.w3.org/2001/XMLSchema#string
33488 "djvu"^^<http://www.w3.org/2001/XMLSchema#string
16192 "tiff"^^<http://www.w3.org/2001/XMLSchema#string
6695 "webm"^^<http://www.w3.org/2001/XMLSchema#string
3611 "mid"^^<http://www.w3.org/2001/XMLSchema#string
1793 "oga"^^<http://www.w3.org/2001/XMLSchema#string
1050 "flac"^^<http://www.w3.org/2001/XMLSchema#string
605 "xcf"^^<http://www.w3.org/2001/XMLSchema#string
245 "wav"^^<http://www.w3.org/2001/XMLSchema#string
160 "kml"^^<http://www.w3.org/2001/XMLSchema#string
34 "js"^^<http://www.w3.org/2001/XMLSchema#string
2 "jpe"^^<http://www.w3.org/2001/XMLSchema#string
1 # started 2014-06-03T00:34:22Z
1 # completed 2014-06-03T02:16:41Z
1 "test"^^<http://www.w3.org/2001/XMLSchema#string
1 "sgv"^^<http://www.w3.org/2001/XMLSchema#string
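A note on reading this output: with ">" as the `cut` delimiter, field 3 is the object literal minus its closing ">", which is why every datatype URI looks truncated. Lines without any ">" (the `# started` and `# completed` comments) pass through `cut` whole and get counted once each. Here is a small sketch for turning the tally into a dictionary; `parse_tally` is a hypothetical helper, and the regular expression only assumes the line shape shown above.

```python
import re

# Matches lines such as:  18261715 "jpg"^^<http://www.w3.org/2001/XMLSchema#string
TALLY_LINE = re.compile(r'^\s*(\d+)\s+"([^"]*)"\^\^<')


def parse_tally(text):
    """Split `uniq -c` output into {extension: count} and a list of other lines."""
    counts = {}
    skipped = []
    for line in text.splitlines():
        match = TALLY_LINE.match(line)
        if match:
            extension = match.group(2)
            counts[extension] = counts.get(extension, 0) + int(match.group(1))
        elif line.strip():
            # Anything else, e.g. the "# started ..." comment lines.
            skipped.append(line.strip())
    return counts, skipped
```

Sorting the resulting dictionary by count makes the one-off values (like "test" and "sgv") easy to spot for manual review.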
Excluded redirects in @72e3591. Added a warning about long extensions in @f06ce39.
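To illustrate the behaviour described here (skip redirects, warn on long extensions), a rough Python sketch follows. This is not the code in those commits: the function name, the `is_redirect` flag, the lowercasing step, and the threshold of 4 characters are all assumptions made for the example.

```python
import logging

logger = logging.getLogger("file_type_sketch")

# Hypothetical threshold; the actual value used in the extractor is not given here.
MAX_EXTENSION_LENGTH = 4


def file_extension(title, is_redirect=False):
    """Return the lowercased extension of a page title, or None if there is none."""
    if is_redirect:
        # Redirect pages are excluded entirely.
        return None
    _, dot, extension = title.rpartition(".")
    if not dot or not extension:
        # No dot at all, or a trailing dot with nothing after it.
        return None
    extension = extension.lower()  # assumption: extensions are normalized to lowercase
    if len(extension) > MAX_EXTENSION_LENGTH:
        logger.warning("Unusually long extension %r in title %r", extension, title)
    return extension
```

For example, `file_extension("File:Example.JPG")` gives `"jpg"`, while a redirect or a title without a dot gives `None`.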
First part done in @49f31e4; now to make the code clearer and more readable.
Don't forget to close all related issues in the pull request when you merge this in.
Create a simple FileTypeExtractor that for now should generate a triple like: