gaurav / extraction-framework

The software used to extract structured data from Wikipedia

Create a simple FileTypeExtractor #3

Closed jimkont closed 10 years ago

jimkont commented 10 years ago

Create a simple FileTypeExtractor that for now should generate a triple like:

dbo:fileExtension "svg".

Once we get all extensions, we can group them to generate class statements (Image, Video, etc.).

gaurav commented 10 years ago

I think this means that I'll need to create a http://mappings.dbpedia.org/index.php/OntologyProperty:fileExtension?

I've written a basic FileTypeExtractor, using dbo:type as a stand-in, at @fdd17fc -- you can see the results at https://github.com/gaurav/commons-extraction/blob/master/commonswiki/20140227/commonswiki-20140227-file-info.ttl
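For readers following along, here is a minimal sketch of the idea behind such an extractor: take the text after the last dot of a file page title and emit one triple. It is not the code in the commit above. The subject URI prefix, the function names, and the use of dbo:fileExtension (rather than the dbo:type stand-in) are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import quote

# Assumptions for illustration only: the subject prefix is hypothetical, and
# the property URI is the one proposed in this issue, not the dbo:type stand-in.
RESOURCE_PREFIX = "http://commons.dbpedia.org/resource/"
FILE_EXTENSION_PROPERTY = "http://dbpedia.org/ontology/fileExtension"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def file_extension(title: str) -> Optional[str]:
    """Return the lowercased text after the last dot, or None if there is none."""
    name, dot, ext = title.rpartition(".")
    ext = ext.strip().lower()
    if not dot or not name or not ext:
        return None
    return ext


def extension_triple(title: str) -> Optional[str]:
    """Build one N-Triples line for a page title such as 'File:Example.svg'.

    The extension is not escaped here, so an extension containing a quote
    or backslash would need extra handling.
    """
    ext = file_extension(title)
    if ext is None:
        return None
    subject = RESOURCE_PREFIX + quote(title.replace(" ", "_"), safe=":/_()")
    return f'<{subject}> <{FILE_EXTENSION_PROPERTY}> "{ext}"^^<{XSD_STRING}> .'


print(extension_triple("File:Example Image.SVG"))
```

Lowercasing the extension means "SVG" and "svg" are counted together. Whether the real extractor should normalise this way is a design choice this thread does not settle.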

gaurav commented 10 years ago

Would the fileExtension property look something like this: http://mappings.dbpedia.org/index.php/User:Gaurav/OntologyProperty:fileExtension

I guess the domain will eventually change to "dbo:File" or "dbo:Media" or something.

jimkont commented 10 years ago

Cool!! Very good:)

yup, you should use http://mappings.dbpedia.org/index.php/User:Gaurav/OntologyProperty:fileExtension and we'll think of the domain shortly :)

we'll also have to take care of corner cases like:

For the latter we can create a dump from the full commons dump and run something like:

bzcat dump_name | cut -d ' ' -f 3 | sort | uniq -c

then see the results and check what type of extensions we have to skip (like max size, etc.)
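The counting pipeline above can also be sketched in Python, which may be easier to adapt when lines need special handling. This is only an illustration: the function names and the sample lines are made up, and it assumes one triple per line with no spaces inside the subject or predicate.

```python
import bz2
from collections import Counter
from typing import Iterable


def count_objects(lines: Iterable[str]) -> Counter:
    """Count the object of each N-Triples line, skipping comments and blanks."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first two spaces only: subject, predicate, rest of line.
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue
        # Drop the trailing " ." statement terminator from the object.
        counts[parts[2].rstrip(" .")] += 1
    return counts


def count_objects_in_dump(path: str) -> Counter:
    """Stream a bzip2-compressed dump (path is a placeholder) and count objects."""
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        return count_objects(dump)


# Made-up sample lines, for illustration only.
XSD = "<http://www.w3.org/2001/XMLSchema#string>"
PROP = "<http://dbpedia.org/ontology/fileExtension>"
sample = [
    "# a comment line",
    f'<http://example.org/File:A.svg> {PROP} "svg"^^{XSD} .',
    f'<http://example.org/File:B.svg> {PROP} "svg"^^{XSD} .',
    f'<http://example.org/File:C.png> {PROP} "png"^^{XSD} .',
]
for obj, n in count_objects(sample).most_common():
    print(n, obj)
```

Unlike `cut -d ' ' -f 3`, this keeps everything after the second space as the object, so a literal containing spaces stays in one piece. A subject containing spaces would still shift the fields.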

On Sat, May 24, 2014 at 4:30 AM, Gaurav Vaidya notifications@github.com wrote:

Would the fileExtension property look something like this: http://mappings.dbpedia.org/index.php/User:Gaurav/OntologyProperty:fileExtension

I guess the domain will eventually change to "dbo:File" or "dbo:Media" or something.

Kontokostas Dimitris

gaurav commented 10 years ago

I've moved this to @91ca078, so it'll be ready to integrate once issue #6 is merged into master.

gaurav commented 10 years ago

As of @3100243, FileTypeExtractor uses http://mappings.dbpedia.org/index.php/OntologyProperty:FileExtension (as 'http://dbpedia.org/ontology/fileExtension'). However, this property doesn't display its domain/range at the URL, unlike, say, http://dbpedia.org/ontology/abbreviation, which has a nearly identical description on the mappings wiki (http://mappings.dbpedia.org/index.php/OntologyProperty:Abbreviation). @jimkont: any idea why this is?

gaurav commented 10 years ago

For future reference, this took 1 hr 42 mins on my laptop, which is probably slower than my work computer. Still, 0.2431ms/page -- not bad! Trying to compress the results for upload now, if possible.

gaurav commented 10 years ago

Here are the results! I had to use ">" as a delimiter -- I think there are some spaces in the subject somewhere -- but that seemed to do the trick!

18261715  "jpg"^^<http://www.w3.org/2001/XMLSchema#string
1307062  "png"^^<http://www.w3.org/2001/XMLSchema#string
833905  "svg"^^<http://www.w3.org/2001/XMLSchema#string
308434  "ogg"^^<http://www.w3.org/2001/XMLSchema#string
240651  "pdf"^^<http://www.w3.org/2001/XMLSchema#string
154452  "gif"^^<http://www.w3.org/2001/XMLSchema#string
145270  "jpeg"^^<http://www.w3.org/2001/XMLSchema#string
136164  "tif"^^<http://www.w3.org/2001/XMLSchema#string
33931  "ogv"^^<http://www.w3.org/2001/XMLSchema#string
33488  "djvu"^^<http://www.w3.org/2001/XMLSchema#string
16192  "tiff"^^<http://www.w3.org/2001/XMLSchema#string
6695  "webm"^^<http://www.w3.org/2001/XMLSchema#string
3611  "mid"^^<http://www.w3.org/2001/XMLSchema#string
1793  "oga"^^<http://www.w3.org/2001/XMLSchema#string
1050  "flac"^^<http://www.w3.org/2001/XMLSchema#string
 605  "xcf"^^<http://www.w3.org/2001/XMLSchema#string
 245  "wav"^^<http://www.w3.org/2001/XMLSchema#string
 160  "kml"^^<http://www.w3.org/2001/XMLSchema#string
  34  "js"^^<http://www.w3.org/2001/XMLSchema#string
   2  "jpe"^^<http://www.w3.org/2001/XMLSchema#string
   1 # started 2014-06-03T00:34:22Z
   1 # completed 2014-06-03T02:16:41Z
   1  "test"^^<http://www.w3.org/2001/XMLSchema#string
   1  "sgv"^^<http://www.w3.org/2001/XMLSchema#string
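With counts like these in hand, the grouping into class statements proposed at the top of this issue could start from a simple lookup table. The sketch below is one possible grouping, not a decision made in this thread. The class names (Image, Video, Sound, Document), the assignments, and the function names are all assumptions. "ogg" is left unmapped on purpose, because an Ogg container can hold either audio or video.

```python
from typing import Dict, Optional

# Hypothetical grouping: class names and assignments are illustrative only.
MEDIA_CLASS_BY_EXTENSION: Dict[str, str] = {
    "jpg": "Image", "jpeg": "Image", "jpe": "Image", "png": "Image",
    "gif": "Image", "svg": "Image", "tif": "Image", "tiff": "Image",
    "xcf": "Image",
    "ogv": "Video", "webm": "Video",
    "oga": "Sound", "flac": "Sound", "wav": "Sound", "mid": "Sound",
    "pdf": "Document", "djvu": "Document",
}


def media_class(extension: str) -> Optional[str]:
    """Look up the (hypothetical) class for an extension, ignoring case."""
    return MEDIA_CLASS_BY_EXTENSION.get(extension.strip().lower())


def totals_by_class(counts: Dict[str, int]) -> Dict[str, int]:
    """Sum per-extension counts into per-class totals."""
    totals: Dict[str, int] = {}
    for extension, n in counts.items():
        key = media_class(extension) or "unclassified"
        totals[key] = totals.get(key, 0) + n
    return totals


# A few rows taken from the counts above.
observed = {"ogv": 33931, "webm": 6695, "ogg": 308434, "kml": 160, "sgv": 1}
for name, total in sorted(totals_by_class(observed).items()):
    print(name, total)
```

Anything that falls into "unclassified" (here "ogg", "kml", and the apparent typo "sgv") is a candidate for either a new class or the skip list.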

gaurav commented 10 years ago

Excluded redirects in @72e3591. Added a warning about long extensions in @f06ce39.

gaurav commented 10 years ago

First part done in @49f31e4; now to make the code clearer and more readable.

gaurav commented 10 years ago

Don't forget to close all related issues in the pull request when you merge this in.