dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

New extractor requirement #700

Open mubashar1199 opened 3 years ago

mubashar1199 commented 3 years ago

Hello,

I want to create a new extractor, but I am unable to understand the following:

1: I want to create a new output dataset file; just creating a new dataset in Dataset.scala is not working for me.

2: I want to iterate over all the RDF triples in the mappingbased-objects-uncleaned.ttl.bz2 file, perform some processing, and then generate new RDF triples in a newly created dataset file. This also needs to run last, after all other extraction has finished. In the gender extractor the following comment is written: // Even better: in the first extraction pass, extract all types. Use them in the second pass. How can this multi-pass functionality be implemented? (See the sketch after this comment.)

Please tell me how I can perform the above operations.

Thanks
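For illustration, here is a rough sketch of the two-pass processing asked about in point 2, written as plain Scala over the dump file rather than framework code. It assumes a local copy of the dump and the Apache Commons Compress library for bz2 decompression; the DBpedia .ttl dumps hold one triple per line, so a line-oriented pass is enough. The property being filtered on and the derived predicate are made up for the example.

```scala
import java.io.{BufferedInputStream, FileInputStream, PrintWriter}
import scala.io.Source
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream

object TwoPassSketch {

  /** Opens a .ttl.bz2 dump as an iterator of lines (one triple per line in DBpedia dumps).
    * Streams are not explicitly closed here to keep the sketch short. */
  def lines(path: String): Iterator[String] = {
    val in = new BZip2CompressorInputStream(new BufferedInputStream(new FileInputStream(path)))
    Source.fromInputStream(in, "UTF-8").getLines()
  }

  def main(args: Array[String]): Unit = {
    val dump = "mappingbased-objects-uncleaned.ttl.bz2" // local copy of the dump

    // Pass 1: collect the global information the rules need, e.g. every subject
    // that has a given property (the property itself is just an example).
    val subjectsOfInterest = lines(dump).collect {
      case line if line.contains("<http://dbpedia.org/ontology/spouse>") =>
        line.takeWhile(_ != ' ') // the subject IRI including angle brackets
    }.toSet

    // Pass 2: re-read the dump and write derived triples into a new dataset file.
    val out = new PrintWriter("my-derived-dataset.ttl")
    try {
      for (line <- lines(dump);
           subject = line.takeWhile(_ != ' ')
           if subjectsOfInterest.contains(subject)) {
        // the derived predicate below is a placeholder for whatever the rules produce
        out.println(s"$subject <http://example.org/markedByRule> \"true\" .")
      }
    } finally out.close()
  }
}
```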

jimkont commented 3 years ago

Hi @mubashar1199 , typically an extractor is a Scala class that is run on every Wikipedia page and tries to extract specific information from that page. For example, the existing LabelExtractor extracts the page name, and the HomepageExtractor tries to detect the homepage of the person/organization that a Wikipedia page is about.

Each extractor writes the extracted triples into specific datasets. It is usually a 1-1 mapping, e.g. LabelExtractor -> label dataset, but some extractors that produce a lot of information may split the data into multiple datasets.

Are you trying to write a new extractor, or to post-process the existing datasets to form a new dataset?
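For the "new extractor" route, here is a schematic of the shape such a class has, modeled loosely on LabelExtractor. The trait, constructor, signatures, and import paths differ between framework versions, so treat the names below as placeholders and copy an existing extractor as the real starting point.

```scala
// Schematic only: the package layout and exact signatures vary between framework
// versions, so adjust the imports and copy a real extractor (e.g. LabelExtractor).
import org.dbpedia.extraction.config.provenance.DBpediaDatasets
import org.dbpedia.extraction.mappings.PageNodeExtractor
import org.dbpedia.extraction.ontology.Ontology
import org.dbpedia.extraction.transform.Quad
import org.dbpedia.extraction.util.Language
import org.dbpedia.extraction.wikiparser.{Namespace, PageNode}

class MyNewExtractor(
  context: { def ontology: Ontology; def language: Language }
) extends PageNodeExtractor {

  // 1-1 mapping: this extractor writes into exactly one dataset.
  // Labels is only a stand-in; a new extractor would reference its own, newly declared dataset.
  override val datasets = Set(DBpediaDatasets.Labels)

  // Called once per Wikipedia page; returns the triples extracted from that page.
  override def extract(page: PageNode, subjectUri: String): Seq[Quad] = {
    if (page.title.namespace != Namespace.Main) return Seq.empty

    // ... inspect the page node tree and build Quad objects here ...
    Seq.empty
  }
}
```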

JJ-Author commented 3 years ago

This seems like a post-processing step. Check this out: http://dev.dbpedia.org/Post-Processing

mubashar1199 commented 3 years ago

> Hi @mubashar1199 , typically an extractor is a Scala class that is run on every Wikipedia page and tries to extract specific information from that page. For example, the existing LabelExtractor extracts the page name, and the HomepageExtractor tries to detect the homepage of the person/organization that a Wikipedia page is about.
>
> Each extractor writes the extracted triples into specific datasets. It is usually a 1-1 mapping, e.g. LabelExtractor -> label dataset, but some extractors that produce a lot of information may split the data into multiple datasets.
>
> Are you trying to write a new extractor, or to post-process the existing datasets to form a new dataset?

Yes, I want to post-process the dataset to generate new triples and either append these triples to an existing dataset or create a new dataset for the newly created triples. How can that be done using the extraction framework?

mubashar1199 commented 3 years ago

> This seems like a post-processing step. Check this out: http://dev.dbpedia.org/Post-Processing

OK, I will take a look.

JJ-Author commented 3 years ago

> Hi @mubashar1199 , typically an extractor is a Scala class that is run on every Wikipedia page and tries to extract specific information from that page. For example, the existing LabelExtractor extracts the page name, and the HomepageExtractor tries to detect the homepage of the person/organization that a Wikipedia page is about. Each extractor writes the extracted triples into specific datasets. It is usually a 1-1 mapping, e.g. LabelExtractor -> label dataset, but some extractors that produce a lot of information may split the data into multiple datasets. Are you trying to write a new extractor, or to post-process the existing datasets to form a new dataset?
>
> Yes, I want to post-process the dataset to generate new triples and either append these triples to an existing dataset or create a new dataset for the newly created triples. How can that be done using the extraction framework?

The approach is definitely to create a new "dataset" here. However, this post-processing does not necessarily have to be fully integrated into the extraction framework; it can also be derived from the Marvin extraction on the Databus: https://databus.dbpedia.org/marvin/mappings/mappingbased-objects-uncleaned/ Please tell us what triples you would like to generate and what tools you are going to use (and any other external data dependencies); then @Vehnem can help you with how and where to integrate.
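If the input is taken from the Marvin files on the Databus rather than from a local extraction run, the same line-oriented processing can be streamed straight from the download URL. A minimal sketch, again assuming Apache Commons Compress; the concrete .ttl.bz2 file link has to be looked up on the Databus artifact page linked above.

```scala
import java.io.BufferedInputStream
import java.net.URL
import scala.io.Source
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream

object DatabusStreamSketch {
  def main(args: Array[String]): Unit = {
    // Pass the concrete .ttl.bz2 download link from the Databus artifact page,
    // e.g. one of the files listed under
    // https://databus.dbpedia.org/marvin/mappings/mappingbased-objects-uncleaned/
    val fileUrl = args(0)

    val in = new BZip2CompressorInputStream(new BufferedInputStream(new URL(fileUrl).openStream()))
    try {
      val triples = Source.fromInputStream(in, "UTF-8").getLines()
      // place the actual rules/processing here; printing the first lines is just a smoke test
      triples.take(10).foreach(println)
    } finally in.close()
  }
}
```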

mubashar1199 commented 3 years ago

> Hi @mubashar1199 , typically an extractor is a Scala class that is run on every Wikipedia page and tries to extract specific information from that page. For example, the existing LabelExtractor extracts the page name, and the HomepageExtractor tries to detect the homepage of the person/organization that a Wikipedia page is about. Each extractor writes the extracted triples into specific datasets. It is usually a 1-1 mapping, e.g. LabelExtractor -> label dataset, but some extractors that produce a lot of information may split the data into multiple datasets. Are you trying to write a new extractor, or to post-process the existing datasets to form a new dataset?
>
> Yes, I want to post-process the dataset to generate new triples and either append these triples to an existing dataset or create a new dataset for the newly created triples. How can that be done using the extraction framework?
>
> The approach is definitely to create a new "dataset" here. However, this post-processing does not necessarily have to be fully integrated into the extraction framework; it can also be derived from the Marvin extraction on the Databus: https://databus.dbpedia.org/marvin/mappings/mappingbased-objects-uncleaned/ Please tell us what triples you would like to generate and what tools you are going to use (and any other external data dependencies); then @Vehnem can help you with how and where to integrate.

I want to use Wikipedia infobox properties and, based on some predefined rules, infer new information from those properties and append it to the already existing dataset. I want the results to appear in the public SPARQL endpoint. Please tell me how and where to integrate it. Thanks
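To make the "predefined rules" idea concrete: the mappingbased datasets already contain the infobox values as dbo: triples, so a rule can often be expressed as a pattern over those lines. Below is a hypothetical example rule (symmetry of dbo:spouse) in plain Scala; the rule itself is only an illustration of the shape such code can take, not a suggestion of what should be inferred.

```scala
object SpouseSymmetryRule {

  /** Very small N-Triples line splitter: subject, predicate, object (without the trailing " ."). */
  def split(line: String): Option[(String, String, String)] = {
    val firstSpace  = line.indexOf(' ')
    val secondSpace = line.indexOf(' ', firstSpace + 1)
    if (firstSpace < 0 || secondSpace < 0) None
    else Some((line.substring(0, firstSpace),
               line.substring(firstSpace + 1, secondSpace),
               line.substring(secondSpace + 1).trim.stripSuffix(".").trim))
  }

  val spouse = "<http://dbpedia.org/ontology/spouse>"

  /** Example rule: if ?x dbo:spouse ?y, then also assert ?y dbo:spouse ?x. */
  def apply(line: String): Option[String] =
    split(line) match {
      case Some((s, p, o)) if p == spouse && o.startsWith("<") => Some(s"$o $spouse $s .")
      case _ => None
    }
}
```

Plugged into either of the streaming sketches above, every input line that matches the pattern yields one new output line for the new dataset.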

kurzum commented 3 years ago

> This seems like a post-processing step. Check this out: http://dev.dbpedia.org/Post-Processing

@JJ-Author post-processing is pretty much the worst place to add anything. We discussed this a lot, and the plan is to implement post-processing via the Databus and thus remove it completely.

@mubashar1199 these are the insertion points for new data into DBpedia:

More info from Wikipedia

If you think there is info in Wikipedia that is not yet covered by the extraction:

  1. fix or slightly extend an existing extractor ("slightly" because major extensions might be better suited to a new extractor)
  2. write a new extractor; in this case you need to add a new dataset and write Scala code, as @jimkont explained (see the dataset sketch right after this list)
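On "add a new dataset": what this typically amounts to is declaring the dataset alongside the existing ones so the extractor can reference it. The snippet below is schematic only; the exact constructor arguments and any additional registration steps (e.g. in the dataset definitions used for a release) depend on the framework version, so the safest route is to copy how an existing dataset such as Labels is declared and wired up.

```scala
// Schematic only -- mirror an existing entry in DBpediaDatasets / Dataset.scala.
// Import the framework's Dataset class; its package path depends on the version.
object MyDatasets {
  // The dataset name typically becomes part of the output file name.
  val MyInferredFacts: Dataset = new Dataset("my-inferred-facts")
}

// Referenced from the new extractor:
// override val datasets = Set(MyDatasets.MyInferredFacts)
```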

Adding extensions based on the extracted data

Very similar to post-processing, i.e. you work on one of the extracted datasets such as the mappingbased extraction. In this case it is simple: you use the Databus to read it, process it, and write a new artifact on the Databus. We can then include it in the snapshot collection. Ideally, you wrap it into Docker (https://hub.docker.com/u/dbpedia) and then we could run it every three months. An example is LHD, which takes the abstracts and produces https://databus.dbpedia.org/propan/lhd, or SDTypes.

A note here: what kind of rules are you talking about? Mappings-based extraction is already a rule-based approach from dbr: to dbo:, so the rules might be covered in mappings.dbpedia.org already.