dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Keeping links title #189

Closed ghost closed 10 years ago

ghost commented 10 years ago

Hello,

Currently we lose the titles of all Wikipedia internal links. For example, for the internal link [[link name | Title]] we create the resource http://dbpedia.org/resource/link and lose the title, which is a bit annoying because it can sometimes be quite useful.

So I thought about how this could be designed, and I would just like to know what you think of it. Here is the idea: suppose we have this infobox for a page called "Iron Man (comics)":

{{Infobox Personnage (fiction)
| film                  = [[Iron Man (film)|Iron Man]]
}}

The corresponding RDF could be :

<http://dbpedia.org/resource/Iron_Man_(comics)> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/FictionalCharacter> .
<http://dbpedia.org/resource/Iron_Man_(comics)> <http://dbpedia.org/ontology/movie> <http://dbpedia.org/resource/Iron_Man_(film)> .
<http://dbpedia.org/resource/Iron_Man_(film)> <http://dbpedia.org/ontology/hasLinkTitle> _:title .
_:title <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/LinkTitle> .
_:title <http://www.w3.org/2000/01/rdf-schema#label> "Iron Man"@fr .
_:title <http://dbpedia.org/ontology/linkTitleFrom> <http://dbpedia.org/resource/Iron_Man_(film)> .

The blank node bothers me a bit, but I don't know how to give each title a distinct URI. In any case, what do you think of this idea?

Best.

Julien.

ghost commented 10 years ago

Any comments on this little feature? :-)

jimkont commented 10 years ago

What about :IronMan :hasLinkTitle "Iron Man (film)"? We will produce many duplicates, but that is easy to sort out.

Kontokostas Dimitris

mgns commented 10 years ago

If we start collecting the link titles, we should do it right :) i.e. some titles are more frequent than others, and that should be tracked too. So we would actually need to collect statistics. I guess this should be done in a dedicated extractor. The Spotlight people already did something similar; as far as I remember they use a Pignlproc script for this (it computes more stats than just link titles) [https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Indexing-with-Pignlproc-and-Hadoop].

jimkont commented 10 years ago

Unless we want to keep track of where the link came from, what I suggested works for statistics as well. In the end, the extractor output will keep all references and we can either remove duplicates or count duplicates. @mgns did you have something else in mind?

mgns commented 10 years ago

I just wanted to point out the difficulties and that similar work has already been done elsewhere. I don't actually know where this issue was raised, but data without statistics might be misleading, e.g. if you have one link in Wikipedia [[Iron Man (film) | cinema]] vs 100+ [[Iron Man (film) | Iron Man]]. According to the documentation, the Spotlight indexer outputs a file 'sf_group.pig' containing {SurfaceForm, {(URI),...}, count}.

jplu commented 10 years ago

Maybe one idea would be to do something similar but a bit less complicated. I mean, if we have:

[[Iron Man (film) | cinema]] in one page
[[Iron Man (film) | Iron Man]] in 100 pages

we would have a file which describes the result:

Iron Man (film)=>Iron Man : 100
Iron Man (film)=>cinema : 1

That is, a list sorted from largest to smallest, keeping the largest one. No?
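
A minimal Python sketch of this counting idea, assuming the internal links are available as raw wikitext. The regular expression and the function names `count_link_titles` and `most_frequent_title` are illustrative only and are not part of the extraction framework. Note that it counts link occurrences, which equals the number of pages only if each page uses the link once.

```python
import re
from collections import Counter, defaultdict

# Matches piped internal links such as [[Iron Man (film)|Iron Man]].
# Hypothetical simplification: ignores namespaces, anchors and nested markup.
PIPED_LINK = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]")


def count_link_titles(pages):
    """Count how often each title is used for each link target."""
    counts = defaultdict(Counter)
    for wikitext in pages:
        for target, title in PIPED_LINK.findall(wikitext):
            counts[target.strip()][title.strip()] += 1
    return counts


def most_frequent_title(counts, target):
    """Return the (title, count) pair with the highest count for a target."""
    return counts[target].most_common(1)[0]


if __name__ == "__main__":
    pages = ["[[Iron Man (film) | cinema]]"] + ["[[Iron Man (film) | Iron Man]]"] * 100
    counts = count_link_titles(pages)
    # Print titles from most to least frequent, in the format suggested above.
    for title, n in counts["Iron Man (film)"].most_common():
        print(f"Iron Man (film)=>{title} : {n}")
```

Keeping the whole `Counter` rather than only the top entry leaves the less frequent titles available for anyone who needs the full statistics.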

ghost commented 10 years ago

Any comments on that feature? Does my last idea seem good enough to test?

jimkont commented 10 years ago

This should be calculated with a post processing step. The extractors should remain stateless

Kontokostas Dimitris

ghost commented 10 years ago

OK, so during the "extraction" step but before starting the extraction itself? Or should we add a step after "import" and before "extraction"?

jimkont commented 10 years ago

No, after the extraction process. I meant processing the generated dumps to get these statistics.

If we extract something like the following from every page:

dbr:IronMan(film) dbo:isReferencedWithTitle "cinema" .
dbr:IronMan(film) dbo:isReferencedWithTitle "Iron Man" .

it will be easy to generate what you suggested out of the dump files.
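
As a rough illustration of such a post-processing step, the Python sketch below counts duplicate triples in a dump. The full property URI, the one-statement-per-line N-Triples layout and the function name `count_titles` are assumptions made for this example; `dbo:isReferencedWithTitle` is only the property name suggested in this thread, not an existing one.

```python
import re
from collections import Counter

# Assumed full form of the property proposed in this thread (hypothetical).
PROP = "http://dbpedia.org/ontology/isReferencedWithTitle"

# One N-Triples statement: <subject> <predicate> "literal"[@lang] .
# Simplification: escaped quotes inside literals are not handled.
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+"([^"]*)"(?:@[\w-]+)?\s*\.\s*$')


def count_titles(lines):
    """Count identical (resource, title) statements in extractor output."""
    counts = Counter()
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match and match.group(2) == PROP:
            counts[(match.group(1), match.group(3))] += 1
    return counts


if __name__ == "__main__":
    film = "http://dbpedia.org/resource/Iron_Man_(film)"
    sample = [
        f'<{film}> <{PROP}> "cinema" .',
        f'<{film}> <{PROP}> "Iron Man" .',
        f'<{film}> <{PROP}> "Iron Man" .',
    ]
    for (resource, title), n in count_titles(sample).most_common():
        print(resource, title, n)
```

Because the counting happens over the finished dump, the extractor itself only emits one triple per link and never needs to hold state across pages.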

ghost commented 10 years ago

OK, understood. Is it better to create a separate file with all these triples, or to include them in an already existing file?

jimkont commented 10 years ago

separate sounds better

VolhaBryl commented 10 years ago

Hi, is there any news on this issue? It would be nice to include the dataset in the new release in case the extractor is ready.

Do you know whether the Lexicalizations Dataset - http://wiki.dbpedia.org/Datasets/NLP - includes the info you want to extract?

jplu commented 10 years ago

Nope, I haven't finished implementing this feature yet, but I will update this thread once I have.

VladimirAlexiev commented 9 years ago

Thanks @jplu, I understand now. I'll document the property and close 44

Observations:

  1. wikiPageWikiLink are the outgoing links (without title). Do we want the link titles to be somehow related to this? I'd say no
  2. Do we want to know the source page of a link title? I'd say no, only the count.
  3. If we use class LinkTitle to store the title & count, then a good prop name is linkTitle
  4. Do we need an indicator which wikipedia the count came from? That's reflected only in the URL of the intermediate node. If you put several dbpedias in one repo and do the owl:sameAs smushing, you may want to know. Unfortunately there's no appropriate xsd:string property yet.
  5. I suggest to store the title without lang tag (xsd:string not rdf:langString) since titles are often international (eg Iron Man are not French words).

So something like this:

<http://fr.dbpedia.org/Iron_Man_(film)> dbo:linkTitle <http://fr.dbpedia.org/Iron_Man_(film)__1>.
<http://fr.dbpedia.org/Iron_Man_(film)__1> a dbo:LinkTitle;
  dbo:title "Iron Man"@fr;  # I'd prefer it without lang tag
  dbo:number 100;
  dct:source <http://fr.dbpedia.org>.
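
To make the shape of this model concrete, here is a small Python sketch that renders title counts as Turtle of the same form. The function name `link_title_turtle`, the input format and the choice to number the `__N` nodes by descending count are assumptions for illustration; `dbo:linkTitle`, `dbo:LinkTitle`, `dbo:title` and `dbo:number` are only the names proposed in this comment. Following point 5, the title is written without a lang tag, and prefix declarations are omitted as in the snippet above.

```python
def link_title_turtle(resource, title_counts, source):
    """Render one intermediate dbo:LinkTitle node per distinct title.

    resource: URI of the link target
    title_counts: dict mapping title -> number of links using that title
    source: URI of the DBpedia chapter the counts come from
    """
    lines = []
    # Number nodes by descending count so that __1 is the most frequent title.
    ranked = sorted(title_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for i, (title, count) in enumerate(ranked, start=1):
        node = f"{resource}__{i}"
        # Escape backslashes and double quotes for a Turtle string literal.
        escaped = title.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f"<{resource}> dbo:linkTitle <{node}>.")
        lines.append(f"<{node}> a dbo:LinkTitle;")
        lines.append(f'  dbo:title "{escaped}";')
        lines.append(f"  dbo:number {count};")
        lines.append(f"  dct:source <{source}>.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(link_title_turtle(
        "http://fr.dbpedia.org/Iron_Man_(film)",
        {"Iron Man": 100, "cinema": 1},
        "http://fr.dbpedia.org",
    ))
```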

jplu commented 9 years ago

  1. wikiPageWikiLink are the outgoing links (without title). Do we want the link titles to be somehow related to this? I'd say no

I would say the same.

  2. Do we want to know the source page of a link title? I'd say no, only the count.

Why not? I think it could be interesting to know this.

  3. If we use class LinkTitle to store the title & count, then a good prop name is linkTitle

Agree.

  4. Do we need an indicator which wikipedia the count came from?

Yes, but in each case the count will come from the Wikipedia we used to extract this information. For example if I process the English Wikipedia dump with the extraction framework, I know that the counts will come from the English Wikipedia.

  5. I suggest to store the title without lang tag (xsd:string not rdf:langString) since titles are often international (eg Iron Man are not French words).

Yes, but it can be the proper term used in French and only in France, not in other French-speaking countries. For example, in Montreal they love to translate all English movie titles, which we don't do in France. So I would keep the lang tag, but use a more precise one to distinguish French from France and French from Quebec.

<http://fr.dbpedia.org/Iron_Man_(film)> dbo:linkTitle <http://fr.dbpedia.org/Iron_Man_(film)__1>.
<http://fr.dbpedia.org/Iron_Man_(film)__1> a dbo:LinkTitle;
  dbo:title "Iron Man"@fr;  # I'd prefer it without lang tag
  dbo:number 100;
  dct:source <http://fr.dbpedia.org>.

I agree with this model, but I would keep the lang tag and maybe add a property listing all the pages this title comes from.

jimkont commented 9 years ago

For popular articles this will create a lot of redundant information and many unnecessary blank/intermediate nodes, e.g. most links would be "Iron Man". In this case I would prefer something like https://github.com/dbpedia/extraction-framework/issues/189#issuecomment-46680139. If that is loaded on an endpoint, duplicate titles would be deleted, but if someone wants counts or other stats they could process the dumps, and in quads we can get the source article as well.

VladimirAlexiev commented 9 years ago

Recent discussion "Top subjects, predicates and objects in DBpedia" on the mlist about external datasets with somewhat similar properties (pagerank, in/out link counts). Kingsley loaded them to dbpedia.org:

sparql select count(*) from 
<http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pagerank_scores_en_2014.ttl.bz2> 
where {?s ?p ?o};
5,544,757 triples .

sparql select count(*) from 
<http://dbpedia.semanticmultimedia.org/dbpedia2014/en/hits_scores_en_2014.ttl.bz2> 
where {?s ?p ?o};
5,544,757 triples .

sparql select count(*) from 
<http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pageinlinkCounts_en_2014.ttl.bz2> 
where {?s ?p ?o};
5,130,711 triples .

sparql select count(*) from 
<http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pageoutlinkCounts_en_2014.ttl.bz2> 
where {?s ?p ?o};
4,582,685 triples.

  1. http://dbpedia.org/c/9BXNZBHH -- Query Result
  2. http://dbpedia.org/c/9DH7LBEC -- Query Definition (dbo:wikiPageOutLinkCountCleaned)

VladimirAlexiev commented 7 years ago

The links above are broken (notified Kingsley on the mailing list). Tried to explore the structure of pageoutlinkCounts_en_2014

1.

select * {
graph <http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pageoutlinkCounts_en_2014.ttl.bz2> {
  ?s ?p ?o
}}
limit 100

-> Error SP031: SPARQL compiler: No one quad map pattern is suitable

2.

select * 
from <http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pageoutlinkCounts_en_2014.ttl.bz2> 
where {
  ?s ?p ?o
  filter (?p not in (rdf:type, rdfs:subPropertyOf))
} 
limit 100

0 results (after about 100s)

3.

select * {
?x dbo:wikiPageOutLinkCountCleaned ?y
} limit 100

This means that the data is not in DBpedia anymore.

  4. Oh well, we'll just have to look inside http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pageoutlinkCounts_en_2014.ttl.bz2 (46Mb)

    bunzip2 -c pageoutlinkCounts_en_2014.ttl.bz2 >pageoutlinkCounts_en_2014.ttl

    Cancel after a few seconds. All of them are like

    dbr:Changzhou_Ancient_Canal dbo:wikiPageOutLinkCountCleaned 13.
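
An alternative to cancelling the decompression by hand is to stream only the first few lines. The Python sketch below uses the standard `bz2` module for that; the function name `peek_bz2` is hypothetical, and it assumes the file has already been downloaded locally and is UTF-8 text.

```python
import bz2
from itertools import islice


def peek_bz2(path, n=5):
    """Return the first n text lines of a .bz2 file without unpacking it all."""
    # bz2.open in text mode decompresses lazily as lines are read.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


if __name__ == "__main__":
    # Assumes the dump sits in the current directory.
    for line in peek_bz2("pageoutlinkCounts_en_2014.ttl.bz2"):
        print(line)
```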