Any comment on that little feature ? :-)
what about :IronMan :hasLinkTitle "Iron Man (film)" ? we will produce many duplicates but it is easy to sort this out
Kontokostas Dimitris
If we start collecting the link titles, we should do it right :) i.e. some titles are more frequent than others, and that should be tracked too. So we would actually need to collect statistics. I guess this should be done in a dedicated extractor. The Spotlight people already did something similar; as far as I remember, they use a Pignlproc script for this (it computes more stats than just link titles) [https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Indexing-with-Pignlproc-and-Hadoop].
Unless we want to keep track of where the link came from, what I suggested works for statistics as well. In the end, the extractor output will keep all references and we can either remove duplicates or count duplicates. @mgns, did you have something else in mind?
I just wanted to point out the difficulties and that similar work has already been done elsewhere. Actually I don't know where this issue was raised, but data without statistics might be misleading, e.g. if you have one link in Wikipedia [[Iron Man (film) | cinema]] vs 100+ [[Iron Man (film) | Iron Man]]. According to the documentation, the Spotlight indexer outputs a file 'sf_group.pig' containing {SurfaceForm, {(URI),...}, count}.
Maybe an idea would be to do something similar but a bit less complicated. I mean, if we have:
[[Iron Man (film) | cinema]] in one page
[[Iron Man (film) | Iron Man]] in 100 pages
we would have a file which describes the result:
Iron Man (film)=>Iron Man : 100
Iron Man (film)=>cinema : 1
That is, a list sorted from largest to smallest, keeping the largest one. No?
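A minimal sketch of the counting idea above, in Python. The regular expression and the names `count_link_titles` and `ranked_titles` are assumptions made for illustration; they are not part of the extraction framework, and the pattern only handles simple `[[target | label]]` links. Each (target, label) pair is counted once per page, matching the "in 100 pages" wording.

```python
import re
from collections import Counter

# Hypothetical pattern: matches [[target | label]]; links without a label are skipped.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]")

def count_link_titles(pages):
    """Count, for each (target, label) pair, the number of pages that use it."""
    counts = Counter()
    for text in pages:
        # A set makes sure a pair is counted at most once per page.
        pairs = {(t.strip(), l.strip()) for t, l in WIKILINK.findall(text)}
        counts.update(pairs)
    return counts

def ranked_titles(counts, target):
    """Return (label, count) tuples for one target, largest count first."""
    rows = [(label, n) for (t, label), n in counts.items() if t == target]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Toy input reproducing the example from the thread.
pages = ["[[Iron Man (film) | Iron Man]]"] * 100 + ["[[Iron Man (film) | cinema]]"]
counts = count_link_titles(pages)
for label, n in ranked_titles(counts, "Iron Man (film)"):
    print(f"Iron Man (film)=>{label} : {n}")
```

Keeping only the first row of `ranked_titles` would give the most frequent title, as suggested above.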
Any comments on that feature? Does my last idea seem good enough to test?
This should be calculated in a post-processing step. The extractors should remain stateless.
On Thu, Jun 19, 2014 at 12:24 PM, Julien Plu notifications@github.com wrote:
Any comments on that feature? Does my last idea seem good enough to test?
Kontokostas Dimitris
OK, so during the "extraction" step, but before the extraction starts? Or should we add a step before "extraction" and after "import"?
no, after the extraction process, I meant to process the generated dumps to get these statistics
if we extract something like the following from every page
dbr:IronMan(film) dbo:isReferencedWithTitle "cinema" .
dbr:IronMan(film) dbo:isReferencedWithTitle "Iron Man" .
it will be easy to generate what you suggested out of the dump files
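As a rough sketch of such a post-processing step, the Python below counts titles from dump lines. It assumes one triple per line, a plain or language-tagged literal as object, and the predicate `dbo:isReferencedWithTitle` suggested above; the function name `title_counts` and the line pattern are hypothetical, and a real dump would need a proper RDF parser.

```python
import re
from collections import Counter

# Assumed line shape: <subject> <predicate> "literal"[@lang] .
TRIPLE = re.compile(
    r'^(\S+)\s+(\S+)\s+"((?:[^"\\]|\\.)*)"(?:@[A-Za-z0-9-]+)?\s*\.\s*$'
)

def title_counts(lines, predicate="dbo:isReferencedWithTitle"):
    """Count how often each (subject, title) pair occurs in the dump lines."""
    counts = Counter()
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match and match.group(2) == predicate:
            counts[(match.group(1), match.group(3))] += 1
    return counts

# Toy dump: one "cinema" link and one hundred "Iron Man" links.
dump = ['dbr:IronMan(film) dbo:isReferencedWithTitle "cinema" .']
dump += ['dbr:IronMan(film) dbo:isReferencedWithTitle "Iron Man" .'] * 100

counts = title_counts(dump)
for (subject, title), n in counts.most_common():
    print(f"{subject}=>{title} : {n}")
```

Because the counting happens entirely over the generated dump, the extractor itself stays stateless and only emits one triple per link.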
OK, understood. Is it better to create a separate file with all these triples, or to include these triples in an already existing file?
separate sounds better
Hi, is there any news about this issue? It would be nice to include the dataset in the new release in case the extractor is ready.
Do you know whether the Lexicalizations Dataset - http://wiki.dbpedia.org/Datasets/NLP - includes the info you want to extract?
No, I haven't finished implementing this feature yet, but I will update this thread once I finish.
Thanks @jplu, I understand now. I'll document the property and close 44
Observations:
So something like this:
<http://fr.dbpedia.org/Iron_Man_(film)> dbo:linkTitle <http://fr.dbpedia.org/Iron_Man_(film)__1>.
<http://fr.dbpedia.org/Iron_Man_(film)__1> a dbo:LinkTitle;
dbo:title "Iron Man"@fr; # I'd prefer it without lang tag
dbo:number 100;
dct:source <http://fr.dbpedia.org>.
- wikiPageWikiLink are the outgoing links (without title). Do we want the link titles to be somehow related to this? I'd say no
I would say the same.
- Do we want to know the source page of a link title? I'd say no, only the count.
Why not ? I think it can be interesting to know this.
- If we use class LinkTitle to store the title & count, then a good prop name is linkTitle
Agreed.
- Do we need an indicator which wikipedia the count came from?
Yes, but in each case the count will come from the Wikipedia we used to extract this information. For example if I process the English Wikipedia dump with the extraction framework, I know that the counts will come from the English Wikipedia.
- I suggest storing the title without lang tag (xsd:string, not rdf:langString) since titles are often international (e.g. "Iron Man" is not French).
Yes, but it can be the proper term used in French and only in France, not in the other French-speaking countries. For example, in Montreal they love to translate all the English movie titles, which is something we don't do in France. So I would keep this lang tag, but with a more precise one to distinguish French from France and French from Quebec.
<http://fr.dbpedia.org/Iron_Man_(film)> dbo:linkTitle <http://fr.dbpedia.org/Iron_Man_(film)__1>.
<http://fr.dbpedia.org/Iron_Man_(film)__1> a dbo:LinkTitle;
dbo:title "Iron Man"@fr; # I'd prefer it without lang tag
dbo:number 100;
dct:source <http://fr.dbpedia.org>.
I agree with this model, but I would keep the lang tag and maybe add a property to get a list of all the pages this title comes from.
For popular articles this will create a lot of redundant information and many unnecessary blank/intermediate nodes, e.g. most links would be "Iron Man". In this case I would prefer something like https://github.com/dbpedia/extraction-framework/issues/189#issuecomment-46680139. If that is loaded on an endpoint, duplicate titles would be removed, but anyone who wants counts or other stats could process the dumps, and with quads we can get the source article as well.
There was a recent discussion, "Top subjects, predicates and objects in DBpedia", on the mailing list about external datasets with somewhat similar properties (pagerank, in/out link counts). Kingsley loaded them to dbpedia.org:
sparql select count(*) from
<http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pagerank_scores_en_2014.ttl.bz2>
where {?s ?p ?o};
5,544,757 triples .
sparql select count(*) from
<http://dbpedia.semanticmultimedia.org/dbpedia2014/en/hits_scores_en_2014.ttl.bz2>
where {?s ?p ?o};
5,544,757 triples .
sparql select count(*) from
<http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pageinlinkCounts_en_2014.ttl.bz2>
where {?s ?p ?o};
5,130,711 triples .
sparql select count(*) from
<http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pageoutlinkCounts_en_2014.ttl.bz2>
where {?s ?p ?o};
4,582,685 triples.
The links above are broken (notified Kingsley on the mailing list). Tried to explore the structure of pageoutlinkCounts_en_2014
1.
select * {
graph <http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pageoutlinkCounts_en_2014.ttl.bz2> {
?s ?p ?o
}}
limit 100
-> Error SP031: SPARQL compiler: No one quad map pattern is suitable
2.
select *
from <http://dbpedia.semanticmultimedia.org/dbpedia2014/en/pageoutlinkCounts_en_2014.ttl.bz2>
where {
?s ?p ?o
filter (?p not in (rdf:type, rdfs:subPropertyOf))
}
limit 100
0 results (after about 100s)
3.
select * {
?x dbo:wikiPageOutLinkCountCleaned ?y
} limit 100
Means that this data is not in DBpedia anymore.
bunzip2 -c pageoutlinkCounts_en_2014.ttl.bz2 >pageoutlinkCounts_en_2014.ttl
Cancelled after a few seconds. All of the triples look like
dbr:Changzhou_Ancient_Canal dbo:wikiPageOutLinkCountCleaned 13.
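Instead of starting a full decompression and cancelling it, the first lines of such a dump can be read as a stream. The Python sketch below uses the standard `bz2` module; the helper name `head_bz2` is hypothetical, and the demo writes a tiny sample file rather than using the real dump.

```python
import bz2
import os
import tempfile
from itertools import islice

def head_bz2(path, n=5):
    """Return the first n lines of a bz2-compressed text file without unpacking it all."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]

# Demo on a small stand-in file that repeats the triple shown above.
sample = "dbr:Changzhou_Ancient_Canal dbo:wikiPageOutLinkCountCleaned 13.\n" * 3
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.ttl.bz2")
    with bz2.open(path, mode="wt", encoding="utf-8") as out:
        out.write(sample)
    first_lines = head_bz2(path, n=2)

print(first_lines)
```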
Hello,
Currently we lose all the titles of Wikipedia internal links. For example, for this internal link
[[link name | Title]]
we create the resource http://dbpedia.org/resource/link
and we lose the title, which is a bit annoying because sometimes it can be pretty useful. So I thought about how to design this, and I would just like to know what you think about it. Here is what I thought: suppose we have this infobox for a page called "IronMan(comics)":
The corresponding RDF could be :
The blank node bothers me a bit, but I don't know how to give each title a different URI. What do you think about this idea?
Best.
Julien.