dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

citationIri concerns #452

Open VladimirAlexiev opened 8 years ago

VladimirAlexiev commented 8 years ago

I have a few concerns about citationIri. It's trying to make a URL for the citation from its properties:

  1. @jimkont please confirm that even though it's a for loop, it'll execute no more than once
  2. What if neither of the cases match? We still need a URL, so we must make a local node (see next)
  3. Since the citation may have local props (eg "pages"), it's not quite correct to use a global URL, unless it reflects all these local props. In such case we need to make a local node, which refers to the global URL (eg using dct:isPartOf)
  4. For ISBN and ISSN, how are we sure that they're available on GBooks?
  5. More cases should be added, eg if there's an "arxiv" id, then make an http://arxiv.org URL
  6. @nfreire: TEL has some 109M bibliographic records (adding 60M more), maybe we can use their URLs? How are they identified? BTW they use RDA, so that should be considered for https://github.com/dbpedia/mappings-tracker/issues/79
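The resolver selection described above (points 3–5) could be sketched roughly as follows. This is a hypothetical illustration, not the actual CitationExtractor code; the property names and resolver URLs are assumptions based on this discussion:

```scala
// Hypothetical sketch: pick a stable resolver URL from whichever
// well-known ID the citation carries. Ordering matters: IDs that
// identify an individual item (doi, arxiv, pmc) come before
// work-level IDs (isbn, issn).
object CitationIriSketch {
  def citationUrl(props: Map[String, String]): Option[String] =
    props.get("doi").map(d => s"https://doi.org/$d")
      .orElse(props.get("arxiv").map(a => s"https://arxiv.org/abs/$a"))
      .orElse(props.get("pmc").map(p => s"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC$p"))
      .orElse(props.get("isbn").map(i => s"https://books.google.com/books?vid=ISBN$i"))
      .orElse(props.get("issn").map(i => s"https://books.google.com/books?vid=ISSN$i"))
      // If nothing matches we get None, which is exactly the gap point 2 raises:
      // such citations still need a (local) node.
}
```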
jimkont commented 8 years ago

@jimkont please confirm that even though it's a for loop, it'll execute no more than once

yes, that for loop will not execute more than once

What if neither of the cases match? We still need a URL, so we must make a local node (see next)

Since the citation may have local props (eg "pages"), it's not quite correct to use a global URL, unless it reflects all these local props. In such case we need to make a local node, which refers to the global URL (eg using dct:isPartOf)

This needs some investigation

For ISBN and ISSN, how are we sure that they're available on GBooks?

I added this mostly to provide a stable ID

More cases should be added, eg if there's an "arxiv" id, then make an http://arxiv.org URL

sounds good :)

@nfreire: TEL has some 109M bibliographic records (adding 60M more), maybe we can use their URLs? How are they identified? BTW they use RDA, so that should be considered for dbpedia/mappings-tracker#79

not sure if that can be done directly at extraction time or with a post-processing step but we are open to all suggestions

jimkont commented 8 years ago

I improved citation IRI generation with most of the IDs I could find in the citation template documentation: https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/CitationExtractor.scala#L271

The problem now is what to do with citations that have no ID or URL. These are skipped for now, but I could create a UUID for them. What do you think?

VladimirAlexiev commented 8 years ago

Both the doi and the jstor case check the field "doi". That's not wrong in itself, since there are many DOI resolvers (see https://www.wikidata.org/wiki/Property:P356#P1630), but it makes the second case ineffectual, no?

Consider item 3 above. If a book or journal is cited in 1000 Wikipedia articles, each will use the same ISBN or ISSN, and you'll generate the same citationIri. But if each cites a different chapter or article, it will have a different title, pages, authors, etc. You'll emit all these statements against the same citationIri, thus jumbling them together.

Therefore all citations need their own URL, except those for which we can guarantee they cite individual items (arxiv, pmc, pubmed; a DOI can reference either a book or an article, so it is not individual). Then we link this "own" node to the book or article, eg using dct:isPartOf
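One way to realise this "own node plus dct:isPartOf link" pattern is sketched below. This is a hypothetical illustration, not DBpedia's actual scheme; the fragment-based IRI shape and the helper names are assumptions:

```scala
// Hypothetical sketch: each citation occurrence gets its own local node
// (scoped to the article it appears in), which is then linked to the
// shared work-level IRI via dct:isPartOf. Occurrence-level props
// (pages, chapter) attach to the local node; work-level props stay
// with the global IRI.
object LocalCitationNode {
  val DctIsPartOf = "http://purl.org/dc/terms/isPartOf"

  // articleIri: the page the citation appears on; index: its position there.
  def localIri(articleIri: String, index: Int): String =
    s"$articleIri#citation-$index"

  def isPartOfTriple(localNode: String, globalIri: String): (String, String, String) =
    (localNode, DctIsPartOf, globalIri)
}
```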

"Own URL" could mean:

jimkont commented 8 years ago

I kept the existing naming convention for now, but included a hash-based IRI for citations that have no ID. What you say makes sense, but it will be better handled when this is moved to the mappings wiki; otherwise it requires a lot of hardcoding
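A hash-based fallback IRI of the kind mentioned here could look roughly like this. This is a sketch under assumptions (the `/citation/hash/` namespace and MD5 choice are illustrative, not the actual implementation):

```scala
import java.security.MessageDigest

// Hypothetical sketch: citations with no recognised ID get a
// deterministic IRI derived from their sorted key/value pairs, so the
// same citation content always maps to the same node, unlike a random
// UUID, which would differ on every extraction run.
object HashCitationIri {
  def hashIri(props: Map[String, String]): String = {
    // Sort so that property order in the wikitext doesn't change the hash.
    val canonical = props.toSeq.sorted.map { case (k, v) => s"$k=$v" }.mkString("|")
    val digest = MessageDigest.getInstance("MD5").digest(canonical.getBytes("UTF-8"))
    val hex = digest.map("%02x".format(_)).mkString
    s"http://dbpedia.org/citation/hash/$hex" // namespace is an assumption
  }
}
```

Determinism is the main design point: a random UUID would mint a new node each run, breaking link stability across DBpedia releases.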