fusepool / fusepool-dlc-patents

This project provides a bundle with a service to transform patents in Marec XML format to RDF
Apache License 2.0

An inventor present in different formats in the XML should appear only once in RDF #8

Open retog opened 10 years ago

retog commented 10 years ago

This XML

<inventor status="new" format="epo">
  <addressbook>
    <name>KOSMAN WILHELMUS JACOBUS MARIA</name>
    <address>
      <country>NL</country>
    </address>
  </addressbook>
</inventor>
<inventor status="new" format="intermediate">
  <addressbook>
    <name>KOSMAN, WILHELMUS JACOBUS MARIA</name>
  </addressbook>
</inventor>
<inventor status="new" format="original">
  <addressbook>
    <last-name>KOSMAN, WILHELMUS JACOBUS MARIA</last-name>
    <address>
      <street>Bredeweg 9</street>
      <city>6562 DA Groesbeek</city>
      <country>NL</country>
    </address>
  </addressbook>
</inventor>

This currently results in 3 person resources and 2 address resources in the RDF; there should be only 1 person with 1 address.

@csarven, could you write a patch for this?

csarven commented 10 years ago

I don't know if you received the memo or not, but transformation is not the place where we do reconciliation or disambiguation for something like this. Check the rest of the pipeline where it may be appropriate.

retog commented 10 years ago

I think some semantic information that is relevant to establishing identity gets lost: namely, that the entries appear one after the other in the XML and have different format identifiers.

Please add links when you refer to something like "the memo".

csarven commented 10 years ago

If I understand you correctly, this might address your issue:

It was decided over a year ago not to do comparison of entities at the transformation level, regardless of the degree of their "similarity" based on their visual appearance. Unfortunately, there is no semantic information in the documents or documentation which indicates or suggests that simple normalization of the strings is a reliable way of ensuring their semantic similarity.

In addition to the above, the syntactic ordering of elements in the documents, whatever the reason for it, is not reliable, because XML and XSLT processors walk the tree in different ways and for different reasons.

If you want to take true semantic information into account in your quest to disambiguate or interlink entities, then you should make use of some of the relations that are already created, e.g., x inventor y, y inventorOf x, and/or other data about the entities, after the transformation phase. That is to say, you might want to first disambiguate the entities in the same document based on measures that are meaningful to you, then do it against the rest of the entities.
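A rough sketch of that first, per-document step, under stated assumptions: the transformed data is represented here as plain (subject, predicate, object) tuples, the predicate names `inventor` and `label` are placeholders rather than the actual vocabulary terms, and `candidate_groups` is a hypothetical helper. The comparison measure is passed in by the caller, since which measure is meaningful is exactly the open question.

```python
from collections import defaultdict


def candidate_groups(triples, key):
    """Group the inventors of each patent whose labels share the same key.

    `triples` is an iterable of (subject, predicate, object) tuples.
    `key` is a caller-supplied function mapping a label to a comparison key.
    Returns {patent: [[person, ...], ...]} with only groups of 2+ persons.
    """
    labels = defaultdict(set)      # person -> set of labels
    inventors = defaultdict(list)  # patent -> list of persons
    for subject, predicate, obj in triples:
        if predicate == "label":        # placeholder predicate name
            labels[subject].add(obj)
        elif predicate == "inventor":   # placeholder predicate name
            inventors[subject].append(obj)

    groups = {}
    for patent, persons in inventors.items():
        buckets = defaultdict(set)
        for person in persons:
            for label in labels.get(person, ()):
                buckets[key(label)].add(person)
        # Only persons attached to the same patent are ever compared.
        groups[patent] = [sorted(b) for b in buckets.values() if len(b) > 1]
    return groups


def simple_key(label):
    # One possible measure: ignore commas, case, and repeated whitespace.
    return " ".join(label.replace(",", " ").upper().split())
```

This only proposes candidates within one document; comparing against the rest of the entities would be a separate, later step.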

retog commented 10 years ago

Well, it's up to the format to define whether the ordering is relevant or not. XML and XSLT tools are usually quite capable of dealing with XML where the order is relevant. Where is this XML format specified?

csarven commented 10 years ago

Well, going by your earlier comment, you are the one who thinks the ordering matters (which it may well do in the end), but I suggest that you investigate that before looking for a solution. The burden of proof lies on you, no?

IIRC, order is not mentioned in IREC patent-document.dtd and I can't recall reading about that in relevant documentation.

In any case, ordering still doesn't imply anything more than the entries' sequence and relationship. Again, what you raise is not for the transformer to solve. Even if there is a hacked-up solution, reconciliation still has to happen at a later phase. By bringing a partial reconciliation phase here, you are increasing the complexity of the transformation immensely.

retog commented 10 years ago

Again, where is that XML format documented? All I can do is guess from the labels of the elements and attributes, which is not really satisfactory.

csarven commented 10 years ago

Have you tried $ find . -name patent-document.dtd or anything like that?

retog commented 10 years ago

So all the documentation there is is some DTD of unknown provenance that used to be in the git repo? Don't you think we are missing something? With http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd you can validate XHTML but not build a browser; there is quite some documentation for that. Are we just guessing what the elements describe?

csarven commented 10 years ago

https://github.com/fusepool/patents-reengineering/blob/master/src/main/resources/dtd/patent-document.dtd doesn't look like it was carried over to fusepool-dlc-patents.

If you take a look at an example MAREC XML, you'll notice a typical DOCTYPE line:

<!DOCTYPE patent-document PUBLIC "-//MXW//DTD patent-document XML//EN" "http://www.ir-facility.org/dtds/patents/v1.4/patent-document.dtd">

Unfortunately ir-facility.org is not responding at the moment, but that's where I grabbed it from. The patent-document.dtd also contains some information. You can take a look at the section "Relevant specifications/DTDs/Schemas" as well and follow through.

The data mapping was originally based on the DTD from ir-facility.org, which provided the MAREC corpus. While the XSL templates more or less work on patents from EPO, USPTO, JP, and WIPO, they are not bullet-proof. Full-scale tests were not conducted, even on MAREC data, for a number of reasons: e.g., the transformation could not be run on all of the MAREC data (there were server issues), SPARQL queries could not be run over all of the transformed MAREC data to catch obvious issues (this is still not possible AFAIK), and so on. The majority of the work on the templates was done about a year ago. Here we are. Can the magnificent Fusepool platform be of any help here? I mean any. Let me know. It would be great to have SPARQL 1.1 (which I also asked for over a year ago and was promised) and all of the MAREC data loaded in some temporary location so I can investigate further and update the templates as needed.

patent-document.dtd declares possible formats: original|standard|epo|uspto|intermediate.

There is a MAREC 1.0 User Guide which talks about the epo, intermediate, and original inventor formats. They could be normalized, with some special treatment for original. The document does not cover how the uspto and standard formats are treated for inventors. I can't find that document anywhere on the Web at the moment. I don't know if it is still valid either.

Assuming that chaos will not ensue from implementing a normalization process, we could treat the 2 or 3 inventor entries in the epo, intermediate, and original formats as a single entity. How that impacts the other formats is not certain, but if left alone, they would be treated as they are now, without going through any normalization.
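A minimal sketch of what such a normalization could look like, written in Python rather than XSLT purely for illustration. The `normalize` key (uppercase, punctuation dropped, whitespace collapsed), the `MERGEABLE` set, and the wrapping `<inventors>` root element are assumptions made for this example; none of them come from the DTD or the existing templates.

```python
import re
import xml.etree.ElementTree as ET

# Formats assumed safe to merge; all other formats are left untouched.
MERGEABLE = {"epo", "intermediate", "original"}

# The fragment from this issue, wrapped in an assumed root element.
SAMPLE = """<inventors>
<inventor status="new" format="epo"><addressbook>
  <name>KOSMAN WILHELMUS JACOBUS MARIA</name>
  <address><country>NL</country></address>
</addressbook></inventor>
<inventor status="new" format="intermediate"><addressbook>
  <name>KOSMAN, WILHELMUS JACOBUS MARIA</name>
</addressbook></inventor>
<inventor status="new" format="original"><addressbook>
  <last-name>KOSMAN, WILHELMUS JACOBUS MARIA</last-name>
  <address><street>Bredeweg 9</street><city>6562 DA Groesbeek</city>
  <country>NL</country></address>
</addressbook></inventor>
</inventors>"""


def normalize(name):
    """Assumed comparison key: no punctuation, upper case, single spaces."""
    return " ".join(re.sub(r"[^\w\s]", " ", name).upper().split())


def merge_inventors(root):
    """Return one entity per inventor, merging entries in MERGEABLE formats."""
    entities, by_key = [], {}
    for inventor in root.iter("inventor"):
        book = inventor.find("addressbook")
        if book is None:
            continue
        # The sample carries the name in <name> or in <last-name>.
        names = [c.text.strip() for c in book
                 if c.tag in ("name", "last-name") and c.text]
        if not names:
            continue
        fmt = inventor.get("format")
        key = normalize(names[0]) if fmt in MERGEABLE else None
        entity = by_key.get(key) if key is not None else None
        if entity is None:
            entity = {"labels": [], "formats": [], "address": {}}
            entities.append(entity)
            if key is not None:
                by_key[key] = entity
        if names[0] not in entity["labels"]:
            entity["labels"].append(names[0])  # keep every distinct label
        entity["formats"].append(fmt)
        address = book.find("address")
        for field in (address if address is not None else []):
            # The first value seen for an address field wins.
            entity["address"].setdefault(field.tag, (field.text or "").strip())
    return entities


if __name__ == "__main__":
    for entity in merge_inventors(ET.fromstring(SAMPLE)):
        print(entity)
```

On the fragment from this issue, the sketch produces a single entity that keeps both distinct spellings of the name as labels and one merged address.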

For the epo, intermediate, and original formats, inventors will probably end up with all of the labels in the output (as it is not certain which one is more reliable or present). That's going to be ugly since we already have both foaf:whateverIsAppropriate plus rdfs:label.

This is all so that:

  1. Reconciliation is off-loaded to the transformation phase
  2. You don't have to hack up any new code in Java (I'm sure there is an excellent reason for that too)

Is there anything you would like me to investigate that you were not able to dig up yourself before opening up this issue?