ceurws / ToolsEswc

Tools and extensions for the SemPub2015 co-located with the Extended Semantic Web Conference 2015
GNU General Public License v3.0
merge different spellings of the same person within the context of one proceedings volume #3

Closed S6savahd closed 8 years ago

S6savahd commented 8 years ago

- in vol-1: F. Baader and Franz Baader
- in vol-1: A. Düsterhöft and A. Dusterhoft

liyakun commented 8 years ago

For the example in Vol-6

<http://ceur-ws.org/Vol-6/#CoeneniVA> a foaf:Person ;
    foaf:made <http://ceur-ws.org/Vol-6/#paperX51>,
        <http://ceur-ws.org/Vol-6/#paperX67> ;
    foaf:name "Coenen VA" .

<http://ceur-ws.org/Vol-6/#CoeneniV> a foaf:Person ;
    foaf:made <http://ceur-ws.org/Vol-6/#paperX66> ;
    foaf:name "Coenen V" .

The current implementation merges them into:

<http://ceur-ws.org/Vol-6/#CoeneniVA> a foaf:Person ;
    foaf:made <http://ceur-ws.org/Vol-6/#paperX51>,
        <http://ceur-ws.org/Vol-6/#paperX66>,
        <http://ceur-ws.org/Vol-6/#paperX67> ;
    foaf:name "Coenen V",
        "Coenen VA" .

The removed name is attached to the name that is kept, and the additional information for the two persons is merged.
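The merge described above can be sketched with plain dictionaries. This is only an illustration, not the code in serializer.py: the record layout (`uri`, `made`, `names`) and the function name `merge_persons` are assumptions made for the example.

```python
def merge_persons(kept, removed):
    """Merge `removed` into `kept` (hypothetical record layout).

    The URI of `kept` survives; papers and names are the union of both records.
    """
    return {
        "uri": kept["uri"],
        "made": sorted(set(kept["made"]) | set(removed["made"])),
        "names": sorted(set(kept["names"]) | set(removed["names"])),
    }


base = "http://ceur-ws.org/Vol-6/#"
coenen_va = {
    "uri": base + "CoeneniVA",
    "made": [base + "paperX51", base + "paperX67"],
    "names": ["Coenen VA"],
}
coenen_v = {
    "uri": base + "CoeneniV",
    "made": [base + "paperX66"],
    "names": ["Coenen V"],
}

merged = merge_persons(coenen_va, coenen_v)
print(merged)
```

Using sets keeps a paper or a name from appearing twice when both records already share it.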

My only concern is that the current method for checking whether two names refer to the same person does not always produce the correct result; the method is in serializer.py.

What do you think, @S6savahd @clange?

S6savahd commented 8 years ago

What do you mean by "the current same user name checking method is not able to always produce correct result"?

liyakun commented 8 years ago

The method calculates string similarity, which may sometimes output 'same person' for names that belong to different people.

S6savahd commented 8 years ago

This is not good. Can you give some examples?

liyakun commented 8 years ago

@S6savahd, the example in Vol-6 could be one, as it is hard to tell whether "Coenen V" and "Coenen VA" are the same person.

S6savahd commented 8 years ago

This is different from what you reported. In http://ceur-ws.org/Vol-6/ (WG013), Coenen V is the same person as Coenen VA. I assume text similarity identifies them as 'same person', right?

Yakun said: "The method will calculate string similarity, which sometimes may output 'same person' when they are not." Is there any example where the names of two different persons are reported as 'same person'?

liyakun commented 8 years ago

@S6savahd Not the same person, but the output is 'same person':

Michaelis M and Michaelis B, fuzz.WRatio('Michaelis M', 'Michaelis B') = 91

Same person, but the output is 'not same person':

F. Baader and Franz Baader, fuzz.WRatio('F.Baader', 'Franz Baader') = 86

S6savahd commented 8 years ago

let's discuss this on Friday

S6savahd commented 8 years ago

If you can find patterns, let's try to split the text into given name and family name, and also handle initials as a special case.

After this we should be able to estimate how many of them are wrong and try to fix them manually.
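One possible way to treat initials as a special case is sketched below. It is not the project's implementation: the helper names are made up for the example, and it assumes names are written in "Given Family" order (as in "F. Baader"), so the "Coenen VA" style with the family name first is not covered. Diacritics are folded so that "Düsterhöft" and "Dusterhoft" compare equal.

```python
import unicodedata


def fold(token):
    """Lower-case a token, strip diacritics and any trailing dot."""
    decomposed = unicodedata.normalize("NFKD", token)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    return plain.lower().rstrip(".")


def given_names_match(a, b):
    """An initial matches any given name starting with the same letter."""
    a, b = fold(a), fold(b)
    if not a or not b:
        return False
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def same_person(name_a, name_b):
    """Compare two names, assuming the last token is the family name."""
    parts_a, parts_b = name_a.split(), name_b.split()
    if len(parts_a) != len(parts_b) or len(parts_a) < 2:
        return False
    *given_a, family_a = parts_a
    *given_b, family_b = parts_b
    if fold(family_a) != fold(family_b):
        return False
    return all(given_names_match(x, y) for x, y in zip(given_a, given_b))


print(same_person("F. Baader", "Franz Baader"))
print(same_person("A. Düsterhöft", "A. Dusterhoft"))
print(same_person("M. Michaelis", "B. Michaelis"))
```

A rule like this is stricter than a similarity score: two different initials never match, but two people who share an initial and a family name would still be merged, so manual checking remains necessary.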

liyakun commented 8 years ago

@S6savahd I think the general pattern for a name is space-separated parts. I redefined the similarity check for names as follows:

  1. Compare the similarity of the whole name strings; if it is not less than 90, return 'same person', otherwise continue.
  2. Split each name on whitespace; if the two names have the same number of parts, continue.
  3. Compare the parts of the two split results pairwise, in order; if every pair has similarity >= 89, assert that they are the same person.

A simple example is F. Baader and Franz Baader, where

  1. WRatio('F. Baader', 'Franz Baader') = 82, which is less than 90, continue
  2. length(['F.', 'Baader']) == length(['Franz', 'Baader'])
  3. WRatio('F.', 'Franz') = 90, WRatio('Baader', 'Baader') = 100

So we say F. Baader and Franz Baader are the same person. It is not easy to define a good similarity threshold; 90 is currently used, and it is implemented in serializer.py.

I added one more check at the beginning, on the whole name, because if a name is long and we compare only the separate parts, the overall comparison is influenced too much by each individual part.
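The three steps above can be sketched roughly as follows. This is not the code from serializer.py. The scorer is pluggable: the default uses the standard library's `difflib` as a stand-in, whose scores differ from `fuzz.WRatio`, so the thresholds do not transfer directly. Returning "not the same person" when the number of parts differs is an assumption, since the steps do not say what happens in that case. The function and parameter names are hypothetical.

```python
from difflib import SequenceMatcher


def difflib_score(a, b):
    """Stand-in scorer on a 0-100 scale (not equivalent to fuzz.WRatio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def is_same_person(name_a, name_b, scorer=difflib_score,
                   whole_threshold=90, part_threshold=89):
    # Step 1: compare the whole name strings.
    if scorer(name_a, name_b) >= whole_threshold:
        return True
    # Step 2: split on whitespace and require the same number of parts.
    parts_a, parts_b = name_a.split(), name_b.split()
    if len(parts_a) != len(parts_b):
        return False  # assumption: the thread does not cover this case
    # Step 3: every pair of parts must be similar enough.
    return all(scorer(x, y) >= part_threshold
               for x, y in zip(parts_a, parts_b))


# Scores reported in the thread for fuzz.WRatio, replayed through a lookup
# table so the example runs without fuzzywuzzy installed.
reported = {
    ("F. Baader", "Franz Baader"): 82,
    ("F.", "Franz"): 90,
    ("Baader", "Baader"): 100,
}


def reported_score(a, b):
    return reported[(a, b)]


print(is_same_person("F. Baader", "Franz Baader", scorer=reported_score))
# The whole-string check alone still accepts names that differ only in an initial.
print(is_same_person("Michaelis M", "Michaelis B"))
```

If fuzzywuzzy is installed, `fuzz.WRatio` can be passed as `scorer` instead. The second call shows why the remaining cases are left for manual checks: names that differ only in a trailing initial already pass step 1.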

S6savahd commented 8 years ago

It works well, but since it is hard to achieve a 100% match, we consider this issue done. The rest is left for manual checks.