Closed: S6savahd closed this issue 8 years ago
For the example in Vol-6:
<http://ceur-ws.org/Vol-6/#CoeneniVA> a foaf:Person ;
foaf:made <http://ceur-ws.org/Vol-6/#paperX51>,
<http://ceur-ws.org/Vol-6/#paperX67> ;
foaf:name "Coenen VA" .
<http://ceur-ws.org/Vol-6/#CoeneniV> a foaf:Person ;
foaf:made <http://ceur-ws.org/Vol-6/#paperX66> ;
foaf:name "Coenen V" .
The current implementation can merge these into:
<http://ceur-ws.org/Vol-6/#CoeneniVA> a foaf:Person ;
foaf:made <http://ceur-ws.org/Vol-6/#paperX51>,
<http://ceur-ws.org/Vol-6/#paperX66>,
<http://ceur-ws.org/Vol-6/#paperX67> ;
foaf:name "Coenen V",
"Coenen VA" .
The name of the removed person is attached to the person that is kept, and the additional information for the two persons is merged.
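The merge described above can be sketched roughly as follows. This is only an illustration, not the code in serializer.py: the dictionary layout and the helper name `merge_persons` are assumptions made for this example.

```python
def merge_persons(kept, removed):
    """Merge `removed` into `kept`: union the papers and keep both names.

    Hypothetical helper; the record layout is an assumption for illustration.
    """
    return {
        "uri": kept["uri"],
        "made": sorted(set(kept["made"]) | set(removed["made"])),
        "names": sorted(set(kept["names"]) | set(removed["names"])),
    }


coenen_va = {
    "uri": "http://ceur-ws.org/Vol-6/#CoeneniVA",
    "made": [
        "http://ceur-ws.org/Vol-6/#paperX51",
        "http://ceur-ws.org/Vol-6/#paperX67",
    ],
    "names": ["Coenen VA"],
}
coenen_v = {
    "uri": "http://ceur-ws.org/Vol-6/#CoeneniV",
    "made": ["http://ceur-ws.org/Vol-6/#paperX66"],
    "names": ["Coenen V"],
}

merged = merge_persons(coenen_va, coenen_v)
print(merged["names"])  # ['Coenen V', 'Coenen VA']
```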
The only issue I see is that the current method for checking whether two user names are the same does not always produce a correct result. The method is in serializer.py.
What do you think, @S6savahd @clange?
What do you mean by "the current same user name checking method is not able to always produce correct result"?
The method calculates string similarity, which may sometimes output 'same person' for two different people.
This is not good. Can you give some examples?
@S6savahd, the example in Vol-6 could be one, as it is hard to tell whether "Coenen V" and "Coenen VA" are the same person.
This is different from what you've reported. In http://ceur-ws.org/Vol-6/ (WG013), Coenen V is the same person as Coenen VA. I assume text similarity finds them as 'same person', right?
Yakun said: "The method will calculate string similarity, which sometimes may output 'same person' when they are not." Is there any example where the names of two different persons come out as 'same person'?
@S6savahd
Not the same person, but the output is 'same person':
Michaelis M and Michaelis B, fuzz.WRatio('Michaelis M', 'Michaelis B') = 91
Same person, but the output is 'not same person':
F. Baader and Franz Baader, fuzz.WRatio('F.Baader', 'Franz Baader') = 86
Let's discuss this on Friday.
If you can find patterns, let's try to split the text into given name and family name, and also handle initials as a special case.
After this we should be able to estimate how many of them are wrong and try to fix them manually.
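One possible way to treat initials as a special case is sketched here. It is a hypothetical illustration, not something taken from serializer.py: the helpers `tokens_match` and `names_match` are made up for this example, and the rule "a single letter matches any token starting with that letter" is an assumption. Initials are inherently ambiguous (F. Baader would also match any other given name starting with F), so matches found this way would still need a manual check.

```python
def tokens_match(a, b):
    """Two name parts match if equal, or if one is an initial of the other."""
    a, b = a.rstrip(".").lower(), b.rstrip(".").lower()
    if a == b:
        return True
    short, long_ = sorted((a, b), key=len)
    # Assumption: a single letter counts as an initial.
    return len(short) == 1 and long_.startswith(short)


def names_match(name_a, name_b):
    """Compare whitespace-separated name parts pairwise, in order."""
    parts_a, parts_b = name_a.split(), name_b.split()
    return len(parts_a) == len(parts_b) and all(
        tokens_match(x, y) for x, y in zip(parts_a, parts_b)
    )


print(names_match("F. Baader", "Franz Baader"))    # True
print(names_match("Michaelis M", "Michaelis B"))   # False
print(names_match("Coenen V", "Coenen VA"))        # True
```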
@S6savahd I think the general pattern for a name is space-separated parts. I redefined the similarity check for user names as follows:
- Compare the similarity of the whole name strings. If it is not less than 90, return 'same person'; otherwise continue.
- Split each user name on whitespace. If the two names have the same number of parts, continue.
- Compare the parts of the two split results pairwise, in order. If every pair has similarity >= 89, assert that they are the same person.
A simple example is F. Baader and Franz Baader, where
- WRatio('F. Baader', 'Franz Baader') = 82, which is less than 90, so continue
- length(['F.', 'Baader']) == length(['Franz', 'Baader'])
- WRatio('F.', 'Franz') = 90, WRatio('Baader', 'Baader') = 100
Then we say F. Baader and Franz Baader are the same person. It is not easy to define a good threshold for similarity; 90 is currently used, and it is implemented in serializer.py.
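The three steps and the worked example can be sketched roughly like this. It is a minimal sketch, not the code in serializer.py. The function name `is_same_person`, its parameters, and the default scorer `ratio` (built on the standard library's `difflib` rather than `fuzz.WRatio`, so its scores will differ) are assumptions for illustration, as is returning "not the same person" when the part counts differ. The scores quoted in this thread are plugged in through a stub scorer so the example runs without fuzzywuzzy.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Stand-in scorer on a 0-100 scale (assumption: this is NOT fuzz.WRatio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def is_same_person(name_a, name_b, score=ratio,
                   whole_threshold=90, part_threshold=89):
    # Step 1: compare the whole name strings.
    if score(name_a, name_b) >= whole_threshold:
        return True
    # Step 2: split on whitespace; the part counts must agree.
    # Assumption: names with different part counts are treated as different.
    parts_a, parts_b = name_a.split(), name_b.split()
    if len(parts_a) != len(parts_b):
        return False
    # Step 3: every pair of parts, in order, must be similar enough.
    return all(score(x, y) >= part_threshold
               for x, y in zip(parts_a, parts_b))


# Scores quoted in the worked example, used as a stub scorer.
THREAD_SCORES = {
    ("F. Baader", "Franz Baader"): 82,
    ("F.", "Franz"): 90,
    ("Baader", "Baader"): 100,
}


def thread_score(a, b):
    return THREAD_SCORES[(a, b)]


print(is_same_person("F. Baader", "Franz Baader", score=thread_score))  # True
```

Note that step 1 on its own would still report Michaelis M and Michaelis B as the same person, since the score of 91 quoted earlier is above the whole-name threshold of 90.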
I added one more check at the beginning for the whole name, because if a user name is long and we compare only the separate parts, the overall comparison is influenced too much by the individual parts.
It works well, but since it is hard to achieve a 100% match, we think we are done with this issue. The rest remains for manual checks.
- in Vol-1: F. Baader and Franz Baader
- in Vol-1: A. Düsterhöft and A. Dusterhoft