YAMJ / yamj-v3

Main Project for YAMJ v3

Add a job to detect suspected duplicates (doublets) and a screen / API to resolve them #274

Open jluc2808 opened 8 years ago

jluc2808 commented 8 years ago

I always find duplicates in the person database, and even with manual cleaning I regularly resolve around 20 duplicates per week. I understand that duplicates are created for many reasons. After a long search, I have found a way to spot suspected duplicates: the only field that gives a reliable hint is the name field in the person database, which aggregates last name and first name (unlike the identifier, which is unique by construction). This wouldn't catch every case, but on my database (26000 persons) it finds around 95% of the suspected duplicates.

For example, take these 2 entries in the person database:

'28964', '2016-01-15 07:45:07', '3', '2016-03-05 19:00:45', 'Stéphanie Pillonca est une réalisatrice et une actrice française.', NULL, 'Stéphanie Pillonca-Kervern', NULL, NULL, NULL, 'DONE', 'Stéphanie', 'Stephanie Pillonca-Kervern', 'Pillonca-Kervern', '2016-01-15 07:48:54', 'Stéphanie Pillonca-Kervern', '0', 'DONE', NULL

'30669', '2016-03-05 19:00:46', '2', '2016-03-05 19:05:05', 'Stéphanie Pillonca est une réalisatrice et une actrice française.', NULL, 'Stéphanie Pillonca-Kervern', NULL, NULL, NULL, 'DONE', 'Stéphanie', 'Stephanie Pillonca', 'Pillonca-Kervern', '2016-03-05 19:04:14', 'Stéphanie Pillonca-Kervern', '0', 'DONE', NULL

The identifier fields are respectively 'Stephanie Pillonca-Kervern' and 'Stephanie Pillonca'; the name field is 'Stéphanie Pillonca-Kervern' in both entries.

So my proposal is to add a task that builds a list of suspected duplicates, plus an API or screen that lets the end user resolve them by applying the doublet API to each suspected duplicate.
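A minimal sketch of the detection step proposed above, written in Python for brevity (YAMJ itself is Java). It assumes the person rows have already been loaded as (id, identifier, name) values; `Person` and `find_suspected_duplicates` are hypothetical names for illustration, not part of YAMJ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    id: int
    identifier: str  # unique by construction
    name: str        # aggregated first and last name


def find_suspected_duplicates(persons):
    """Group persons by exact name and keep only names shared by several rows."""
    by_name = defaultdict(list)
    for person in persons:
        by_name[person.name].append(person)
    return {name: group for name, group in by_name.items() if len(group) > 1}


persons = [
    Person(28964, "Stephanie Pillonca-Kervern", "Stéphanie Pillonca-Kervern"),
    Person(30669, "Stephanie Pillonca", "Stéphanie Pillonca-Kervern"),
    Person(1, "Curtis Jackson", "Curtis Jackson"),  # made-up control row
]

suspects = find_suspected_duplicates(persons)
for name, group in suspects.items():
    print(name, [p.id for p in group])
```

Only an exact match on the name is used here, which is why such a list can only ever contain suspects: the end user still has to confirm each group.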

modmax commented 8 years ago

First, the cause of this error:
1.) Movie A is scanned with allocine, which retrieves "Stephanie Pillonca" as an actor and stores her with the identifier "Stephanie Pillonca".
2.) Movie B is scanned with another scanner, which returns the person as "Stephanie Pillonca-Kervern" and stores her with the identifier "Stephanie Pillonca-Kervern".
3.) Later on, the allocine person scan processes both persons and sets the correct name in both entries, so you have 2 persons with different identifiers but the same name ...

The problem is that there is no unique ID for each person that is shared by every application; the name is also often different for the same person, e.g. 50 Cent, Curtis Jackson, Curtis "50 Cent" Jackson and so on ...

Doublet detection alone will not work; first there must be a mechanism for storing duplicates, perhaps a dedicated table mapping a "doublet identifier" to the "correct identifier", so that later scans can use this information and find the correct person.
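A rough sketch of such a mapping, in Python and kept in memory instead of a database table. The class and method names (`PersonAliases`, `mark_duplicate`, `resolve`) are hypothetical and only illustrate the idea; nothing here reflects existing YAMJ code.

```python
class PersonAliases:
    """Maps a doublet identifier to the identifier of the person to keep."""

    def __init__(self):
        self._correct = {}  # doublet identifier -> correct identifier

    def mark_duplicate(self, doublet_identifier, correct_identifier):
        # Resolve the target first so chains always end at a kept person.
        target = self.resolve(correct_identifier)
        if target == doublet_identifier:
            raise ValueError("an identifier cannot be a doublet of itself")
        self._correct[doublet_identifier] = target

    def resolve(self, identifier):
        # Follow the mapping until an identifier that is not a doublet is reached.
        while identifier in self._correct:
            identifier = self._correct[identifier]
        return identifier


aliases = PersonAliases()
aliases.mark_duplicate("Stephanie Pillonca", "Stephanie Pillonca-Kervern")

# A later scan that finds "Stephanie Pillonca" can look up the kept person.
print(aliases.resolve("Stephanie Pillonca"))
print(aliases.resolve("Curtis Jackson"))  # unknown identifiers resolve to themselves
```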

Furthermore: if person A is marked as a doublet of person B, then the video associations must be adjusted ... but I think that needs a rework of the current handling.
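To illustrate what adjusting the associations means, here is a small Python sketch. It assumes the cast is a plain set of (video_id, person_id) links, which is a simplification made for this example; the real schema has more to it, and the video ids are made up.

```python
def reassign_videos(cast, duplicate_id, kept_id):
    """Return the cast links with duplicate_id replaced by kept_id.

    Using a set means a video linked to both persons ends up with one link.
    """
    return {
        (video_id, kept_id if person_id == duplicate_id else person_id)
        for video_id, person_id in cast
    }


# Made-up video ids; the person ids are the two rows from the example above.
cast = {(100, 28964), (200, 30669), (300, 28964), (300, 30669)}
merged = reassign_videos(cast, duplicate_id=30669, kept_id=28964)
print(sorted(merged))
```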

jluc2808 commented 8 years ago

As you describe, we can't resolve duplicates while scanning, so the suggested doublet table is in a way the best approach for duplicates that have already been found, but that table doesn't help with duplicates that have not been identified yet.

I just ran a small script to find duplicates in my database by comparing names, and it found 500 duplicates, so we need a mechanism that scans for duplicates and asks the user to acknowledge them.
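One possible shape for the acknowledgement step, sketched in Python. The status values and the names `SuspectedDuplicate`, `acknowledge` and `pending` are assumptions made for this example; the second pair of ids is made up.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"      # found by the scan, not yet reviewed
    CONFIRMED = "confirmed"  # user agreed: really a duplicate
    REJECTED = "rejected"    # user disagreed: two different persons


@dataclass
class SuspectedDuplicate:
    kept_id: int
    duplicate_id: int
    status: Status = Status.PENDING


def acknowledge(suspect, is_duplicate):
    """Record the user's decision on one suspected pair."""
    suspect.status = Status.CONFIRMED if is_duplicate else Status.REJECTED
    return suspect


def pending(queue):
    """Pairs that still need a decision from the user."""
    return [s for s in queue if s.status is Status.PENDING]


queue = [SuspectedDuplicate(28964, 30669), SuspectedDuplicate(10, 11)]
acknowledge(queue[0], is_duplicate=True)
print([(s.kept_id, s.duplicate_id, s.status.name) for s in queue])
```

Keeping rejected pairs in the list, instead of deleting them, would stop the next scan from asking about the same pair again.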