denshoproject / namesdb-editor

Odd results from `namesdb searchmulti` #47

Open gjost opened 7 months ago

gjost commented 7 months ago

@GeoffFroh: I just got some odd results out of the namesdb searchmulti command. Here’s the call:

(names) ddr@kyuzo:/media/qnfs/kinkura/working/ireilaunch$ namesdb searchmulti ./ddr-manz-4-persons.csv --sql > ./ddr-manz-4-persons-results.csv

This collection, ddr-manz-4, is the one with all the photos of Rev. Shinjo Nagatomi. In the entity metadata, his name appears as "Nagatomi, Shinjo". His person record in the NR database (id: 88922/nr009tb36) uses the same form, so the search should return that record. Instead, the output from namesdb searchmulti is this:

"ddr-manz-4-1","Nagatomi, Shinjo","0","Kiyomatsu Tani","88922/nr003wb7p","-28.54437565963006",...

Here’s the full output (ddr-manz-4-persons-results-sql.csv):

objectid,namepart,n,preferred_name,nr_id,score,matching,sample
ddr-manz-4-1,Nagatomi, Shinjo,0,Kiyomatsu Tani,88922/nr003wb7p,-28.54437565963006,,namepart: Nagatomi, Shinjo | nr_id: 88922/nr003wb7p
ddr-manz-4-3,Nagatomi, Shinjo,0,Kiyomatsu Tani,88922/nr003wb7p,-28.54437565963006,,namepart: Nagatomi, Shinjo | nr_id: 88922/nr003wb7p
ddr-manz-4-4,Nagatomi, Shinjo,0,Kiyomatsu Tani,88922/nr003wb7p,-28.54437565963006,,namepart: Nagatomi, Shinjo | nr_id: 88922/nr003wb7p
ddr-manz-4-5,Nagatomi, Shinjo,0,Kiyomatsu Tani,88922/nr003wb7p,-28.54437565963006,,namepart: Nagatomi, Shinjo | nr_id: 88922/nr003wb7p
...

It looks like it's not just that particular name: "Iwata, Jack" returns Allan Tomio Mizuhara, and "Hori, Tashi" returns Sonoko Kondo.

FWIW, I can search these strings directly from the NR Editor admin and get the expected names. Note: I have to omit the , char, or I get no results; i.e., Hori Tashi returns the record, while Hori, Tashi returns zero results. I'm assuming this is something specific to the default Django admin search config.

This works: http://namesdbeditor.local/admin/names/person/?q=Hori+Tashi
This does not work: http://namesdbeditor.local/admin/names/person/?q=Hori%2C+Tashi
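The comma behaviour is consistent with how the default Django admin search generally works: the query is split on whitespace, and every resulting word must appear as a case-insensitive substring in at least one of the configured `search_fields`. A minimal sketch of that logic in plain Python follows. The `admin_style_match` helper, the record, and the field list are illustrative assumptions, not taken from the namesdb-editor admin config.

```python
def admin_style_match(query, record, search_fields):
    """Approximate the default Django admin search: every whitespace-separated
    word must be a case-insensitive substring of at least one search field."""
    words = query.split()
    return all(
        any(word.lower() in (record.get(field) or "").lower() for field in search_fields)
        for word in words
    )


# Hypothetical record and field list, for illustration only.
person = {"family_name": "Hori", "given_name": "Tashi"}
fields = ["family_name", "given_name"]

print(admin_style_match("Hori Tashi", person, fields))   # True
print(admin_style_match("Hori, Tashi", person, fields))  # False
```

With the comma attached, the first word is "Hori," and no field contains that exact substring, so the record is filtered out.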


Update: ddrnames load moved to https://github.com/denshoproject/ddr-cmdln/issues/241

GeoffFroh commented 7 months ago

The command did appear to generate expected results when it was run back in Feb and Mar of last year. See results csvs in /media/qnfs/kinkura/working/names

gjost commented 7 months ago

re: ddrnames load, this command is a bit unusual. For whatever reason, I wrote it with a --save arg and a --commit arg. Without those args it just prints stuff out to STDOUT. Update: ddrnames load issue moved to https://github.com/denshoproject/ddr-cmdln/issues/241

gjost commented 7 months ago

@GeoffFroh Can you post ddr-manz-4-persons.csv from the initial call? I can't duplicate this without that source CSV.

gjost commented 7 months ago

I'm not sure we can do anything about this one. It looks like Kiyomatsu Tani is just what we're getting back from the SQLite fulltext search.

I hacked namesdb searchmulti to print out debugging info including the SQL statements:

(names) ddr@densho101dev:/opt/namesdb-editor$ namesdb searchmulti /tmp/ddr-manz-4-persons.csv --sql
"objectid","namepart","n","preferred_name","nr_id","score","matching","sample"
item={'namepart': 'Nagatomi, Shinjo', 'oid': 'ddr-manz-4-1', 'fieldname': 'persons'}
fulltext search
fulltext_search_sql
sql='SELECT rowid, rank FROM names_person_fts("nagatomi shinjo")'
sql2='SELECT rowid, * FROM names_person_fts WHERE rowid IN (4750)'
"ddr-manz-4-1","Nagatomi, Shinjo","0","Kiyomatsu Tani","88922/nr003wb7p","-28.54437565963006","","namepart: Nagatomi, Shinjo | nr_id: 88922/nr003wb7p"

For whatever reason, SQLite's FTS algorithms seem to think that's the best match for a search on nagatomi shinjo:

(names) ddr@densho101dev:/opt/namesdb-editor$ python src/manage.py dbshell --database names
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> SELECT rowid, rank FROM names_person_fts("nagatomi shinjo");
4750|-28.5443756596301
sqlite> SELECT rowid, * FROM names_person_fts WHERE rowid IN (4750);
4750|88922/nr003wb7p|Tani|Kiyomatsu|||||||Kiyomatsu Tani
sqlite> SELECT rowid, rank FROM names_person_fts("nagatomi");
4750|-14.093919355483
4751|-14.093919355483
4752|-14.093919355483
4753|-14.093919355483
sqlite> SELECT rowid, rank FROM names_person_fts("shinjo nagatomi");
4750|-28.5443756596301
sqlite> SELECT rowid, * FROM names_person_fts WHERE rowid IN (4750,4751,4752,4753);
4750|88922/nr003wb7p|Tani|Kiyomatsu|||||||Kiyomatsu Tani
4751|88922/nr003wb8c|Tani|Misao|||||||Misao Tani
4752|88922/nr003wb92|Tani|Yasujiro|||Joe||||Yasujiro Joe Tani
4753|88922/nr003wc0h|Tani|Aya|||||||Aya Tani

Unfortunately I think this may just be what we get. We need humans in this loop.
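For readers following the debug output, here is a minimal sketch of the same two-step lookup against a toy in-memory FTS5 table. It is not the namesdb code: the `search` helper, the rowids, and the trimmed column list are assumptions for illustration, and the normalization (drop the comma, lowercase) is only inferred from the logged `sql=` line. It needs a Python build whose SQLite includes FTS5.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Toy table: the real names_person_fts has more columns and different rowids.
con.execute("CREATE VIRTUAL TABLE names_person_fts USING fts5(nr_id, family_name, given_name)")
con.executemany(
    "INSERT INTO names_person_fts(rowid, nr_id, family_name, given_name) VALUES (?, ?, ?, ?)",
    [
        (1, "88922/nr003wb7p", "Tani", "Kiyomatsu"),
        (2, "88922/nr003wb8c", "Tani", "Misao"),
        (3, "88922/nr009tb36", "Nagatomi", "Shinjo"),
    ],
)


def search(con, namepart):
    """Two-step lookup: (1) get rowid and rank from FTS, (2) fetch the rows."""
    # Assumed normalization: "Nagatomi, Shinjo" -> "nagatomi shinjo".
    # Names containing FTS5 query syntax characters would need extra quoting.
    query = " ".join(namepart.replace(",", " ").lower().split())
    hits = con.execute(
        "SELECT rowid, rank FROM names_person_fts "
        "WHERE names_person_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    if not hits:
        return []
    rowids = [rowid for rowid, _rank in hits]
    placeholders = ",".join("?" * len(rowids))
    return con.execute(
        f"SELECT rowid, * FROM names_person_fts WHERE rowid IN ({placeholders})",
        rowids,
    ).fetchall()


print(search(con, "Nagatomi, Shinjo"))
```

On a healthy index, both terms must match the same row, so only the Nagatomi row comes back. FTS5 rank values are negative, and the more negative value is the better match, so ordering by rank ascending puts the best hit first.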

gjost commented 7 months ago

FWIW, I can search these strings directly from the NR Editor admin and get the expected names.

The search in the Django Admin is probably not using SQLite FTS.

I could maybe add a --fulltext/--boolean arg pair so you have the option to do a boolean search if the fulltext algo doesn't do what you want.

GeoffFroh commented 7 months ago

Perhaps something bad happened to the SQLite fulltext index? Maybe force a reindex?
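One way an FTS index can go bad is sketched below with Python's built-in sqlite3 module. If the FTS5 table is an external-content table (it stores only the index and reads column values from the base table), and the base table is reloaded without updating the index, the index keeps pointing at old rowids that now hold different people. This is an assumption about what could cause results like the ones above, not something confirmed in this thread; the table layout and rowids are toy values.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE names_person (id INTEGER PRIMARY KEY, family_name TEXT, given_name TEXT)"
)
# External-content FTS5 table: the index lives here, the text lives in names_person.
con.execute(
    "CREATE VIRTUAL TABLE names_person_fts USING fts5("
    "family_name, given_name, content='names_person', content_rowid='id')"
)

# Initial load: Nagatomi is row 1 and gets indexed as row 1.
con.execute("INSERT INTO names_person VALUES (1, 'Nagatomi', 'Shinjo')")
con.execute(
    "INSERT INTO names_person_fts(rowid, family_name, given_name) "
    "SELECT id, family_name, given_name FROM names_person"
)

# Simulated reload of the base table WITHOUT touching the index:
# row 1 is now a different person.
con.execute("DELETE FROM names_person")
con.execute(
    "INSERT INTO names_person VALUES (1, 'Tani', 'Kiyomatsu'), (2, 'Nagatomi', 'Shinjo')"
)

query = (
    "SELECT rowid, family_name, given_name FROM names_person_fts "
    "WHERE names_person_fts MATCH 'nagatomi shinjo'"
)
stale = con.execute(query).fetchall()
print(stale)  # [(1, 'Tani', 'Kiyomatsu')]

# Rebuild the index from the current contents of names_person.
con.execute("INSERT INTO names_person_fts(names_person_fts) VALUES ('rebuild')")
fresh = con.execute(query).fetchall()
print(fresh)  # [(2, 'Nagatomi', 'Shinjo')]
```

The stale index still says "row 1 matches nagatomi shinjo", but row 1 in the base table is now someone else, so the search confidently returns the wrong person until the index is rebuilt.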

gjost commented 7 months ago

That's exactly what it was. The answer was right in namesdb searchmulti -h. Running the following as `ddr` worked in my local:

sqlite-utils disable-fts db/namesregistry.db names_person

and then

sqlite-utils enable-fts --fts5 db/namesregistry.db names_person nr_id \
    family_name given_name given_name_alt other_names middle_name \
    prefix_name suffix_name jp_name preferred_name

gjost commented 7 months ago

Ran the commands against the canon db on kyuzo and it seems to have worked.
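To go beyond "seems to have worked", a reindex can be spot-checked by confirming that every FTS hit actually contains the search terms. The helper below is a rough sketch using only Python's sqlite3 module; `suspect_hits` is a hypothetical name, the toy table stands in for the real database, and the substring comparison is an approximation (names with diacritics that the tokenizer folds could be flagged incorrectly).

```python
import sqlite3


def suspect_hits(con, namepart, fts_table="names_person_fts"):
    """Return FTS hits whose column text is missing one of the search tokens.

    fts_table is interpolated into the SQL, so pass trusted table names only.
    """
    tokens = namepart.replace(",", " ").lower().split()
    rows = con.execute(
        f"SELECT rowid, * FROM {fts_table} WHERE {fts_table} MATCH ?",
        (" ".join(tokens),),
    ).fetchall()
    suspects = []
    for rowid, *columns in rows:
        text = " ".join(str(value).lower() for value in columns if value is not None)
        if not all(token in text for token in tokens):
            suspects.append((rowid, columns))
    return suspects


# Toy, healthy index for illustration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE names_person_fts USING fts5(family_name, given_name)")
con.execute(
    "INSERT INTO names_person_fts(rowid, family_name, given_name) "
    "VALUES (1, 'Nagatomi', 'Shinjo')"
)
print(suspect_hits(con, "Nagatomi, Shinjo"))  # []
```

An empty list means every returned row contains all the search terms. A row such as Kiyomatsu Tani coming back for "Nagatomi, Shinjo" would show up in the list and signal that the index and the table are out of step.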