comictagger / gcd_talker

A Grand Comics Database talker for Comictagger
Apache License 2.0
2 stars 1 forks source link

Can't find "Cory Doctorow's Futuristic Tales of the Here and Now" #4

Closed lordwelch closed 6 months ago

lordwelch commented 7 months ago

When searching for "Cory Doctorow's Futuristic Tales of the Here and Now" nothing is returned, searching for "Cory Doctorow" does return the correct result.

SQLite has a full text search option which is probably useful, I checked python 3.8-3.12 and pypy3 and they all include the FTS5 extension. You could also check if it is supported first https://stackoverflow.com/a/36663390

The literal search also doesn't work because it removes the ' from the search.

mizaki commented 7 months ago

I did have a look at FTS and decided against it and went with adding %'s instead for the search. Both problems come down to the ' being removed. Removing utils.sanitize_title(series_name, literal) should fix both issues.

The other option is to replace ' and " (maybe some more) with % the same as any spaces.

mizaki commented 7 months ago

https://github.com/mizaki/gcd_talker/commit/97daa5f98ec5a3fe0242a25702476844285b9227 should have fixed both issues.

I had another look at FTS and it wouldn't help with the ' issue and I don't see it helping unless wanting to say search over the descriptions too. Which I don't think is all that helpful?

lordwelch commented 7 months ago

The exact search that I opened this for now works. But "Cory Doctorows Futuristic Tales of the Here and Now" doesn't work which is just as problematic.

Also auto-identify doesn't appear to get any results when a year is specified but the screenshot in #5 shows that it has the correct year. Here is the log. ComicTagger.log

Also using FTS would make this better as it handles all of the different forms that I can come up with.

Create the table and populate

CREATE VIRTUAL TABLE ft USING fts5(name, content='gcd_series', content_rowid='id', tokenize = 'porter unicode61 remove_diacritics 1') ;
INSERT INTO ft(ft) VALUES('rebuild');

Search: Cory Doctorow's Futuristic Tales of the Here and Now (fts needs special characters ' quoted sqlite escapes ' by doubling '')

SELECT gcd_series.id AS 'id', gcd_series.name AS 'series_name', 
gcd_series.sort_name AS 'sort_name', gcd_series.notes AS 'notes', 
gcd_series.year_began AS 'year_began', gcd_series.year_ended AS 'year_ended', 
gcd_series.issue_count AS 'issue_count', gcd_publisher.name AS 'publisher_name'
FROM ft('Cory "Doctorow''s" Futuristic Tales of the Here and Now') 
LEFT JOIN gcd_series on ft.rowid=gcd_series.id
LEFT JOIN gcd_publisher ON gcd_series.publisher_id=gcd_publisher.id ;

Search: Cory "Doctorow s" Futuristic Tales of the Here and Now ("Doctorow s" is a single term)

SELECT gcd_series.id AS 'id', gcd_series.name AS 'series_name', 
gcd_series.sort_name AS 'sort_name', gcd_series.notes AS 'notes', 
gcd_series.year_began AS 'year_began', gcd_series.year_ended AS 'year_ended', 
gcd_series.issue_count AS 'issue_count', gcd_publisher.name AS 'publisher_name'
FROM ft('Cory "Doctorow s" Futuristic Tales of the Here and Now') 
LEFT JOIN gcd_series on ft.rowid=gcd_series.id
LEFT JOIN gcd_publisher ON gcd_series.publisher_id=gcd_publisher.id ;

Search: Cory Doctorow s Futuristic Tales of the Here and Now ("Doctorow s" is two terms)

SELECT gcd_series.id AS 'id', gcd_series.name AS 'series_name', 
gcd_series.sort_name AS 'sort_name', gcd_series.notes AS 'notes', 
gcd_series.year_began AS 'year_began', gcd_series.year_ended AS 'year_ended', 
gcd_series.issue_count AS 'issue_count', gcd_publisher.name AS 'publisher_name'
FROM ft('Cory Doctorow s Futuristic Tales of the Here and Now') 
LEFT JOIN gcd_series on ft.rowid=gcd_series.id
LEFT JOIN gcd_publisher ON gcd_series.publisher_id=gcd_publisher.id ;

Search: Cory Doctorows Futuristic Tales of the Here and Now

SELECT gcd_series.id AS 'id', gcd_series.name AS 'series_name', 
gcd_series.sort_name AS 'sort_name', gcd_series.notes AS 'notes', 
gcd_series.year_began AS 'year_began', gcd_series.year_ended AS 'year_ended', 
gcd_series.issue_count AS 'issue_count', gcd_publisher.name AS 'publisher_name'
FROM ft('Cory Doctorows Futuristic Tales of the Here and Now') 
LEFT JOIN gcd_series on ft.rowid=gcd_series.id
LEFT JOIN gcd_publisher ON gcd_series.publisher_id=gcd_publisher.id ;
mizaki commented 7 months ago

Interestingly removing the ' makes it fail on GCD itself: https://www.comics.org/series/name/Cory%20Doctorows%20Futuristic%20Tales%20of%20the%20Here%20and%20Now/sort/alpha/

mizaki commented 7 months ago

I've added FTS in https://github.com/mizaki/gcd_talker/commit/b6252756b384147e0cd9aa59b82b7f9a0f33e4ea

Not sure on checking the fts and fts_docsize tables to see if a rebuild is needed. I can't really see where it would be required.

lordwelch commented 7 months ago

It shouldn't be needed unless someone is doing some custom modifications to the db, you can also setup hooks to keep it up to date in case someone does

lordwelch commented 6 months ago

Thanks for fixing this!