comictagger / metron_talker

Metron.cloud comictagger talker plugin
File name parsing for Metron search problems #21

Open beville opened 7 months ago

beville commented 7 months ago

Looks like Metron search based on a parsed file name (auto-tagging) is failing in some cases.

A series title with colons (:) and slashes (/) will often have those replaced with space-minus-space ( - ) in a filename. A good example is "Batman / Superman: World's Finest", which might have the filename "Batman - Superman - World's Finest" (or even one without the ' in it). The minuses cause the Metron search to fail.

Probably CT just needs to remove minus/dash characters (-) from the search string before submitting it.
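
A minimal sketch of that idea, assuming a hypothetical helper named `strip_dashes` (not an existing CT or metron_talker function). It treats hyphen-minus, en dash, and em dash as separators, replaces them with spaces, and collapses the leftover whitespace:

```python
import re


def strip_dashes(query: str) -> str:
    """Replace dash-like characters with spaces and collapse whitespace.

    Hypothetical helper for illustration only.
    """
    # Hyphen-minus, en dash (U+2013) and em dash (U+2014) all become spaces.
    no_dashes = re.sub(r"[-\u2013\u2014]", " ", query)
    # split()/join() collapses the runs of spaces left behind.
    return " ".join(no_dashes.split())


print(strip_dashes("Batman - Superman - World's Finest"))
```

Note that this also splits hyphenated titles such as "Spider-Man" into two words; whether those still match on Metron is untested here.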


Another issue I think is probably server-side, and I should probably make an issue with @bpepple over on his repo. Metron doesn't like a missing apostrophe in some cases.

So when the search string is:

"Cory Doctorow's Futuristic Tales of the Here and Now": it works

"Cory Doctorow s Futuristic Tales of the Here and Now": also works

"Cory Doctorow Futuristic Tales of the Here and Now": also works

but for some reason

"Cory Doctorows Futuristic Tales of the Here and Now" fails.

Unfortunately for auto-tagging, it's pretty common to see the dropped apostrophe. I can't think of a good client-side solution for that one, though, but maybe you all have an idea.

Thanks!

bpepple commented 7 months ago

> Probably CT just needs to remove minus/dash characters (-) from the search string before submitting it.

Yeah, I think I mentioned in #17 that sanitizing the query string would greatly help with matching.

> Another issue I think is probably server-side, and I should probably make an issue with @bpepple over on his repo. Metron doesn't like a missing apostrophe in some cases.

Possibly using SearchVector/SearchRank or a Trigram Similarity in Django/PostgreSQL may improve results, but there is a significant performance penalty to be paid.

Might be worth testing to see if it gives improved results.

mizaki commented 7 months ago

I was all set to tell you it does sanitise, and then I checked and it does not. I wonder what the reason was... I think it may have been because Metron was returning some tests without it, but obviously there is still some sanitising needed.

bpepple commented 7 months ago

> I was all set to tell you it does sanitise, and then I checked and it does not. I wonder what the reason was... I think it may have been because Metron was returning some tests without it, but obviously there is still some sanitising needed.

Yeah, it's much easier on my end to see what metron_talker is submitting for a series name. 😉

mizaki commented 6 months ago

I tried using the same sanitising as CT, but that causes the ' problem, so in #24 I've done everything the same as CT except that the ' is removed without putting a space in its place. Give it a try on any titles you had trouble with.
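
As a rough illustration of that rule (not the actual code from #24, and not CT's real sanitiser, neither of which is shown in this thread), the sketch below deletes apostrophes outright and turns every other punctuation character into a space. The function name `sanitize_series_name` is hypothetical:

```python
import re


def sanitize_series_name(name: str) -> str:
    """Delete apostrophes, replace other punctuation with spaces.

    Hypothetical sketch; the real rules in CT and #24 may differ.
    """
    # Apostrophes (straight and typographic) are removed with no space,
    # so "World's" becomes "Worlds" rather than "World s".
    name = re.sub(r"['\u2019]", "", name)
    # Anything else that is not a letter, digit or whitespace becomes a space.
    name = re.sub(r"[^\w\s]|_", " ", name)
    # Collapse the resulting runs of whitespace.
    return " ".join(name.split())


title = sanitize_series_name("Batman/Superman: World's Finest")
filename = sanitize_series_name("Batman - Superman- Worlds Finest")
print(title)
print(title == filename)
```

Under these assumptions the catalogue title and a typical filename reduce to the same string. Whether Metron then matches that string is a separate, server-side question.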

beville commented 6 months ago

Seems to work better for some titles!

bpepple commented 5 months ago

> Another issue I think is probably server-side, and I should probably make an issue with @bpepple over on his repo. Metron doesn't like a missing apostrophe in some cases.
>
> So when the search string is:
>
> "Cory Doctorow's Futuristic Tales of the Here and Now": it works
>
> "Cory Doctorow s Futuristic Tales of the Here and Now": also works
>
> "Cory Doctorow Futuristic Tales of the Here and Now": also works
>
> but for some reason
>
> "Cory Doctorows Futuristic Tales of the Here and Now" fails.
>
> Unfortunately for auto-tagging, it's pretty common to see the dropped apostrophe. I can't think of a good client-side solution for that one, though, but maybe you all have an idea.

Looked into this a bit more today, and played around with PostgreSQL's Trigram Similarity Matching, which helps deal with the apostrophe matching issues. Unfortunately, Django's support isn't implemented with Transform, so I'd lose support for Unaccent and other lookup options unless I made some hand-crafted artisanal SQL statements (which I don't really have the time to do), so this isn't a great solution.
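
For intuition on why trigram matching tolerates the dropped apostrophe, here is a plain-Python approximation of the idea. It is not Django or PostgreSQL code, and the `trigrams` and `similarity` helpers are hypothetical. It assumes pg_trgm-style tokenisation (lowercase, split on non-alphanumerics, pad each word with two leading spaces and one trailing space), and the real extension may differ in details:

```python
import re


def trigrams(text: str) -> set:
    """Approximate a pg_trgm-style trigram set for a string."""
    grams = set()
    # Words are runs of letters/digits; punctuation such as ' splits words.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


with_apostrophe = "Cory Doctorow's Futuristic Tales of the Here and Now"
without_apostrophe = "Cory Doctorows Futuristic Tales of the Here and Now"
print(round(similarity(with_apostrophe, without_apostrophe), 2))
```

The two spellings share almost all of their trigrams, so the score stays high, whereas an exact or prefix-style text match sees "Doctorows" as a different word.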

mizaki commented 5 months ago

> Seems to work better for some titles!

I've been a bit slow on the talker front, but if you have any examples of problem titles I'll see what I can do (and put in some tests).

beville commented 5 months ago

I think the classic problem title in this vein is:

"Batman/Superman: World's Finest"

which includes a slash, a colon, and an apostrophe.

Filenames might have the slash and colon replaced with a space or a "-". The apostrophe often seems to be simply removed. (I can't remember if that's a problem character on Windows filesystems?)

"Batman - Superman- Worlds Finest"

I think in general the apostrophe (in English anyways) is most problematic for filename-to-search, since it tends to be replaced with nothing rather than a space in some filenames.