au5ton / Roboruri

🤖 [offline] /u/roboragi but for Telegram
https://t.me/roboruri_bot
GNU General Public License v2.0

Incomplete/incorrect responses #6

Open Nihilate opened 7 years ago

Nihilate commented 7 years ago

I went through my own list of manual redirects and tried most of them to see what happened. Here are the ones that I found were incomplete or incorrect:

There are also some reasonably common short synonyms that might be requested by users, but aren't technically titles, like {Evangelion}, {Eva}, {Haruhi}, {Lain}, {Monster Musume}, {Umaru} and {Kaiji} (all of which are either wrong or only provide one of the two).

au5ton commented 7 years ago

It seems like some of these can be solved by verifying the media_type before consolidating the results from both data sources (seen here), which is next on the list after I confirm what Anilist and MAL return from their APIs. Currently, it just saves whatever it's given as a String instead of mapping it to an enumeration. Edit: implemented in 17ccee942ff3f7fb39c508980443c944e04334a7
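As a rough illustration of the media_type check described above, here is a minimal Python sketch (not Roboruri's actual code). The enum members, the raw label strings, and the helper names `normalize_media_type` and `can_consolidate` are all assumptions made for the example; the real values depend on what MAL and Anilist actually return.

```python
from enum import Enum
from typing import Optional


class MediaType(Enum):
    TV = "tv"
    MOVIE = "movie"
    OVA = "ova"
    MANGA = "manga"
    UNKNOWN = "unknown"


# Hypothetical raw labels; verify them against what each API really returns.
RAW_TO_MEDIA_TYPE = {
    "tv": MediaType.TV,
    "movie": MediaType.MOVIE,
    "ova": MediaType.OVA,
    "manga": MediaType.MANGA,
}


def normalize_media_type(raw: Optional[str]) -> MediaType:
    """Map a source-specific label onto the shared enum."""
    if raw is None:
        return MediaType.UNKNOWN
    return RAW_TO_MEDIA_TYPE.get(raw.strip().lower(), MediaType.UNKNOWN)


def can_consolidate(mal_type: Optional[str], anilist_type: Optional[str]) -> bool:
    """Merge two results only when both resolve to the same known media type."""
    a = normalize_media_type(mal_type)
    b = normalize_media_type(anilist_type)
    return a is not MediaType.UNKNOWN and a is b
```

With this shape, a TV series on one source is never merged with a movie of the same name on the other, and unrecognized labels are rejected rather than guessed.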

On the other hand, one of the biggest problems I ran into was matching anime between data sources. Currently, it works out which title the user is probably referring to by comparing the query against all the titles (in all formats) and saving the Assumed Real Title (ART), then looks up anime with an exact match to the ART across all the search results. I decided on an exact match because some titles are identical except for subtle changes to their punctuation or names for different seasons, and I didn't want to match the wrong season or show (examples being "K-On!", "Chuunibyou Demo koi ga Shitai!", and "Kissxsis"). I believe the recurring issue of results not being recognized on both data sources simultaneously is because I'm intentionally rejecting them. In addition, there were cases where synonyms for a show simply didn't show up on one website but did on the other. One example is {Goku 2}, where MAL doesn't actually have a synonym listed for the anime you're referring to, which is why it didn't match on MAL.
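To make the Assumed Real Title approach concrete, here is a small Python sketch of the idea as described above. It is not the bot's implementation: the `Result` shape, the function names, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Result:
    source: str        # e.g. "MAL" or "Anilist"
    titles: List[str]  # every title format and synonym the source returned


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def assumed_real_title(query: str, results: List[Result]) -> Optional[str]:
    """Pick the title, across all results and formats, closest to the query."""
    all_titles = [title for result in results for title in result.titles]
    if not all_titles:
        return None
    return max(all_titles, key=lambda title: similarity(query, title))


def exact_matches(art: str, results: List[Result]) -> List[Result]:
    """Keep only results that list the ART verbatim among their titles."""
    return [result for result in results if art in result.titles]
```

The strict second step is what keeps "K-On!" from being merged with "K-On!!", and it is also why a result gets dropped when one source lacks the exact string the other source provided.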

Those are my best guesses, but I have yet to analyze them.

That said, this is still a problem, and I'm not sure how to solve it at the moment. I'll have to think about that. What do you think, @Nihilate?

I definitely want to have the capability of manually adding synonyms for matching when Roboruri reaches production -- and {Days} is a prime example why -- but until then, I want to do everything I can to refine the searching algorithms in place.

Nihilate commented 7 years ago

One thing Roboragi does which will improve accuracy (and decrease performance a lot) is that he goes back to search the same source multiple times if he doesn't find any results (but finds new terms to use). For example:

Query = {nyanko-days}
List of searchable terms: "nyanko-days"
MAL - Romaji = "Nyanko Days", English = "Nyanko Days", Synonyms = None
Anilist - Romaji = "nyanko-days", English = "nyanko-days", Synonyms = "Nyanko Days"

  1. Search MAL: "nyanko-days", no results, mark "nyanko-days" as "searched" for MAL
  2. Search ANI: "nyanko-days", found, add all new titles and synonyms to the list of searchable terms, mark "nyanko-days" as "searched" for ANI
  3. List of searchable terms = "nyanko-days", "Nyanko Days"
  4. Are there any data sources which a.) don't have a hit and b.) haven't exhausted all terms in the searchable terms list? Yes, MAL doesn't have a match yet and it hasn't searched "Nyanko Days" (from the Anilist synonym)
  5. Search MAL: "Nyanko Days", found, add all new titles and synonyms to the list of searchable terms (no new titles to add), mark "Nyanko Days" as "searched" for MAL

Doing this builds up a solid "web" of synonyms (especially as you add more data sources) and gives you much more accurate results at the cost of performance. My recommendation is to cache your results so you can take the hit the very first time something is searched and be lightning fast the next time.
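The steps above, plus the caching suggestion, could be sketched roughly as follows in Python. This is an illustration under assumptions, not Roboragi's actual code: each data source is modeled as a hypothetical function that takes a term and returns the hit's titles and synonyms (or `None`), and the cache is a plain in-memory dict keyed on the lowercased query.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical source interface: term -> titles/synonyms of the hit, or None.
SearchFn = Callable[[str], Optional[List[str]]]


def cross_search(query: str, sources: Dict[str, SearchFn]) -> Dict[str, List[str]]:
    """Re-search sources without a hit whenever new terms are discovered."""
    terms: List[str] = [query]
    searched: Dict[str, Set[str]] = {name: set() for name in sources}
    hits: Dict[str, List[str]] = {}

    progress = True
    while progress:
        progress = False
        for name, search in sources.items():
            if name in hits:
                continue  # this source already has a match
            for term in list(terms):
                if term in searched[name]:
                    continue  # already tried this term on this source
                searched[name].add(term)
                progress = True
                found = search(term)
                if found is not None:
                    hits[name] = found
                    # New titles and synonyms become searchable terms.
                    for title in found:
                        if title not in terms:
                            terms.append(title)
                    break
    return hits


_cache: Dict[str, Dict[str, List[str]]] = {}


def cached_cross_search(query: str, sources: Dict[str, SearchFn]) -> Dict[str, List[str]]:
    """Pay the multi-pass cost once per query; later lookups hit the cache."""
    key = query.strip().lower()
    if key not in _cache:
        _cache[key] = cross_search(query, sources)
    return _cache[key]
```

The loop stops once a full pass makes no new search, so it always terminates: every source ends up either with a hit or with every known term exhausted. A real bot would likely want cache expiry and persistent storage, which this sketch leaves out.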

Re: media type mapping, Roboragi doesn't currently do much with it, but I've mapped what MAL and Anilist use to an enum in Acerola here and here if you want something to work off.

au5ton commented 7 years ago

I think that's an excellent idea! I thought about it before but set it aside because of the performance cost; caching would make it worthwhile. I'll look into it.

Re: media type mapping, you just saved me a lot of time! :)

au5ton commented 6 years ago

I implemented a synonym/slang name database in 978196f0ec3ea4bfca86d323f74cfd759d24eb48 and I'm going through and checking which of your initial queries are fixed due to that and media_type confirmation in 17ccee942ff3f7fb39c508980443c944e04334a7. If it's crossed out, it's good now: