RickDB / PlexAniSync

Sync Plex anime library to AniList
GNU General Public License v3.0

AniDB vs Anilist - add support for Movies and wo/o naming differences #126

Open karpik123 opened 2 years ago

karpik123 commented 2 years ago

I went through my library and synced everything. I use x-jat names from AniDB and I noticed two naming patterns that should be straightforward to cover, saving a lot of work on custom mappings.

First Pattern - 'Movie'

| AniDB Name | Anilist Name |
| --- | --- |
| Gekijouban Blood-C: The Last Dark | BLOOD-C: The Last Dark |
| Gekijouban Mahouka Koukou no Rettousei: Hoshi o Yobu Shoujo | Mahouka Koukou no Rettousei: Hoshi wo Yobu Shoujo |
| Gekijouban xxxHOLiC: Manatsu no Yo no Yume | xxxHOLiC: Manatsu no Yoru no Yume |
| Gekijouban Dungeon ni Deai o Motomeru no wa Machigatte Iru Darouka: Orion no Ya | Dungeon ni Deai o Motomeru no wa Machigatte Iru Darouka: Orion no Ya |

PlexAniSync could recognise this word and make an extra attempt to match the title after removing Gekijouban<space> from the string.

Another similar example is 'Eiga':

| AniDB Name | Anilist Name |
| --- | --- |
| Eiga Crayon Shin-chan: Mononoke Ninja Chinfuuden | Crayon Shin-chan: Mononoke Ninja Chinfuuden |
| Eiga Doraemon: Nobita no Little Star Wars 2021 | Doraemon: Nobita no Little Star Wars 2021 |
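
To make the idea concrete, here is a minimal sketch of the extra attempt in Python. The names `MOVIE_PREFIXES`, `strip_movie_prefix` and `candidate_titles` are made up for illustration and are not part of PlexAniSync. The sketch assumes the prefix is only ever removed from the very start of the title.

```python
# Hypothetical helper names; this is not PlexAniSync's actual code.
MOVIE_PREFIXES = ("Gekijouban ", "Eiga ")


def strip_movie_prefix(title):
    """Return the title without a leading movie marker, or None if it has none."""
    for prefix in MOVIE_PREFIXES:
        # startswith() anchors the check to the beginning of the string,
        # so the same word in the middle of a title is left alone.
        if title.startswith(prefix):
            return title[len(prefix):]
    return None


def candidate_titles(title):
    """Original title first, then the prefix-less variant as a fallback."""
    candidates = [title]
    stripped = strip_movie_prefix(title)
    if stripped:
        candidates.append(stripped)
    return candidates
```

The original title stays first in the list, so the stripped variant would only be tried if the normal lookup finds nothing.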

Second Pattern - wo vs o

| AniDB Name | Anilist Name |
| --- | --- |
| Hige o Soru. Soshite Joshikousei o Hirou. | Hige wo Soru. Soshite Joshikousei wo Hirou. |
| Sono Bisque Doll wa Koi o Suru | Sono Bisque Doll wa Koi wo Suru |
| Seishun Buta Yarou wa Yumemiru Shoujo no Yume o Minai | Seishun Buta Yarou wa Yumemiru Shoujo no Yume wo Minai |
| Nakitai Watashi wa Neko o Kaburu | Nakitai Watashi wa Neko wo Kaburu |
| Fune o Amu | Fune wo Amu |

AniDB almost universally writes this as o, while Anilist uses wo in titles. I don't know Japanese well enough to understand why... PlexAniSync could catch <space>o<space> in the string and make an extra attempt to match the title after converting o into wo. Note that the top example in the table even contains o twice. While some titles might genuinely use o, I don't expect them to match a completely different title even if PlexAniSync converts an innocent o into wo.
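
A rough sketch of that conversion in Python, using only the standard `re` module. The names `O_PARTICLE` and `o_to_wo` are hypothetical, and the sketch assumes only a lowercase o with a single space on each side should be touched.

```python
import re

# A standalone lowercase "o" with a space on each side. Lookarounds keep the
# spaces out of the match, so two particles in one title are both replaced.
O_PARTICLE = re.compile(r"(?<= )o(?= )")


def o_to_wo(title):
    """Return the title with every standalone ' o ' rewritten as ' wo '."""
    return O_PARTICLE.sub("wo", title)
```

Titles without a standalone o come back unchanged, so the caller can compare the result with the input and skip the extra lookup when nothing was converted.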

karpik123 commented 2 years ago

I got my hands on the AniDB titles .xml.gz file and did some top-level counting. I discarded all lines from the XML except those with lang="x-jat" and type="main".
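
For anyone who wants to repeat the filtering, something along these lines should work in Python. It is a sketch under assumptions: it expects the dump to hold `<anime aid="...">` elements with `<title type="..." xml:lang="...">` children, and `main_xjat_titles` and `SAMPLE` are names invented here. The sample record uses a made-up aid and is not taken from the real dump.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample in the assumed layout of the dump (the aid is a placeholder).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<animetitles>
  <anime aid="1">
    <title type="main" xml:lang="x-jat">Fune o Amu</title>
    <title type="official" xml:lang="en">The Great Passage</title>
  </anime>
</animetitles>"""


def main_xjat_titles(source):
    """Yield (aid, title) for titles with type="main" and xml:lang="x-jat".

    `source` is a path or a binary file object holding gzip-compressed XML.
    """
    with gzip.open(source, "rb") as handle:
        # iterparse streams the file, so the whole dump is never held in memory.
        for _, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "anime":
                continue
            for title in elem.findall("title"):
                if title.get("type") == "main" and title.get(XML_LANG) == "x-jat":
                    yield elem.get("aid"), title.text
            elem.clear()  # drop the processed <anime> subtree


if __name__ == "__main__":
    compressed = io.BytesIO(gzip.compress(SAMPLE))
    for aid, name in main_xjat_titles(compressed):
        print(aid, name)
```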

I was left with 593 titles:

Numbers don't add up as Eiga + o or Gekijouban + o happen sometimes.

I did this to do more data checks and to confirm the logic won't be harmful. I spotted some odd cases, please read on.

The o->wo rule

The overwhelming majority of examples would be perfect if o became wo.

Some oddities:

Gekijouban rule

Some medium disappointment here, I have to go back on my initial assumption.

Here are examples where the Gekijouban-less title will match a TV show of the same name:

Funny outlier: Gekijouban Idol Bu Show, anidb: 17230 is https://anilist.co/anime/145916/IDOL-bu-SHOW-Movie/ but there's no tv show covering the name.

Eiga rule

Not as many as in the Gekijouban case, but I can find similar issues.

Here are examples where the Eiga-less title will match a TV show of the same name:

Other oddities: Komadori Eiga Komaneko (anidb 7306) proves that Eiga needs to be matched only at the beginning of the string.


Summary

Wo-ing the titles seems safe and desired.

While all the earlier examples from my own library would match the correct Anilist title (after de-gekijoubaning or de-eigaing), there seem to be too many cases where stripping the prefix would cause problems.

Instead, I think it's safer to attempt the following treatment:

I attach the file with the cleaned titles I used for the above research: https://gist.github.com/karpik123/760774de1a0a90156567d794a704e71a