Rapptz / jimaku

A site for hosting (Japanese) subtitles
https://jimaku.cc
GNU Affero General Public License v3.0

[API `/files`] The `episode` parameter seems to not always match files with a JPN name #20

Closed. ThisIsntTheWay closed this issue 3 months ago.

ThisIsntTheWay commented 3 months ago

When filtering subtitles with the `episode` parameter, files are often not returned if their names use the Japanese title of the show.
This behavior is mostly seen with live-action entries, whose files usually follow this naming scheme:
<show_name_jpn>#<episode_number>.srt

An example using the following entry:

{
  "id": 4583,
  "name": "#Remolove: Futsuu no Koi wa Jado",
  "english_name": "#Remolove",
  "japanese_name": "#リモラブ 〜普通の恋は邪道〜"
}

Looking at the files (list truncated), only subs with Japanese titles are available.

[
  {"name": "#リモラブ~普通の恋は邪道~#10.srt"},
  {"name": "#リモラブ~普通の恋は邪道~#03.srt"},
  {"name": "#リモラブ~普通の恋は邪道~#05.srt"}
]

Searching for episode number 3 using the API doesn't return anything.
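
For reference, the request being described can be sketched as below. This is only an illustration: the `https://jimaku.cc/api/entries/{id}/files` path and the helper name `files_url` are assumptions, and only the `/files` endpoint name and the `episode` parameter come from this issue. Authentication and the actual HTTP call are left out.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed API root; not confirmed anywhere in this thread.
BASE_URL = "https://jimaku.cc/api"


def files_url(entry_id: int, episode: Optional[int] = None) -> str:
    """Build the (assumed) URL for listing an entry's files.

    When `episode` is given, it is sent as the `episode` query parameter,
    which is the filter this issue is about.
    """
    url = f"{BASE_URL}/entries/{entry_id}/files"
    if episode is not None:
        url += "?" + urlencode({"episode": episode})
    return url


print(files_url(4583, episode=3))
# https://jimaku.cc/api/entries/4583/files?episode=3
```

With entry 4583, a request like this is the one that comes back empty.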

As a test, I reuploaded the subtitle for episode 5 using the full English name:
#Remolove Futsuu no Koi wa Jado #05.srt

Now, searching for episode 5 actually returns something:

[
  {"name": "#Remolove Futsuu no Koi wa Jado #05.srt"}
]

Interestingly enough, doing the same with the following entry...

{
  "id": 240,
  "name": "2.43: Seiin Koukou Danshi Volley-bu",
  "english_name": "2.43: Seiin High School Boys Volleyball Team",
  "japanese_name": "2.43 清陰高校男子バレー部"
}

...which has files with both the English and Japanese names of the show, it actually works fine.
Here's what you get back when searching for episode 5:

[
  {"name": "2.43.清陰高校男子バレー部.S01E05.スタンド・バイ・ミー.WEBRip.Netflix.ja[cc].srt"},
  {"name": "Ni Ten Yonsan Seiin Koukou Danshi Volley-bu 005.srt"}
]

This led me to assume that the S01E05 substring is the key to a successful match, but it turns out that is not necessarily the case.
As a test, I uploaded yet another subtitle to entry 4583 using the JPN name, this time with a season/episode format:
#リモラブ~普通の恋は邪道~#S01E03.srt

Unfortunately, searching for episode 3 still returns [].

I am aware that matching is done on a best effort basis, but perhaps the filtering mechanism could be improved?
Other example entries to test with: 2445, 4420

Rapptz commented 3 months ago

This is powered by a parser I (re)wrote called anitomy-rs. Modifying this parser is a bit of an annoyance, which is why matching is provided on a best-effort basis.

This probably belongs on that repository and not here though, even though the API is using it. But it's okay to just leave it here since it's already been made.


After some quick investigation, the problem with these files is that the # is actually a full-width number sign (＃, U+FF03) rather than a normal-width one (#, U+0023). The naive fix (just including the full-width # as well) does fix it, but it breaks other things because of how brittle filename parsing is.
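
The two characters look nearly identical but are distinct code points, which is easy to check. The snippet below is a small illustration using only Python's standard library; the helper `has_fullwidth_hash` is a hypothetical name and is not part of anitomy-rs.

```python
import unicodedata

ASCII_HASH = "#"           # U+0023
FULLWIDTH_HASH = "\uFF03"  # U+FF03, renders as a wide '#'

# Print the code point and official Unicode name of each character.
for ch in (ASCII_HASH, FULLWIDTH_HASH):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0023 NUMBER SIGN
# U+FF03 FULLWIDTH NUMBER SIGN


def has_fullwidth_hash(filename: str) -> bool:
    """Return True if the filename contains a full-width number sign."""
    return FULLWIDTH_HASH in filename


# NFKC normalization folds the full-width form into the ASCII one.
# This is shown only to illustrate the relationship between the two
# characters; it is not claimed to be how the parser handles them.
print(unicodedata.normalize("NFKC", FULLWIDTH_HASH) == ASCII_HASH)
# True
```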

Rapptz commented 3 months ago

This is fixed now.

This required a few changes:

  1. The anitomy-rs parser had to be aware of full-width #, which was not detected before.
  2. Because there was no space between the elements, the parser could not tokenize the string properly, which led to the weird behaviour you were describing. To fix this, a space had to be prepended to the full-width # so that the parser properly detects the episode number.
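
The renaming described in the second point can be sketched roughly as follows. This is not the script that was actually run: the function name and the exact rule (insert a space before any full-width # that directly follows a non-space character) are assumptions made for illustration.

```python
import re

FULLWIDTH_HASH = "\uFF03"  # full-width number sign

# Match a full-width '#' that directly follows a non-whitespace character.
# A '#' at the very start of the name has nothing before it and is skipped.
_UNSPACED_HASH = re.compile(r"(?<=\S)" + re.escape(FULLWIDTH_HASH))


def add_space_before_fullwidth_hash(filename: str) -> str:
    """Insert a space before each unspaced full-width '#' in a filename."""
    return _UNSPACED_HASH.sub(" " + FULLWIDTH_HASH, filename)


# Hypothetical filename in the <show_name_jpn>#<episode_number>.srt scheme.
old_name = "リモラブ" + FULLWIDTH_HASH + "03.srt"
new_name = add_space_before_fullwidth_hash(old_name)
print(new_name == "リモラブ " + FULLWIDTH_HASH + "03.srt")
# True
```

The substitution is idempotent: a name that already has the space is left unchanged, so running it again over renamed files does no harm.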

I went ahead and ran a script to update all 19,550 failing subtitles with the new names. Going forward, new subtitles scraped from jpsubbers will automatically have the space added so this doesn't happen in the future.