Remove "based on" text from AniDB descriptions

rcdailey commented 6 years ago

The series description for almost all AniDB TV shows looks something like this:

* Based on a comedy seinen manga series written and illustrated by Ootake Masao.

One night, a strange object falls on the head of Nitta, a member of the yakuza. Inside the box is a strange young girl named Hina. She has tremendous supernatural powers, and Nitta finds himself reluctantly taking her in. Her powers can come in handy for his yakuza business, but he also runs the risk of her using them on him! Not to mention, if she doesn't use her powers, she will eventually go berserk and destroy everything around her. Nitta and Hina's strange life together is just beginning...

(Link)

It would be nice if the agent stripped the "* Based on a comedy seinen manga series written and illustrated by Ootake Masao." line, along with any blank lines after it. The reason I'd like to see it removed is that it pushes the meaningful description down too far when viewing it through Plex. It's basically two whole lines of wasted space.

Not sure if you agree or not, or whether or not this could be reliably done with some sort of regex (maybe any lines starting with *?).

ZeroQI commented 6 years ago

@rcdailey I suppose I can make it an agent setting option: 'AniDB summary note removal'. Will do at the weekend.

Are you using a phone/tablet? The web page should have plenty of space.

purposelycryptic commented 6 years ago

I don't suppose you could have it remove the summary source citation and the special preview airing note too? E.g.:

Uranohoshi Girls' High School, a private school in the seaside neighborhood of Uchiura at Numazu city, Shizuoka prefecture.

A small high school in a corner of Suruga Bay, it is home to nine teens, led by second-year student Chika Takami, driven by one seriously big dream:

To become the next generation of bright, sparkling "school idols"!

As long as we don't give up, any dream can come true... All we have to do now is keep pushing hard for glory!

Now their "School Idol Project" begins to make their dreams come true!

Source: crunchyroll

Note: The first episode received an early screening at a special event on 27.03.2016, first and second episodes - on 02.04.2016. The regular TV broadcast started on April 7, 2016.

(Link)

They are both always at the end, always in that order, and always follow that 'Source: XXXXXXXX' and 'Note:YYYYYYY' format, so you could probably just wipe out everything including and after the 'Source:' part - some sort of RegEx like this, maybe?

/(\nSource:)\w*[\w\W]*/

There is probably a more elegant formulation, but my RegEx is terrible and my brain tired, and that seemed to nuke everything pretty well.
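
A minimal sketch of that idea in Python, assuming the 'Source:' line always starts on its own line and that everything after it can be discarded. The function name strip_trailing_source is hypothetical and not part of the agent, and the sample is shortened from the summary above:

```python
import re

def strip_trailing_source(summary):
    """Drop everything from the first line starting with 'Source:' onward."""
    # \n+ also swallows the blank line(s) before 'Source:';
    # [\w\W]* matches any character, newlines included, up to the end of the text.
    return re.sub(r'\n+Source:[\w\W]*', '', summary)

sample = ('Now their "School Idol Project" begins to make their dreams come true!\n'
          '\n'
          'Source: crunchyroll\n'
          '\n'
          'Note: The first episode received an early screening at a special event.')

print(strip_trailing_source(sample))
```

Text without a 'Source:' line passes through unchanged, so a standalone 'Note:' would need its own pattern.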

The Plex WebUI just allocates way too little space for series' summaries (See here for example) :-/

sven-7 commented 6 years ago

Interesting idea. That’s exactly why I have it set to order TVDB series descriptions over AniDB. If the AniDB ones could get cleaned somehow, that’d be a clever feature.

EndOfLine369 commented 6 years ago

All you need is the pattern "^(\* Based on |Source: |Note: )", applied line by line, where 'string' is the text of the <description> tag. Use it in something like the line below:

"\n".join([line for line in string.split("\n") if not re.search("^(\* Based on |Source: |Note: )", line, re.IGNORECASE) ])

Example:

>>> import re
>>> string = """* Based on a four-panel romantic comedy manga by http://anidb.net/cr52647 [Nekoume].
... http://anidb.net/ch92320 [Shiina Aki] is constantly being treated like a girl due to his feminine looks so he decides to move to Tokyo and attend middle school in an attempt to change himself.
... However what awaits him in his new home, Sunohara-sou, is the kind-hearted caretaker, http://anidb.net/ch92318 [Sunohara Ayaka] and three female members of Aki`s new middle school`s student council, named http://anidb.net/ch95079 [Yukimoto Yuzu], http://anidb.net/ch95078 [Yamanashi Sumire], and http://anidb.net/ch95080 [Kazami Yuri].
... And so begins Aki`s new life of living with four girls in Tokyo.
... Source: M-U
... Note: The complete edition with something"""
>>>
>>> "\n".join([line for line in string.split("\n") if not re.search("^(\* Based on |Source: |Note: )", line, re.IGNORECASE) ])
'http://anidb.net/ch92320 [Shiina Aki] is constantly being treated like a girl due to his feminine looks so he decides to move to Tokyo and attend middle school in an attempt to change himself.\nHowever what awaits him in his new home, Sunohara-sou, is the kind-hearted caretaker, http://anidb.net/ch92318 [Sunohara Ayaka] and three female members of Aki`s new middle school`s student council, named http://anidb.net/ch95079 [Yukimoto Yuzu], http://anidb.net/ch95078 [Yamanashi Sumire], and http://anidb.net/ch95080 [Kazami Yuri].\nAnd so begins Aki`s new life of living with four girls in Tokyo.'
>>>

purposelycryptic commented 6 years ago

@EndOfLine369 I appreciate you posting a nice and clean implementation, though I was unable to get it working, this being my first attempt at anything Python :-| Also, and I could very well be wrong here, so my apologies if I am, I think it should be single quotes around the regex part, i.e.:

"\n".join([line for line in string.split("\n") if not re.search('^(\* Based on |Source: |Note: )', line, re.IGNORECASE) ])

Anyway, I ended up just integrating it into the "Internal AniDB Link Removal" function in __init__.py, as:

description = re.sub(r'http://anidb\.net/[a-z]{1,2}[0-9]+ \[(.+?)\]', r'\1', re.sub(r'^\* B.*\n+|\nSource:\w*[\w\W]*|\nNote:\w*[\w\W]*()', "", getElementText(anime, 'description'))).replace("`", "'")

And, since there was a highly similar function in AniDB.py, the same there, just in case:

> if SaveDict( re.sub(r'http://anidb\.net/[a-z]{1,2}[0-9]+ \[(.+?)\]', r'\1', re.sub(r'^\* B.*\n+|\nSource:\w*[\w\W]*|\nNote:\w*[\w\W]*()', "", GetXml(xml, 'description'))).replace("`", "'"), AniDB_dict, 'summary')

And it seems to work as intended without affecting the original function, i.e.:

Raw AniDB Summary:

Processed AniDB Summary:

It's mainly beneficial when the summary isn't expanded, as in this (Pre-Change): Vs. this (Post-Change):

Since I don't really know what I'm doing, I wasn't entirely sure how your implementation would affect the paragraph spacing: whether it would remove all blank lines, leave the ones before/after the cut parts, or end up the same as mine (remove blank lines before/after cuts at bottom/top, respectively, but leave paragraph breaks). I really wanted to preserve the paragraph breaks but remove the rest, and since I couldn't get your version to run until I had essentially finished mine (and realized the single-quote regex issue), here we are.
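
To make the spacing question concrete, here is a small sketch comparing the two approaches on a made-up summary with blank lines between paragraphs. The sample text and the two helper names (filter_lines, cut_regex) are illustrative assumptions; the regexes are the ones posted in this thread.

```python
import re

def filter_lines(text):
    # Line-by-line filter: drops only the matching lines, blank lines are untouched.
    return "\n".join(line for line in text.split("\n")
                     if not re.search(r'^(\* Based on |Source: |Note: )', line, re.IGNORECASE))

def cut_regex(text):
    # Single substitution: removes a leading '* B...' line plus the newlines after it,
    # and everything from '\nSource:' or '\nNote:' to the end of the text.
    return re.sub(r'^\* B.*\n+|\nSource:\w*[\w\W]*|\nNote:\w*[\w\W]*()', "", text)

sample = ("* Based on a manga.\n"
          "\n"
          "First paragraph.\n"
          "\n"
          "Second paragraph.\n"
          "\n"
          "Source: example\n"
          "\n"
          "Note: example note.")

print(repr(filter_lines(sample)))
print(repr(cut_regex(sample)))
```

In this sketch the line filter keeps the blank lines that surrounded the removed lines (a leading newline and two trailing ones), while the substitution removes the blank line after the leading note and leaves a single trailing newline. Both keep the paragraph break in the middle, and a final .strip() would tidy either result.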

Great learning experience :-) ...although I spent at least two hours trying to troubleshoot the stupid thing because I didn't know that Python takes issue with tab indentation, which Notepad++ was automatically inserting... So, yeah... like I said, this is all new to me.

Edit: So, it turns out some series have a 'Note:' section without a 'Source:' section. I've updated the code to reflect that.

Edit 2: Apparently I'm terrible at copy/pasting things, leading to text not being properly replaced with nothingness. Fixed now.

ZeroQI commented 6 years ago

@purposelycryptic Thanks for the detailed post. The "Processed AniDB Summary" is actually the TheTVDB one: https://www.thetvdb.com/series/library-war

In the year 2019, the explosion of information and misinformation became a direct threat to society. In a daring decision, it was decided to create a new government agency dedicated solely to information management. Now some thirty years later, the government still monitors and controls information, suppressing anything they find undesirable, but standing against their abuses of power are the libraries, with their special agents called ‘the book soldiers.’

I have included the changes:

if SaveDict(summary_sanitizer(GetXml(xml, 'description')), AniDB_dict, 'summary') and not movie and Dict(mappingList, 'defaulttvdbseason').isdigit() and mappingList['defaulttvdbseason'] in media.seasons:
  SaveDict(AniDB_dict['summary'], AniDB_dict, 'seasons', mappingList['defaulttvdbseason'], 'summary')

And made this function for clarity

import re

def summary_sanitizer(summary):
  summary = summary.replace("`", "'")                                                      # Replace backquote with single quote
  summary = re.sub(r'http://anidb\.net/[a-z]{1,2}[0-9]+ \[(.+?)\]',       r'\1', summary)  # Replace AniDB links with their link text
  summary = re.sub(r'^\* B.*\n+|\nSource:\w*[\w\W]*|\nNote:\w*[\w\W]*()', "",    summary)  # Remove leading '* B...' line and everything from Source:/Note: onward
  return summary
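
As a quick check, the function can be run on a shortened version of the sample from earlier in the thread. This is only a sketch: the input string is trimmed for illustration, and the function body is repeated so the snippet runs on its own.

```python
import re

def summary_sanitizer(summary):
    summary = summary.replace("`", "'")  # backquote -> single quote
    summary = re.sub(r'http://anidb\.net/[a-z]{1,2}[0-9]+ \[(.+?)\]', r'\1', summary)  # keep link text only
    summary = re.sub(r'^\* B.*\n+|\nSource:\w*[\w\W]*|\nNote:\w*[\w\W]*()', "", summary)  # drop notes
    return summary

raw = ("* Based on a four-panel romantic comedy manga by http://anidb.net/cr52647 [Nekoume].\n"
       "And so begins Aki`s new life of living with four girls in Tokyo.\n"
       "Source: M-U\n"
       "Note: The complete edition with something")

cleaned = summary_sanitizer(raw)
print(cleaned)
```

Only the story sentence should remain, with the backquote in "Aki`s" turned into an apostrophe.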

Pro-actively closing. If not totally resolved, reopen and comment please

Notes:

Indentation needs to be consistent: tabs, two spaces, or four spaces, but not a mix.
Strings can use single or double quotes; the outer quotes allow the other quote type inside without escaping it.
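
To illustrate the point about quotes, here is a short sketch showing that Python treats single- and double-quoted strings identically, so the choice of quotes does not change the regex. The raw-string form (r'...') is an addition not mentioned in this thread; it simply keeps the backslash literal without doubling it.

```python
import re

# The same pattern written three ways; all three are equal strings.
single = '^(\\* Based on |Source: |Note: )'
double = "^(\\* Based on |Source: |Note: )"
raw    = r'^(\* Based on |Source: |Note: )'
print(single == double == raw)

# Outer double quotes allow an unescaped single quote inside, and vice versa.
with_apostrophe = "Aki's new life"
with_quotes = 'the "book soldiers"'

# The pattern behaves the same whichever way it was written.
matches = [bool(re.search(p, "Source: M-U", re.IGNORECASE)) for p in (single, double, raw)]
print(matches)
```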

ZeroQI / Hama.bundle

Remove "based on" text from AniDB descriptions #232