YAMJ / yamj-v2

Yet Another Movie Jukebox (YAMJ) v2
GNU General Public License v3.0

REGEX for tv/movie detection in a config file.. #993

Open Omertron opened 9 years ago

Omertron commented 9 years ago

Original issue 994 created by Omertron on 2009-09-13T19:18:53.000Z:

I would love to be able to convert all my regex for TV and movie detection. It would be great if a user could add to the detection options instead of them being hardcoded.

Please add a method to add additional regex detection to YAMJ via a config file or some external method. :)
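One way such a feature could look, as a rough Python sketch and not anything YAMJ actually does: read one pattern per line from a text file and compile each one. The file layout (one regex per line, `#` for comment lines) and the `load_patterns` helper are assumptions made up for illustration.

```python
import re

def load_patterns(text):
    """Compile one regex per non-empty line; '#' lines are comments (assumed format)."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        try:
            patterns.append(re.compile(line, re.IGNORECASE))
        except re.error:
            # Skip a pattern that does not compile rather than abort the scan.
            continue
    return patterns

# Hypothetical config contents; in practice this would be read from a file.
EXAMPLE_CONFIG = r"""
# season/episode formats, most specific first
s(?P<season>\d{2})e(?P<episode>\d{2})
(?P<season>\d+)x(?P<episode>\d+)
"""

if __name__ == "__main__":
    for pattern in load_patterns(EXAMPLE_CONFIG):
        print(pattern.pattern, bool(pattern.search("himom S01E02")))
```

Keeping the patterns in file order lets the user decide which format is tried first.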

Omertron commented 9 years ago

Comment #1 originally posted by Omertron on 2010-02-28T08:46:27.000Z:

I think you are talking about the S00E00 and 0x00 formats? Can you provide an example or two?

Omertron commented 9 years ago

Comment #3 originally posted by Omertron on 2010-03-17T21:42:10.000Z:

These are the exact checks dentedboxes/phpYAMJ can do. I have others for full path if you have a way to deal with duplicate basenames:

I haven't revised this code to remove the ?<name> parts yet; it's a php 5.24+ thing, but you can just pull that part out and it won't affect the regex at all. The regex is based on Perl regular expressions.

NOTE: it's designed for the filename to already have data you care about removed from it, and it doesn't try to find an episode name. Leave the year intact, though; it makes some of the episode numbering possible.

I clean the filename by changing ., - and _ to spaces just before I run this..
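That cleaning step is small enough to sketch in one line of Python. The function name `clean_filename` is made up here; only the three replaced characters come from the description above.

```python
def clean_filename(name):
    """Replace '.', '-' and '_' with spaces before any pattern is tried."""
    return name.translate(str.maketrans("._-", "   "))

if __name__ == "__main__":
    print(clean_filename("himom.s01e02_720p-grp"))
```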

The regex came from a collection of renamers and auto-recording default formats.. it should work with scene names since those are just the auto-recording followed by auto-posting..

Year can be 2009, (2009), or [2009]: 4 digits long, starting with 19 or 20. There must be something between the name and the year, like a ., _, - or space.
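As a rough Python rendering of that year rule (a sketch, not the phpYAMJ code itself): Python spells named groups `(?P<name>...)`, and "starts with 19 or 20" is written as the alternation `(?:19|20)`. The group names `name` and `year` and the helper `split_name_year` are assumptions for illustration.

```python
import re

# Tried in order: (2009), [2009], bare 2009, then name only.
NAME_YEAR_PATTERNS = [
    r"^(?P<name>.+).\((?P<year>(?:19|20)\d{2})\)",
    r"^(?P<name>.+).\[(?P<year>(?:19|20)\d{2})\]",
    r"^(?P<name>.+).(?P<year>(?:19|20)\d{2})",
    r"^(?P<name>.+)",
]

def split_name_year(cleaned):
    """Return (name, year); year is None when no year format matched."""
    for pattern in NAME_YEAR_PATTERNS:
        match = re.search(pattern, cleaned)
        if match:
            groups = match.groupdict()
            return groups["name"], groups.get("year")
    return cleaned, None

if __name__ == "__main__":
    for title in ("Some Movie (2009)", "Some Movie [2009]", "Some Movie 2009", "Some Movie"):
        print(title, "->", split_name_year(title))
```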

   $tvexpr = array("/^(?<name>.+).\((?<year>(?:19|20)\d{2})\)",
                   "/^(?<name>.+).\[(?<year>(?:19|20)\d{2})\]",
                   "/^(?<name>.+).(?<year>(?:19|20)\d{2})",
                   "/^(?<name>.+)");

   // se checks: this order is important for false positives..
   //   s##e##   S##xE##   S##.E##   S## E##
   //   ##x##
   //   ## ##
   //   ##-##
   //   ## x ##
   //   ### (first digit season)
   $seexpr = array(".s(?<season>\d{2})e(?<episode>\d{2})",
                   ".s(?<season>\d{2})xe(?<episode>\d{2})",
                   ".s(?<season>\d{2}).e(?<episode>\d{2})",
                   ".s(?<season>\d{2}) e(?<episode>\d{2})",
                   ".(?<season>\d+)x(?<episode>\d+)",
                   ".(?<season>\d+) (?<episode>\d+)",
                   ".(?<season>\d+)-(?<episode>\d+)",
                   ".(?<season>\d+) x (?<episode>\d+)",
                   "\b(?<season>\d{1})(?<episode>\d{2})\b");

The arrays are run in a double loop...

  loop1 cycles the name/year formats.
  loop2 adds the season numbering to the string.
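The double loop can be sketched in Python roughly like this. It is an approximation of the description, not the actual phpYAMJ loop: `detect_tv` is a made-up name, the pattern lists are shortened, and the named groups use Python's `(?P<name>...)` spelling.

```python
import re

def detect_tv(cleaned, name_year_patterns, season_episode_patterns):
    """Try every name/year prefix with every season/episode suffix, in order."""
    for prefix in name_year_patterns:            # loop1: name/year formats
        for suffix in season_episode_patterns:   # loop2: season numbering appended
            match = re.search(prefix + suffix, cleaned, re.IGNORECASE)
            if match:
                return match.groupdict()
    return None

# Shortened pattern lists, just enough to show the looping.
NAME_YEAR = [
    r"^(?P<name>.+).(?P<year>(?:19|20)\d{2})",
    r"^(?P<name>.+)",
]
SEASON_EPISODE = [
    r".s(?P<season>\d{2})e(?P<episode>\d{2})",
    r".(?P<season>\d+)x(?P<episode>\d+)",
]

if __name__ == "__main__":
    print(detect_tv("himom 2009 s01e02", NAME_YEAR, SEASON_EPISODE))
    print(detect_tv("Some Movie 2009", NAME_YEAR, SEASON_EPISODE))
```

A result of `None` means no combination matched, which is the case where the file would be treated as a movie.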

The name/year patterns start with a delimiter of / and, not shown, the loop adds a closing
delimiter of /i to the end of the string for case-insensitive checking per Perl regex
rules, to catch any combination of S E, s e, S e, etc. The ^ forces the first
character of name to be the start of the string. I've been meaning to go back and
rework the name regex, but I may just skip it because it's been rock solid so far.
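To see the case-insensitive matching and the first-match-wins ordering in isolation, here is a small Python sketch of the season/episode list on its own. `re.IGNORECASE` plays the role of the `/i` modifier; searching without the name/year prefix and the helper name `find_season_episode` are simplifications assumed for the example.

```python
import re

# Same order as the season/episode list: most specific formats first.
SEASON_EPISODE_PATTERNS = [
    r".s(?P<season>\d{2})e(?P<episode>\d{2})",
    r".s(?P<season>\d{2})xe(?P<episode>\d{2})",
    r".s(?P<season>\d{2}).e(?P<episode>\d{2})",
    r".s(?P<season>\d{2}) e(?P<episode>\d{2})",
    r".(?P<season>\d+)x(?P<episode>\d+)",
    r".(?P<season>\d+) (?P<episode>\d+)",
    r".(?P<season>\d+)-(?P<episode>\d+)",
    r".(?P<season>\d+) x (?P<episode>\d+)",
    r"\b(?P<season>\d{1})(?P<episode>\d{2})\b",
]

def find_season_episode(cleaned):
    """Return (season, episode) from the first pattern that matches, else None."""
    for pattern in SEASON_EPISODE_PATTERNS:
        match = re.search(pattern, cleaned, re.IGNORECASE)
        if match:
            return int(match.group("season")), int(match.group("episode"))
    return None

if __name__ == "__main__":
    for title in ("himom S01E02", "himom s01 e02", "himom 3x07", "himom 102"):
        print(title, "->", find_season_episode(title))
```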

FALSE POSITIVE WARNING:
In all my test cases, I only ran into 1 movie that came up as a TV show when it
wasn't: Iron Maiden Flight 666 came up as a TV show, s6e66. The way I handle it is:
if I can't find the TV show name with a scraper, the S and E are put back into the
name and the search is done against the movie sites. If I get a hit, I switch it
from a TV show to a movie and continue scraping.
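That fallback can be sketched as follows. The two lookup callables are hypothetical stand-ins for whatever scraper is in use (each takes a title and returns True on a hit), and keeping the TV guess when neither site knows the title is an assumption, since the comment only describes the switch on a movie hit.

```python
def resolve_type(cleaned, tv_name, tv_lookup, movie_lookup):
    """Return ("tv", title) or ("movie", title).

    cleaned is the full cleaned name with the digits still in it;
    tv_name is the name with the season/episode part stripped.
    """
    if tv_lookup(tv_name):
        return ("tv", tv_name)
    # No TV hit: search the movie sites with the digits back in the title.
    if movie_lookup(cleaned):
        return ("movie", cleaned)
    # Neither matched; keep the original TV guess (assumed behaviour).
    return ("tv", tv_name)

if __name__ == "__main__":
    known_movies = {"Iron Maiden Flight 666"}
    print(resolve_type("Iron Maiden Flight 666", "Iron Maiden Flight",
                       tv_lookup=lambda title: False,
                       movie_lookup=lambda title: title in known_movies))
```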

The #### format (first 2 digits as the season) is not detected. I ran into a little word boundary
issue on Windows computers with some versions of PHP and I haven't had time to
revise it. The year is specifically designed around the idea that movies need to have
a year from 1900 to 2099, so technically you should be able to catch episodes outside
of that range, and only Simpsons, some soap operas, and some longer-running UK shows
like MasterChef or Dr Who might have a problem.

If I get a hit on any of these regexes, I go on to extract multi-season and multi-episode numbering separately.

The last thing I do, if there was a hit, is take the original cleaned name and,
starting at the character after what the regex matched, look for -episodename,
which I only use if I can't scrape a name (actually it's the user's choice). EX: himom
s01e02 fjldsfjldjlfj -name. I take the substring that PHP's preg_match
returns to find out that the match was 12 characters long, so I look for a - starting
at character 13 and take everything after it as the episode name.
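The same idea in Python, as a hedged sketch: `match.end()` gives the length of the matched prefix, and the search for `-` starts right after it. The helper name `episode_name_after_match` is invented, and the example assumes the dash is still present in the string being searched.

```python
import re

PATTERN = r"^(?P<name>.+).s(?P<season>\d{2})e(?P<episode>\d{2})"

def episode_name_after_match(cleaned, match):
    """Return the text after the first '-' that follows the match, or None."""
    rest = cleaned[match.end():]
    dash = rest.find("-")
    if dash == -1:
        return None
    return rest[dash + 1:].strip() or None

if __name__ == "__main__":
    cleaned = "himom s01e02 fjldsfjldjlfj -name"
    match = re.search(PATTERN, cleaned, re.IGNORECASE)
    print(match.end(), episode_name_after_match(cleaned, match))
```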

I've been considering adding suffixes to look for the name - s#e# - data -episode
name format, but I haven't had time, and it doesn't appear to get in the way of the episode name
per some of the renamers that default to that format (like sabnzbd)..

Omertron commented 9 years ago

Comment #4 originally posted by Omertron on 2010-03-18T09:19:02.000Z:

Thanks Accident, I've updated the REGEX to match those season/episode combinations, so it should be a lot better.

This is a useful page for testing REGEX for Java: http://www.regexplanet.com/simple/index.html

I've also updated the TVSeriesNaming wiki page with the new formats