FabianBeiner / PHP-IMDB-Grabber

This PHP library enables you to scrape data from IMDB.com.
MIT License

Wrong IMDB_MOVIE_DESC regex pattern for movies without a description #166

Closed PrinceOfAbyss closed 3 years ago

PrinceOfAbyss commented 3 years ago

While I was testing some movies, I stumbled upon the extremely rare case where a movie lacks a description...

In that case, something very strange happens. First, the pattern fails to match the default text that IMDB puts in place of the description (Know what this is about? Be the first one to add a plot.), because of the newlines inside the <a href markup of that text. Then a totally unexpected fallback takes place: the PCRE engine behaves as if the s flag had been activated, and the pattern matches a whole block of text from elsewhere in the page.
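As a rough illustration of the newline problem, here is a minimal sketch. The markup and the simplified pattern are made up for this example and are not the library's actual IMDB_MOVIE_DESC; the point is only that `.` cannot cross a newline unless the s flag is set.

```php
<?php
// Hypothetical markup: the placeholder link is split across several lines.
$html = "<div>\nKnow what this is about?<br>\n<a href=\"#\"\n>Be the first one to add a plot.\n</a>\n</div>";

// Without the s flag, "." stops at newlines, so the capture group
// cannot span the multi-line <a> markup and the match fails.
$withoutS = preg_match('~<div>\s*(.*)\s*</div>~Ui', $html, $m1);

// With the s flag, "." also matches newlines and the whole block is captured.
$withS = preg_match('~<div>\s*(.*)\s*</div>~Uis', $html, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
```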

At this point, please take a careful look at this video I recorded for you, which is almost self-explanatory. Notice at 2:00 of the video how the scraped text

    [getDescription] => Array
        (
            [name] => Description
            [value] => Trelles diakopes tou thiriotrofeiou (1985) if ('csm' in window) { csm.measure('csm_body_delivery_started'); } Reference View | Change View 1h 12min | Comedy | 1985 (Greece) | Video | 0 Rate 1 Rate 2 Rate 3 Rate 4 Rate 5 Rate 6 Rate 7 Rate 8 Rate 9 Rate 10 Rate 0 Error: please try again. | Know what this is about? Be the first one to add a plot.
        )

can be seen at the very beginning of the matched text at 1:20 of the video, as soon as I deliberately activated the s flag.

If you google the term csm_body_delivery_started, you will find a whole bunch of results where movies that lacked a description were evidently scraped, and no one noticed the wrong text that was returned by this or similar classes!

Now for the good news: I've come up with a corrected pattern that fixes this bug.

'~<section class="titlereference-section-overview">\s+<div>\s*+(.*)\s*?</div>\s+<hr>\s+<div class="titlereference-overview-section">~Uis'

The above pattern matches (as you can see in the video) the description, as well as the text that IMDB puts in place of a description, which prompts the user to enter a description of their own.
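A quick way to try the pattern outside the library is sketched below. The HTML snippet and the description text are invented stand-ins for the reference-view markup, and the variable names are hypothetical; only the pattern itself is taken from this report.

```php
<?php
// The pattern proposed in this issue, unchanged.
$pattern = '~<section class="titlereference-section-overview">\s+<div>\s*+(.*)\s*?</div>\s+<hr>\s+<div class="titlereference-overview-section">~Uis';

// Hypothetical sample markup (assumed structure, made-up description text).
$html = <<<'HTML'
<section class="titlereference-section-overview">
    <div>
        A made-up plot summary used only for this test.
    </div>
    <hr>
    <div class="titlereference-overview-section">
HTML;

$description = null;
if (preg_match($pattern, $html, $matches) === 1) {
    // trim() is purely defensive; the surrounding \s quantifiers
    // should already keep whitespace out of the capture group.
    $description = trim($matches[1]);
}

var_dump($description);
```

Because of the U flag, `(.*)` is lazy and `\s*?` is greedy, so the capture stops at the first closing `</div>` that is followed by the `<hr>` and the overview section.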

After that, all that is left is to preg_match the result against something like Know what this is about? or https://contribute.imdb.com/updates?update= to tell the prompt apart from an actual description, and to return the $sNotFound text when the prompt is found.
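That check could look roughly like the sketch below. The function name `descriptionOrNotFound` is hypothetical, and the default value `'n/A'` for `$sNotFound` is an assumption made for this example rather than something confirmed here.

```php
<?php
/**
 * Hypothetical helper: returns $sNotFound when the captured text is
 * IMDB's "add a plot" prompt, otherwise the cleaned-up description.
 */
function descriptionOrNotFound(string $sMatch, string $sNotFound = 'n/A'): string
{
    // Either marker identifies the prompt; "?" and "." are escaped
    // because they are regex metacharacters.
    $isPrompt = preg_match(
        '~Know what this is about\?|contribute\.imdb\.com/updates\?update=~i',
        $sMatch
    ) === 1;

    return $isPrompt ? $sNotFound : trim(strip_tags($sMatch));
}

// Usage with made-up inputs:
echo descriptionOrNotFound('Know what this is about?<br> <a href="https://contribute.imdb.com/updates?update=x">Be the first one to add a plot.</a>'), "\n";
echo descriptionOrNotFound('  A made-up <b>plot</b> summary.  '), "\n";
```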

Please also note that I suspect the same bug affects IMDB_SERIES_DESC whenever a series lacks a description. I can't confirm this, as I couldn't find any series without a description, so it is only an educated guess based on how similar the two patterns look.

bla0r commented 3 years ago

Thanks for the detailed description 👍🏼 I will fix it.

fixed #167

FabianBeiner commented 3 years ago

@PrinceOfAbyss: Seriously, this was by far the best bug report this project ever saw. Thanks a lot!

@bla0r: Again, thanks for taking care. :)