duck7000 / imdbGraphQLPHP

IMDb GraphQL API PHP class

Request - Soundtracks update #29

Closed GeorgeFive closed 8 months ago

GeorgeFive commented 1 year ago

Would it be possible to expand the soundtracks function to include IMDb ids, and also clean up the output a bit?

Right now, it returns:

[0] => Performed by Alice in Chains
[1] => Composed by Alice in Chains
[2] => Produced by Alice in Chains

I'd like to see something like this:

[performed_by] = Alice in Chains
[performed_by_id] = 2750498
[composed_by] = Alice in Chains
[composed_by_id] = 2750498
[produced_by] = Alice in Chains
[produced_by_id] = 2750498

I could easily strip out the "Performed by" part with php, sure, but I just think it would be cleaner to let the user decide how to format it. I plan on doing a "Band Name - Song Name" list, and having the repeated "Performed by" would look silly in my use case.

Do note that ID is used in the vast majority of cases, but it can be null... example - https://www.imdb.com/title/tt0089907/soundtrack/

Finally... does IMDb return notes as a separate item? Could we have something like....

[performed_by]
    [0] => T.S.O.L.
[performed_by_id]
    [0] => 1830296
[notes]
    [0] => "Used in Hemdale's 1991 Release in lieu of Dead Beat Dance"

I see that notes are currently returned, but it is currently impossible to tell if an entry is a note or a credit:

[3] => Produced by Simon Heyworth
[3] => (Used in Hemdale's 1991 Release in lieu of Dead Beat Dance)

I know it's a lot, so let me know what you think!

duck7000 commented 1 year ago

Mm that could be complicated.

Possibly this will cause the same problems as the imdbphp soundtrack method. It was a terrible mess to disassemble all the different parts, which is why I asked tboothman to finally simplify that method.

IMDb GraphQL gives a soundtrack title and an array of comments, which can be plain text or HTML. There are no separate parts, so if you want the credits split up I have to analyze that string, cut it into parts and store them in different array items. The big problem is that those parts can have like 10 different names, as explained here: https://help.imdb.com/article/contribution/titles/soundtracks/GKD97LHE9TQ49CZ7#

I will look into this, but it is complicated. It would be a whole lot easier if IMDb separated that data in GraphQL, but sadly they don't.

duck7000 commented 1 year ago

Soundtrack title id is possible; all other ids are difficult as they appear only in the HTML. If you want those I have to search that HTML to see if there are anchor links in it.

The soundtrack title is separate and only in text

There are no credits, only array comments which includes credits and comments, so no i can't tell what is credit and what is comment. Every array element that i get from imdb appears to be one credit/comment line, so they are separated by line as displayed on imdb soundtrack pages

So i can change the GraphQL query to get comments in html but if i have to disassemble this in id's, name's etc it will be very hard

Screenshot_2023-10-22_17-17-22

This is a screenshot from the IMDb GraphQL API. Comments contains everything: comments but also credits. The screenshot is from the HTML version of comments, which contains anchor links for the artist (or performer, writer etc.). The plainText version does not contain those anchor links (like it is now).

So you see that my options are limited. The only option is to disassemble the HTML version, but it can contain more than 10 different names like performed by, written by, under courtesy, written and performed etc.

Look at the old version of soundtrack in imdbphp you will see what i mean https://github.com/tboothman/imdbphp/commit/9dd247233dbdc605b9804ca991e45dc8bc7bb86e

GeorgeFive commented 1 year ago

Hmm.... so I think this may be a little too much work for too little of a feature. I think it would be neat to have, but if it's too much (and it looks to be so), this may be something to put on the shelf until they (hopefully) update the way they return their data for us.

duck7000 commented 1 year ago

Yes, for now this is not simple to do. It is not the amount of work, as I'm prepared to do the work, but it is not easy to change it to what you are after.

I can add the soundtrack title id if that helps (I guess not, as that is not what you are after, I think). IMDb does not use that on the soundtrack pages.

duck7000 commented 1 year ago

If I had a list of all possible types (performed, written, courtesy, written and performed etc.) that would help, but there is no list to be found. So far I guess there are 10 - 14 different types, but I don't know if there are any more. I do know that there are more types than the old imdbphp soundtrack method uses.

Still, it won't be easy, but it would be more possible.

GeorgeFive commented 1 year ago

With it being more complex than I anticipated, I think I'll leave this one up to you. If you want to put in the work and get it working smoothly, of course I wouldn't complain! But I do understand that this is a bit of a niche request that may not help most people, so I would also understand if you wanted to pass on it. I could more than likely handle the data as is on my end and do something with it, it was just the performer IDs that I was most concerned with.

duck7000 commented 1 year ago

Well, it is not impossible to do, but it would be very complex. If you want to do it in your program/app you will find that it is not easy to cover all the different cases/names.

If you want to add a soundtrack in IMDb, for example, there is only a text field. There are no predefined choices, so it is up to the user what gets filled in in the text field. There lies the biggest problem.

Like I said before, I can change it to the HTML version and get the performer id and name (from the anchor link), but the exact title (written by, performed by etc.) is the problem. The output array would also get more complicated, but that is the least of the problems. There is not always an anchor link available, so what if there is not? This will soon become a guessing game, and I don't want to end up like the old imdbphp soundtrack method, as it is too complicated. And even that method did not cover all the cases. Simply assuming that the word "by" is always sufficient as a split point is not always correct.

So yes, for now I will leave it like it is. I use that data as it is in my program, and like you said it is not worth the trouble to try as this is indeed an edge case, sorry.

GeorgeFive commented 1 year ago

Totally understood, no problems. I think I could handle it myself on my end, as I only need the "performed by" part and I can just regex that. ID is an issue, but I can likely set up something to connect them manually as needed.

duck7000 commented 1 year ago

If you have/find a way to do it the right way let me know, as I need help.

duck7000 commented 1 year ago

That ID you refer to is currently not in the soundtrack method; that is only possible if I change it to the HTML version of comments. It is not that hard to check if there is an anchor link in the text and get the name and id. If there is no link it is much harder, as I need a split point to split the name from the role name.
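The anchor-link check described above could look roughly like the sketch below. The function name `extractNameLinks()` is hypothetical and not part of imdbGraphQLPHP, and the `href` layout (`/name/nm.../`) is an assumption based on the IMDb name URLs quoted elsewhere in this thread.

```php
<?php
// Hypothetical helper, not part of imdbGraphQLPHP: collect name/id pairs
// from IMDb name links found in one HTML credit line.
function extractNameLinks(string $html): array
{
    $found = array();
    // Assumed link shape: <a href=".../name/nm1234567/...">Name</a>
    $pattern = '~<a\b[^>]*href="[^"]*/name/(nm\d+)[^"]*"[^>]*>(.*?)</a>~i';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $found[] = array(
                'name' => trim(strip_tags($m[2])), // link text
                'nameId' => $m[1],                 // id including the "nm" prefix
            );
        }
    }
    return $found;
}

$line = 'Performed by <a href="https://www.imdb.com/name/nm0455088/?ref_=ttsnd">Morgana King</a> (uncredited)';
print_r(extractNameLinks($line));
```

Lines without any link simply give an empty array, which is exactly the case where a split point is still needed.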

It is not impossible but i don't see it at the moment

I'm curious how you want to handle that data at your end? If you find a reliable way i might use that to adjust the method itself? At the moment i don't see it, so if you do let me know.

GeorgeFive commented 1 year ago

This was mainly just me thinking out loud, and how I could hack in a bit of a solution on my end (that wouldn't work with this class). I only need performer, so I could regex on the current data that is returned and pull that out (and discard everything else that may be returned, I don't need writers or anything else).

I do store all credits and IDs in my database, so it shouldn't be an issue to select imdb_id where name="Alice In Chains", then probably do some manual clicking if multiple people/ bands with the same name exist.

So I store that connection, and any future soundtrack items for Alice In Chains is already handled.

The only flaw I'm seeing, is if there are multiple bands with the same name... but I guess I could cross that bridge when/ if it happens.

So.... yeah, this is kind of hacky, and wouldn't appeal to everyone, especially if they want all the data. I mainly want to put up a page with what you would see on the back of the CD soundtrack....

  1. Alice In Chains - "Grind"
  2. Nirvana - "Lithium"

Etc., with band names linked to their respective IDs.

duck7000 commented 1 year ago

Ah, I see what you mean now. This is purely your own solution, not a "for all" solution.

I'll think about it again

duck7000 commented 1 year ago

I labeled this "won't fix", but feel free to comment again if you have additional info or a way to fix this.

duck7000 commented 11 months ago

@GeorgeFive is this still something you are interested in?

I figured that I could add an extra credits_raw array (just like imdbphp did) with the HTML version of soundtrack comments/credits? This way the data you want is available, and the general public does not need to use it if they don't want to (like me).

You will have to parse out the data you want yourself as it is too complicated to do it in imdbhp6

GeorgeFive commented 11 months ago

I never used this in imdbphp to be honest, so I'm not sure how they returned the data. Would it be pretty similar to how I mentioned, or a big block of credits in one string, or....? Definitely curious!

duck7000 commented 11 months ago

The data would be like the current credits, with the exception that it is HTML (which includes anchor tags in most cases). You can see what it is like in the screenshot above. Every line will be an array item with what is shown in the HTML.

No, it will not be what you really want, but at least you will have the data. You will have to use a foreach to loop over that data, determine if it is what you want and puzzle out how to get the performer using stripos().

So no, it is not a perfect solution; that would be too hard, too difficult and too complicated.
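A loop of that kind might look like the following minimal sketch, using PHP's `stripos()`. The helper name `findPerformers()` and the sample array are made up for illustration; only lines that start with "Performed by" are picked up.

```php
<?php
// Hypothetical helper: pull the performer text out of a list of credit lines.
function findPerformers(array $credits): array
{
    $prefix = 'Performed by';
    $performers = array();
    foreach ($credits as $line) {
        // Drop anchor tags so HTML and plain-text lines are handled alike
        $line = trim(strip_tags($line));
        // stripos() is case-insensitive; === 0 means the line starts with the prefix
        if (stripos($line, $prefix) === 0) {
            $performers[] = trim(substr($line, strlen($prefix)));
        }
    }
    return $performers;
}

$credits = array(
    'Written by Robbie Robertson',
    'Performed by The Band',
    'Courtesy of Capitol Records',
);
print_r(findPerformers($credits));
```

Combined forms such as "Written and Performed by" do not start with the prefix and are skipped here, which shows why a fixed prefix check only covers the simplest case.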

GeorgeFive commented 11 months ago

Hmmm.... I'd say just hold back on that. I really don't think it's worth it, I could use the normal version if need be. Thanks anyway though!

duck7000 commented 11 months ago

No problem i just thought of it.

duck7000 commented 8 months ago

@GeorgeFive

I'm still thinking about how we can sort out this mess (and yes, it is a mess).

So far I have made a list of the different types and names used in soundtracks. I'm not even sure it is complete, but it might be a starting point.

by types:
Arranged by
Composed by
Performed by
Written by
written and Produced by
Written and Performed by
Composed and Performed by
Arranged and Performed by
Performed and hummed by
Music and lyrics by
Adapted and Arranged by
Produced by
Administered by
Conducted by
by
Recorded by
Also performed by
Music by
Lyrics by
Sung by
Sung a cappella by
Introduction by
Words by
Performed on guitar by
Guitar Solo by
Played on piano by
Hummed by
Whistled by
Libretto by
Words and music by
Written & Produced by
Orchestral Arrangement by
Vocals Produced by
Soloist

Not necessarily persons, most likely company:
Licensed by
All rights administered by
Co-Produced by
Additional production by
Published by
By Arrangement with
Courtesy
Courtesy of
Under license from
Based on
Contains a sample of
Used by permission of
Issued under licence from
A Division of
Licensed courtesy of
Licensed with kind permission of
Licensed with courtesy of
With kind authorization of
Master courtesy of

You are after the performer (a person), so I figured that the last part of this list (companies) is not worth the trouble and would be returned as is with tags stripped? So everything that does not contain the word "by" is treated as text, and might in that case be treated as a comment? This way we can get a bit of separation between credit and comment, although it is not extremely solid.

The first part of the list is of interest. So far I have concluded that if the word "by" is used, the line mentions a person and so possibly an imdbid.

I'm working on an array with all those different names so I can check if a type exists in the array or not and return a decent name, like Performed by would become Performer.

So if I split the string after the word "by", check the first part against the array and use the second part to check for names and ids, it might be possible to get something working.
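As a rough illustration of that split, here is a minimal sketch. `splitCreditLine()` is a hypothetical name, and the rule "no standalone `by` means comment" is the assumption taken from the paragraphs above, not the code that ended up in the class.

```php
<?php
// Hypothetical sketch: split one plain-text line at the first standalone "by".
// Returns null when there is no "by", so the caller can treat the line as a comment.
function splitCreditLine(string $line): ?array
{
    // Lazy (.*?) stops at the first whole word "by"; /i also accepts "By"
    if (preg_match('/^(.*?\bby)\s+(.+)$/i', trim($line), $m)) {
        return array('type' => $m[1], 'rest' => $m[2]);
    }
    return null;
}

print_r(splitCreditLine('Written and Performed by Bob Dylan'));
var_dump(splitCreditLine('Courtesy of Capitol Records'));
```

A line like "By Arrangement with Sony BMG Music Entertainment" splits into the type "By" and the rest of the text, so company lines that happen to start with "By" still need their own handling.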

I'm cautiously optimistic, but there is still a lot of uncertainty with this. Maybe I'll add a parameter that sets if you want creditSplit or not, something like that (this is just a thought for now).

duck7000 commented 8 months ago

@GeorgeFive Can you check if there are any more soundtrack credit types like the ones in the above list? They must contain the word "by", otherwise it would be nearly impossible.

GeorgeFive commented 8 months ago

Sounding pretty good so far! I can't think of any other types off the top of my head, and I've not seen any other new ones in the examples I checked. Maybe this would be a good time to create an error log function? If something isn't processed properly, log it and figure out what the problem is. That would find any stragglers pretty easily, and would be much easier than manually checking for errors.
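One possible shape for such a log, as a sketch only: the function name, message format and log file location are all assumptions, and nothing like this exists in the class at this point in the thread. It relies on PHP's `error_log()` with message type 3, which appends to a file.

```php
<?php
// Hypothetical logger: record credit types that the type list did not recognise.
function logUnknownCreditType(string $type, string $imdbId, string $logFile): void
{
    $entry = sprintf(
        "[%s] unknown soundtrack credit type \"%s\" in tt%s\n",
        date('c'),
        $type,
        $imdbId
    );
    // Message type 3 appends the entry to the given file
    error_log($entry, 3, $logFile);
}

// Assumed log location, purely for illustration
$logFile = sys_get_temp_dir() . '/soundtrack_unknown_types.log';
logUnknownCreditType('Performed and Produced by', '0109506', $logFile);
```

Reviewing that file now and then would surface new "by" types without manually checking soundtrack pages.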

Will probably also have to work around "as" credits here... ie https://www.imdb.com/title/tt4463894/soundtrack/

Also, haven't forgot about title search advanced, just been incredibly busy with building my new server (done!), updating a lot of code that didn't work properly or threw warnings in newer versions of php (done....?), and real-life work schedule (never done). Going to try to get to it tonight when I get home.

duck7000 commented 8 months ago

Well, an error log function could help with future issues, but right now this soundtrack issue is already a huge project to deal with. Maybe later. Actually I have been working on this for days now (not all day of course), so progress is slow.

I thought about the "as" issue and decided to ignore it. Through this person's imdbid you can get that name with the Name class (probably his/her birth name, I guess?). If I must consider this too, it will be even more complicated than it already is. It may be added later, but for now I'll ignore it, sorry. And you said that the performer id is all you want? The imdbid is connected to the name in the anchor tag, not the "as" name.

I have some code working, but it needs a lot of testing before I consider it ready for you to test. It will take a while.

GeorgeFive commented 8 months ago

Yeah, the "as" thing isn't something I'm interested in, I just meant that it's a parsing issue to keep in mind. Wasn't sure if you had seen examples of that.

And yes, the imdb id is all I really need, I plan on using the name class to get all the information I need from that. If you can keep the by type ("Performed by") in the array, that would be cool, but not required.

Thanks for the work!

duck7000 commented 8 months ago

Yes, I have seen those examples. If there is a link I just extract those links, ignoring the rest. Performed by will be in the output array as Performer, same for all other "types". If the type is not in my check array it will be returned as text stripped of the word "by". If there are more types later I can easily add them.

So far the type (performer etc.), name and id (if available) will be in the output array. If there are no links, then the type and the rest of the text will be in the output array.

At the moment I'm dealing with what we consider comments and how they will fit in the output array.

This "project" needs proper testing and probably will not catch all cases, so I hope that with your testing the biggest bugs are going to be crushed. In other words, it will not be perfect, but it will at least find all available links, names, ids and types.

Personally I won't be using this, but in imdbphp it really frustrated me that it was so complicated and the method they used was a real mess, so I want to do better than what they did, hah.

When I consider it working I will link the file here so you can test it before I add it to the project itself.

duck7000 commented 8 months ago

> Performed by will be in the output array as Performer, same for all other "types". If the type is not in my check array it will be returned as text stripped from the word by. If later there are more types i can easily add them.

What do you even want as type? I assumed that 'Performed by' would be converted to Performer, but I never asked if you want that. I can leave it as 'Performed by'; that will save the trouble with that list, as it is returned as I found it in the soundtrack text. In that case the function below is not needed.

    protected function checkSoundtrackType($inputData)
    {
        $str = strtolower($inputData);
        $types = array(
            'composed by' => 'Composer',
            'performed by' => 'Performer',
            'written by' => 'Writer',
            'written and produced by' => 'Writer and Producer',
            'written and performed by' => 'Writer and Performer',
            'composed and performed by' => 'Composer and Performer',
            'arranged and performed by' => 'Arranger and Performer',
            'performed and hummed by' => 'Performer and hummed',
            'music and lyrics by' => 'Music and Lyrics',
            'adapted and arranged by' => 'Adapter and Arranger',
            'produced by' => 'Producer',
            'administered by' => 'administer',
            'conducted by' => 'Conducting',
            'by' => 'By',
            'recorded by' => 'Recording',
            'also performed by' => 'Performer',
            'music by' => 'Music',
            'lyrics by' => 'Lyrics',
            'sung by' => 'Singer',
            'sung a cappella by' => 'Singer A Cappella',
            'introduction by' => 'Introducer',
            'words by' => 'Wording',
            'performed on guitar by' => 'Performer Guitar',
            'guitar solo by' => 'Guitar Solo',
            'played on piano by' => 'Piano',
            'hummed by' => 'Humming',
            'whistled by' => 'Whistler',
            'libretto by' => 'Libretto',
            'words and music by' => 'Words and Music',
            'written & produced by' => 'Writer and Producer',
            'orchestral arrangement by' => 'Orchestral Arranger',
            'vocals produced by' => 'Vocalist'
        );
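        // isset() avoids an undefined-index warning and a null return for unknown types
        if (isset($types[$str])) {
            return $types[$str];
        } else {
            // Unknown type: return the text without the standalone word "by"
            return trim(preg_replace('/\bby\b/', '', $inputData));
        }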
    }

If you do want this function, will you check if the names that I gave them are correct? I'm from the Netherlands and my English is not what it used to be, so some words or text might need adjustments.

Second question: do you want nameId to include nm or only digits? Throughout this library only digits are used, and I have mixed feelings about that. From an int point of view only digits make sense, but from a complete id point of view nm should be included.
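Whichever form is returned, converting between the two is cheap on the caller's side. A minimal sketch with two hypothetical helpers (neither exists in the library), assuming the usual IMDb convention of at least seven digits after `nm`:

```php
<?php
// Hypothetical helpers for converting between the two id styles.
function nameIdDigits(string $id): string
{
    // Keep only the digits: "nm0001168" -> "0001168"
    return preg_replace('/\D/', '', $id);
}

function nameIdFull(string $id): string
{
    // Pad to at least 7 digits so leading zeros lost by an int cast are restored
    return 'nm' . str_pad(nameIdDigits($id), 7, '0', STR_PAD_LEFT);
}

echo nameIdDigits('nm0001168'), "\n";
echo nameIdFull('1168'), "\n";
```

Keeping the id as a string rather than an int matters either way, because an int cast drops the leading zeros.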

duck7000 commented 8 months ago

I found some other types and added them.

        $types = array(
            'composed by' => 'Composer',
            'performed by' => 'Performer',
            'written by' => 'Writer',
            'written and produced by' => 'Writer and Producer',
            'written and performed by' => 'Writer and Performer',
            'composed and performed by' => 'Composer and Performer',
            'arranged and performed by' => 'Arranger and Performer',
            'performed and hummed by' => 'Performer and hummed',
            'music and lyrics by' => 'Musician and Lyricist',
            'adapted and arranged by' => 'Adapter and Arranger',
            'produced by' => 'Producer',
            'administered by' => 'administer',
            'conducted by' => 'Conducting',
            'by' => 'By',
            'recorded by' => 'Recording',
            'also performed by' => 'Performer',
            'music by' => 'Music',
            'lyrics by' => 'Lyrics',
            'sung by' => 'Singer',
            'sung a cappella by' => 'Singer A Cappella',
            'introduction by' => 'Introducer',
            'words by' => 'Wording',
            'performed on guitar by' => 'Performer Guitar',
            'guitar solo by' => 'Guitar Solo',
            'played on piano by' => 'Piano',
            'hummed by' => 'Humming',
            'whistled by' => 'Whistler',
            'libretto by' => 'Librettist',
            'words and music by' => 'Words and Music',
            'orchestral arrangement by' => 'Orchestral Arranger',
            'vocals produced by' => 'Vocalist',
            'licensed by' => 'Licensed',
            'all rights administered by' => 'All Rights Administered',
            'co-produced by' => 'Co-Producer',
            'additional production by' => 'Additional Production',
            'published by' => 'Publisher',
            'vocal and additional production by' => 'Vocals and Additional Production',
            'mixed by' => 'Mixed',
            'traditional, arrangement by' => 'Traditional, Arranged',
            'publishing 1951 by' => 'Published 1951',
            'sung and danced to by' => 'Singer and Dancer',
            'music only played by' => 'Music Only Played',
            'reprised by' => 'Repriser',
            'courtesy of naxos by' => 'Courtesy of Naxos',
            'arranged by' => 'Arranger'
        );

duck7000 commented 8 months ago

Title.php.tar.gz

I have uploaded title.php for you to test (when you have time, of course). It might be an older version, so only test soundtrack. There are 2 versions of credits: credits as they are now, and creditSplit with the split-up variant, so both sets of data are available. I think most cases are covered with names and nameId; the rest is returned either as text or as comment.

Doc blocks are not done yet, so you have to concentrate on the output array.

If there are any problems or you think that something can be improved, feel free to let me know. As I said before it won't be perfect, and my goal was to get the type, name and id of the performer (or other types).

Edit: I started over. The file uploaded here works but is not foolproof (it does not properly deal with creditors without a link). You can test it anyway to get an impression.

duck7000 commented 8 months ago

The output array (from movie 1408) looks like this:

Array
(
    [0] => Array
        (
            [soundtrack] => We've Only Just Begun
            [credits] => Array
                (
                    [0] => Written by Roger Nichols (as Roger S. Nichols) and Paul Williams (as Paul H. Williams)
                    [1] => Performed by The Carpenters
                    [2] => Courtesy of A&M Records
                    [3] => Under license from Universal Music Enterprises
                )

            [creditSplit] => Array
                (
                    [0] => Array
                        (
                            [creditType] => Writer
                            [name] => Roger Nichols
                            [nameId] => nm0629720
                        )

                    [1] => Array
                        (
                            [creditType] => Writer
                            [name] => Paul Williams
                            [nameId] => nm0931437
                        )

                    [2] => Array
                        (
                            [creditType] => Performer
                            [name] => The Carpenters
                            [nameId] => nm1135559
                        )

                )

            [comment] => Array
                (
                    [0] => Courtesy of A&M Records
                    [1] => Under license from Universal Music Enterprises
                )

        )

    [1] => Array
        (
            [soundtrack] => Watching The River Flow
            [credits] => Array
                (
                    [0] => Written and Performed by Bob Dylan
                    [1] => Courtesy of Columbia Records
                    [2] => By Arrangement with Sony BMG Music Entertainment
                )

            [creditSplit] => Array
                (
                    [0] => Array
                        (
                            [creditType] => Writer and Performer
                            [name] => Bob Dylan
                            [nameId] => nm0001168
                        )

                    [1] => Array
                        (
                            [creditType] => By Arrangement with
                            [name] => Sony BMG Music Entertainment
                            [nameId] => 
                        )

                )

            [comment] => Array
                (
                    [0] => Courtesy of Columbia Records
                )

        )

    [2] => Array
        (
            [soundtrack] => The Weight
            [credits] => Array
                (
                    [0] => Written by Robbie Robertson
                    [1] => Performed by The Band
                    [2] => Courtesy of Capitol Records
                    [3] => Under license from EMI Film & Television Music
                )

            [creditSplit] => Array
                (
                    [0] => Array
                        (
                            [creditType] => Writer
                            [name] => Robbie Robertson
                            [nameId] => nm0005371
                        )

                    [1] => Array
                        (
                            [creditType] => Performer
                            [name] => The Band
                            [nameId] => nm1408415
                        )

                )

            [comment] => Array
                (
                    [0] => Courtesy of Capitol Records
                    [1] => Under license from EMI Film & Television Music
                )

        )

    [3] => Array
        (
            [soundtrack] => At Midnight
            [credits] => Array
                (
                    [0] => Written by Alan Blackman
                    [1] => Courtesy of Selectracks Music Services
                )

            [creditSplit] => Array
                (
                    [0] => Array
                        (
                            [creditType] => Written by
                            [name] => Alan Blackman
                            [nameId] => 
                        )

                )

            [comment] => Array
                (
                    [0] => Courtesy of Selectracks Music Services
                )

        )

    [4] => Array
        (
            [soundtrack] => Eine Kleine Nachtmusik
            [credits] => Array
                (
                    [0] => Written by Wolfgang Amadeus Mozart
                    [1] => Performed by The Swedish Concert Orchestra
                    [2] => Courtesy of Naxos
                    [3] => By Arrangement with Source/Q
                )

            [creditSplit] => Array
                (
                    [0] => Array
                        (
                            [creditType] => Writer
                            [name] => Wolfgang Amadeus Mozart
                            [nameId] => nm0003665
                        )

                    [1] => Array
                        (
                            [creditType] => Performed by
                            [name] => The Swedish Concert Orchestra
                            [nameId] => 
                        )

                    [2] => Array
                        (
                            [creditType] => By Arrangement with
                            [name] => Source/Q
                            [nameId] => 
                        )

                )

            [comment] => Array
                (
                    [0] => Courtesy of Naxos
                )

        )

)
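Given an array shaped like the output above, the "Band Name - Song Name" list requested at the start of this issue could be built along these lines. This is a sketch only: `trackList()` is a hypothetical helper, and the keys `soundtrack`, `creditSplit`, `creditType` and `name` are taken from the sample output, which comes from a test file and may still change.

```php
<?php
// Hypothetical formatter built on the array shape shown above.
function trackList(array $soundtracks): array
{
    $lines = array();
    foreach ($soundtracks as $track) {
        foreach ($track['creditSplit'] ?? array() as $credit) {
            // Matches "Performer", "Writer and Performer" and unconverted "Performed by"
            if (stripos($credit['creditType'], 'perform') !== false) {
                $lines[] = sprintf('%s - "%s"', $credit['name'], $track['soundtrack']);
            }
        }
    }
    return $lines;
}

$soundtracks = array(
    array(
        'soundtrack' => "We've Only Just Begun",
        'creditSplit' => array(
            array('creditType' => 'Writer', 'name' => 'Roger Nichols', 'nameId' => 'nm0629720'),
            array('creditType' => 'Performer', 'name' => 'The Carpenters', 'nameId' => 'nm1135559'),
        ),
    ),
    array(
        'soundtrack' => 'Watching The River Flow',
        'creditSplit' => array(
            array('creditType' => 'Writer and Performer', 'name' => 'Bob Dylan', 'nameId' => 'nm0001168'),
        ),
    ),
);
print_r(trackList($soundtracks));
```

The `nameId` of each matching credit is available in the same loop if the band names need to be linked.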

duck7000 commented 8 months ago

Another concern is this:
Performed by [Morgana King](https://www.imdb.com/name/nm0455088/?ref_=ttsnd) (uncredited)
Is "uncredited" important in this case? (I hope not, hah.) This is not handled and is ignored.

Or this:
Written by Gerald Sanders, Jesse Sanders, Norman Sander, and Leonard Delaney
Multiple creditors without a link. This is not handled; creditors are returned as a text string.

And this:
Written by [Bob Bogle](https://www.imdb.com/name/nm1175067/?ref_=ttsnd), Nole Edwards, and [Don Wilson](https://www.imdb.com/name/nm1174754/?ref_=ttsnd)
Mixed creditors with and without links. This is not handled; only links are fetched, the rest is ignored.

Another one:
Performed by [Torleif Thedeen](https://www.imdb.com/name/nm5106648/?ref_=ttsnd) (as Torleif Thedéen) & Entcho Radoukanov

Written by [Brian Eno](https://www.imdb.com/name/nm0006061/?ref_=ttsnd), [Michael Beinhorn](https://www.imdb.com/name/nm0067251/?ref_=ttsnd), Axel Gros & [Bill Laswell](https://www.imdb.com/name/nm0489960/?ref_=ttsnd)
Separated by & with mixed links and text. Links are returned, the rest is ignored.

Next:
Sung by [Peggy Wood](https://www.imdb.com/name/nm0939931/?ref_=ttsnd) (dubbed by [Margery MacKay](https://www.imdb.com/name/nm0571018/?ref_=ttsnd)), [Anna Lee](https://www.imdb.com/name/nm0496819/?ref_=ttsnd), [Portia Nelson](https://www.imdb.com/name/nm0625675/?ref_=ttsnd), [Marni Nixon](https://www.imdb.com/name/nm0633262/?ref_=ttsnd), [Ada Beth Lee](https://www.imdb.com/name/nm3529693/?ref_=ttsnd), and [Doreen Tryden](https://www.imdb.com/name/nm0874488/?ref_=ttsnd)
"Dubbed by" is not handled; it is ignored but the link is added (this is not a big problem, I guess).

Even more:
([Michael Jary](https://www.imdb.com/name/nm0419135/?ref_=ttsnd) / [Bruno Balz](https://www.imdb.com/name/nm0051300/?ref_=ttsnd))
Here there is no "by" but there are links. Handled as comment.

Next one:
Piano Soloist: [Jaromir Klepac](https://www.imdb.com/name/nm3657495/?ref_=ttsnd), Guitar Soloist: [Jaroslav Novák](https://www.imdb.com/name/nm6617336/?ref_=ttsnd)
Not handled, returned as comment (not a big problem, I guess).

And I'm not sure this will be the end of it... Let me know if any of the above is of interest to you. I will try, but this is going to be an uphill battle, and it all comes down to the fact that IMDb lets users put soundtrack info in a text field instead of giving predefined choices...

@GeorgeFive there is a lot to read here, so take your time, there is no rush.

GeorgeFive commented 8 months ago

> protected function checkSoundtrackType

Looks solid to me! administer should probably be capitalized, that's the only thing I see that should be changed. As for Published 1951.... that's pretty random. Is there a possibility of other years for that type?

I pretty much agree with your examples and how to handle them.... let's see....

> Dubbed by is not handled, ignored but link is added (this is not a big problem i guess)

How will that work out? Do you mean the entire dubbed section is stripped out? That would be fine, just wanted to make sure I understood that one.

> Not handled, retuned as comment

When we move stuff to a comment like this, will the html be stripped? I would definitely prefer that, would make things a lot smoother.

I needed to walk out the door ten minutes ago, so I will pick this back up tonight!

duck7000 commented 8 months ago

> protected function checkSoundtrackType Looks solid to me! administer should probably be capitalized, that's the only thing I see that should be changed. As for Published 1951.... that's pretty random. Is there a possibility of other years for that type?

It passed by so i added it. It was probably filled in wrong; no idea if this happens often..

> I pretty much agree with your examples and how to handle them.... let's see....
>
> Dubbed by is not handled, ignored but link is added (this is not a big problem i guess) How will that work out? Do you mean the entire dubbed section is stripped out? That would be fine, just wanted to make sure I understood that one.

The text "dubbed by" is stripped out, the name and id are added.

> Not handled, returned as comment When we move stuff to a comment like this, will the html be stripped? I would definitely prefer that, would make things a lot smoother.

Html will always be stripped, including in comments. The html is only used to get the name and id.

> I needed to walk out the door ten minutes ago, so I will pick this back up tonight!

Credited names with or without a link are now handled (i started over from a different point of view). Names separated by "&", "," or "and" are also handled, so all credits are in the output array, with or without an id.

Will you read the other comments i made above? I only tagged you in the last one.

GeorgeFive commented 8 months ago

Found a new type.... "Performed and Produced by" https://www.imdb.com/title/tt0109506/soundtrack/

Also a weird capitalization thing.... imdb shows "Time Baby II", our array returns "Time Baby Ii"
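As explained further down the thread, this happens because titles were lowercased and then passed through `ucwords`, which turns "II" into "Ii"; the later fix checks whether the title is all uppercase first. A minimal sketch of that idea, assuming a hypothetical helper name (`normalizeSoundtrackTitle` is not the library's actual function):

```php
<?php
// Hypothetical helper, not the library's actual code.
// Uses byte-wise string functions, so only ASCII letters are re-cased.
function normalizeSoundtrackTitle(string $title): string
{
    $title = trim($title);

    // A title counts as "all uppercase" when uppercasing changes nothing
    // but lowercasing does (so it contains at least one letter).
    $isAllUpper = strtoupper($title) === $title && strtolower($title) !== $title;

    if ($isAllUpper) {
        // Only shouted titles are re-cased.
        return ucwords(strtolower($title));
    }

    // Mixed-case titles such as "Time Baby II" are returned untouched.
    return $title;
}
```

An all-uppercase title such as "TIME BABY II" would still come out as "Time Baby Ii" with this approach, so it narrows the problem rather than removing it.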


imdb has:

Eyes Without a Face
Performed by The Flesh Eaters
Composed by Desjardins, Don Kirk, Robyn Jameson, Chris Wahl
Produced by [lik=nm0195968]

So we get:

[creditType] => Produced by
[name] => [lik=nm0195968]
[nameId] =>

https://www.imdb.com/title/tt0089907/soundtrack/ This is obviously an error on imdb's side, but maybe we could fix it on our end? The data is there...
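Since the id is still present inside the malformed tag, it can at least be recovered even when no name is available. A minimal sketch, assuming the tag is a mistyped `[link=nm...]` marker; `extractBrokenLinkId` is a hypothetical helper, not part of the library:

```php
<?php
// Hypothetical helper: pull the digits of a name id out of a raw
// "[link=nm0195968]" tag, also accepting the "lik" typo seen above.
function extractBrokenLinkId(string $text): ?string
{
    if (preg_match('/\[(?:link|lik)=nm(\d+)\]/i', $text, $m)) {
        return $m[1]; // digits only, leading zeros preserved
    }
    return null; // no raw tag found
}
```

The name itself cannot be recovered from the tag; it would have to be looked up separately by id.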


https://www.imdb.com/title/tt0095990/soundtrack More types... "Performed and Written by" and "Performed, Written and Produced by"


https://www.imdb.com/title/tt0088763/soundtrack/

Out the Window (uncredited)
Written and Performed by Edward Van Halen (uncredited)
[Played by Marty to George when he is pretending to be Darth Vader from Planet Vulcan]

The "by" gets caught and formatted oddly....

[creditType] => Played by
[name] => Marty to George when he is pretending to be Darth Vader from Planet Vulcan]
[nameId] =>


https://www.imdb.com/title/tt0109830/soundtrack/

This is very obviously an error on imdb's side, but it can result in a null creditType.

Webster's Boomer
bWritten by y David Michael Frank (as David Frank)

[creditType] =>
[name] => David Michael Frank
[nameId] => nm0006080

New "by" - "Arranged by" and "Written & Performed by"

GeorgeFive commented 8 months ago

These are the only issues I've found so far. I obviously haven't tested hundreds of titles yet, but it seems to be a bunch of minor stuff. Looking good!

> 2nd question: do you want nameId to include nm or only digits? Throughout this library only digits are used, i have mixed feelings about that. From an int point of view only digits make sense, but from a complete id point of view nm should be included.

Personally, I would prefer to strip out the nm. I would strip it out on my end otherwise, as I store everything with int (as mentioned) in my database and only append nm or tt when needed.
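Either convention is easy to convert to the other. A small sketch with hypothetical helpers (`stripNamePrefix` and `withNamePrefix` are illustrative names, not library functions); it assumes name ids are "nm" followed by digits, zero-padded to at least seven digits:

```php
<?php
// Hypothetical helper: reduce "nm0006080" (or "0006080") to its digits.
function stripNamePrefix(?string $nameId): ?string
{
    if ($nameId === null || $nameId === '') {
        return null; // credits without a link have no id
    }
    if (preg_match('/^(?:nm)?(\d+)$/', $nameId, $m)) {
        return $m[1];
    }
    return null; // not a recognisable name id
}

// Hypothetical helper: rebuild the full id from digits or a stored integer.
// Storing the id as an int drops leading zeros, so they are padded back here.
function withNamePrefix(string $digits): string
{
    return 'nm' . str_pad($digits, 7, '0', STR_PAD_LEFT);
}
```

With this, a database can keep `(int) stripNamePrefix($id)` and only call `withNamePrefix()` when a full id or URL is needed.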

duck7000 commented 8 months ago

> Found a new type.... "Performed and Produced by" https://www.imdb.com/title/tt0109506/soundtrack/

Fixed

> Also a weird capitalization thing.... imdb shows "Time Baby II", our array returns "Time Baby Ii"

Will investigate. This happens because the title is lowercased and then passed through ucwords (uppercase words); i did this because sometimes the title is all uppercase. I changed this part: the title is now checked for whether it is all uppercase or not.

> imdb has: Eyes Without a Face Performed by The Flesh Eaters Composed by Desjardins, Don Kirk, Robyn Jameson, Chris Wahl Produced by [lik=nm0195968]
>
> So we get: [creditType] => Produced by [name] => [lik=nm0195968] [nameId] =>

Flaw at imdb. I submitted a correction for this data to imdb (don't know if or when they will apply it). I added code to at least get the id; no name is fetched though.

> https://www.imdb.com/title/tt0089907/soundtrack/ This is obviously an error on imdb's side, but maybe we could fix it on our end? The data is there...
>
> https://www.imdb.com/title/tt0095990/soundtrack More types... "Performed and Written by" and "Performed, Written and Produced by"

Fixed

> https://www.imdb.com/title/tt0088763/soundtrack/ Out the Window (uncredited) Written and Performed by Edward Van Halen (uncredited) [Played by Marty to George when he is pretending to be Darth Vader from Planet Vulcan]
>
> The "by" gets caught and formatted oddly....
>
> [creditType] => Played by [name] => Marty to George when he is pretending to be Darth Vader from Planet Vulcan] [nameId] =>

Well, that is a flaw in using "by" as split point. We have to accept that this is not always solid; i will think about it.

> https://www.imdb.com/title/tt0109830/soundtrack/
>
> This is very obviously an error on imdb's side, but it can result in a null creditType.
>
> Webster's Boomer bWritten by y David Michael Frank (as David Frank)

I can't fix this one, it was filled in so wrongly by an imdb user. I submitted a correction to imdb itself (don't know if they will apply it).

> [creditType] => [name] => David Michael Frank [nameId] => nm0006080
>
> New "by" - "Arranged by" and "Written & Performed by"

Fixed

I will investigate those cases, but they are not always fixable as they are mostly mistakes made by imdb users.

The best thing to do is to correct the data at imdb (if you have an account; i do), so i corrected a few of the faults above.

duck7000 commented 8 months ago

Title3.tar.gz

I made a new version, the approach is different in this one. It will fetch all names and ids if available, so it should work better than the first version. If you have the time, check it out please.

Edit: i found another flaw in my function. If this is in the credit line:

Performed by The Jesus & Mary Chain

the name is split in two parts, because the "&" is replaced by a comma and the string is then split on commas. So the first "performed" entry has the correct id and half the name, and the second only has the other half of the name..

Technically i can live with that. You can get the correct name from credits, get it through the name class, or join those 2 parts. But it is a flaw which i can not fix without breaking the function.

duck7000 commented 8 months ago

Sung by Peggy Wood (dubbed by Margery MacKay), Anna Lee, Portia Nelson, Marni Nixon, Ada Beth Lee, and Doreen Tryden

"Dubbed by" is completely ignored in the latest version (no type, name or id is added), so i think that is fine as the dubbed person is not the singer.

(Michael Jary / Bruno Balz)

There is no "by" here but there are links. I changed this at imdb, so that might be handled now.

Next one: Piano Soloist: Jaromir Klepac, Guitar Soloist: Jaroslav Novák

Not handled, for now returned as comment, but i changed this at imdb as well so that might not be a problem anymore.

duck7000 commented 8 months ago

Hopefully the final version.. Title4.tar.gz

Most issues should be squashed (let's hope that imdb accepts my changes). The issue with the jesus & mary chain remains.

Now you surely understand why i hesitated to even try hahaha. So many different scenarios to deal with. But my life motto is "never quit", so now we do have a working soundtrack (with a few questionable leftovers, that is..)

Edit: I added an echo for when a found type is not in the array; it will help you identify which type needs to be added. Title5.tar.gz

GeorgeFive commented 8 months ago

> The issue with the jesus & mary chain remains

Maybe we could take the easy way with this, and replace "&" with "and" before processing? This may be a more common problem, there's lots of bands out there with "&" in their name. I don't think anyone will really care about that replacement, and if they do, they can always run it through the name class to get the & back.

No work tomorrow, so I plan on doing some serious work with this. Thanks!

duck7000 commented 8 months ago

Well, the main problem is that i use "&" or "and" as the split point to separate the artists. So i must check whether "&" is used as a separator (so i can use it as split point) or whether it is part of the anchor name.

In other words, "&" is used inside anchor names AND as separator between artists; the problem is how i deal with that.

It is not that simple.

The first title file that i uploaded here takes a different approach: it fetches only anchor links, but it has a problem when artists with and without an anchor are mixed, so that doesn't work. The later versions use explode to break up artists separated by a comma. The problem is that the separator is sometimes a comma, sometimes "&", sometimes "and", so i replace all those different separators with a comma. This works fine but does break up anchors that have "&" in their name. This is a nasty problem, and again this comes down to how imdb handles the user input.

Edit: I think i fixed it, through an ugly (and not my cup of tea!) regex.. It replaces '&' or 'and' only if it is not inside an anchor tag. It is impossible to factor in all possibilities, so edge cases will remain.

Here is the new version.. i doubt it will be the final hah Title6.tar.gz
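The regex itself is not shown in this thread. The following is only a sketch of the general idea, assuming the credit arrives as HTML with `<a href=".../name/nm.../">` links; `splitArtists` and its separator handling are illustrative, not the code in Title6.tar.gz:

```php
<?php
// Illustrative sketch only; not the regex used in the library.
function splitArtists(string $html): array
{
    // Cut the string into anchor and non-anchor pieces; anchors stay whole.
    $parts = preg_split(
        '~(<a\b[^>]*>.*?</a>)~is',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    ) ?: [];

    $sep = "\x1F"; // control character used as a temporary split marker
    $marked = '';
    foreach ($parts as $part) {
        if (preg_match('~^<a\b~i', $part)) {
            $marked .= $part; // separators inside a link are left alone
        } else {
            // Outside links, treat ",", "&" (or "&amp;") and "and" as separators.
            // The lookahead keeps other HTML entities such as "&#39;" intact.
            $marked .= preg_replace('~\s*(?:,|&amp;|&(?![#\w]+;)|\band\b)\s*~i', $sep, $part);
        }
    }

    $artists = [];
    foreach (explode($sep, $marked) as $piece) {
        $name = trim(html_entity_decode(strip_tags($piece), ENT_QUOTES, 'UTF-8'));
        if ($name === '') {
            continue; // skip empty pieces left by ", and"
        }
        // Digits of the name id, or null when the artist has no link.
        $id = preg_match('~/name/nm(\d+)~', $piece, $m) ? $m[1] : null;
        $artists[] = ['name' => $name, 'nameId' => $id];
    }
    return $artists;
}
```

A name that has no anchor and contains "&" or "and" is still split in two by this approach, which is one of the edge cases that cannot be decided from the text alone.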

duck7000 commented 8 months ago

Good news!

IMDb has approved all my changes! So all faults related to bad user input in your examples are fixed. Apparently it pays off to request changes, so if you find more issues please report them to IMDb.

GeorgeFive commented 8 months ago

https://www.imdb.com/title/tt0069091/soundtrack/
Not in array: Written and Sung by

https://www.imdb.com/title/tt0109506/soundtrack/
Not in array: ©Fractured Music (all rights administered by
Not in array: Written and Arranged by

https://www.imdb.com/title/tt0117330/soundtrack/
Not in array: Written, Performed and Produced by

https://www.imdb.com/title/tt2267968/soundtrack/
Not in array: Production supervised by

"Lyricsist" is misspelled = Lyricist

https://www.imdb.com/title/tt1302011/soundtrack/
Not in array: Used by

https://www.imdb.com/title/tt0072431/soundtrack/
Not in array: Sung and Danced by
Not in array: Also sung by

https://www.imdb.com/title/tt1431045/soundtrack/
Not in array: Courtesy of Barnaby
Not in array: (contains a sample of "Low Rider" performed by

https://www.imdb.com/title/tt5334704/soundtrack/
Not in array: Guitar by

https://www.imdb.com/title/tt1051906/soundtrack/
Not in array: (c) Published by
Not in array: (c) Published by
Not in array: (c) Quiet as Kept Music Inc. Licensed by
^^Which is weird because I don't even see this data displayed on their page.

https://www.imdb.com/title/tt0167260/soundtrack/
Not in array: Adapted by
Not in array: Orchestration by

https://www.imdb.com/title/tt0109830/soundtrack/
Not in array: Adaption and Music by

https://www.imdb.com/title/tt0088763/soundtrack/
The Jesus & Mary Chain example worked great, but it's not working as well with Huey Lewis & The News... it returns double results, one right, one not.

    [0] => Array
    (
        [creditType] => Performer
        [name] => Huey Lewis & The News
        [nameId] => 3539158
    )
    [1] => Array
    (
        [creditType] => Performer
        [name] => The News)
        [nameId] =>
    )

GeorgeFive commented 8 months ago

So it seems like the only real issue now is the "not in array" stuff, which will undoubtedly be endless.

When I actually start running this and not just testing it, it'll be in automatic mode and I won't be catching this stuff... maybe a simple error log function? Could be something as simple as this - https://www.geeksforgeeks.org/how-to-log-errors-and-warnings-into-a-file-in-php/
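A rough sketch of what such a log could look like, using PHP's built-in `error_log` with message type 3 (append to a file). The helper name, the log format and the choice to record the title id next to the unknown type are assumptions, not something the library provides:

```php
<?php
// Hypothetical helper: append an unknown credit type to a log file.
function logUnknownCreditType(string $imdbId, string $type, string $logFile): bool
{
    $line = sprintf(
        '[%s] tt%s: not in array: %s%s',
        date('Y-m-d H:i:s'),
        $imdbId,
        $type,
        PHP_EOL // message type 3 does not add a newline by itself
    );

    // Message type 3 appends the message to the file given as destination.
    return error_log($line, 3, $logFile);
}
```

Logging the title id together with the unknown type makes it possible to revisit the exact soundtrack page later, for example `logUnknownCreditType('5334704', 'Guitar by', '/tmp/soundtrack.log')`.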

duck7000 commented 8 months ago

Well, all this "not in array" stuff is a nightmare and we should abandon it. This will never work right. I suggest removing it and returning everything as is, with the word "by" stripped? Or completely as is?

The jesus & mary chain works indeed, but not if there is no anchor link. Like i said, it is impossible to know if the '&' is used inside a name or as separator between artists (both are used).

And to make matters worse, imdb allows '/' as separator between artists as well, so i have to factor that in too..

I told you up front that this is no easy task, and i begin to think it might not be possible to catch all cases.

I edited the soundtrack entries of about 50 titles at imdb that contained errors, were badly written, did not follow the imdb examples etc., but this is an unforgiving task.

duck7000 commented 8 months ago

> So it seems like the only real issues now is the "not in array" stuff, which will undoubtedly be endless.
>
> When I actually start running this and not just testing it, it'll be in automatic mode and I won't be catching this stuff... maybe a simple error log function? Could be something as simple as this - https://www.geeksforgeeks.org/how-to-log-errors-and-warnings-into-a-file-in-php/

I will look into this.

The big question is: what would have to be logged?

GeorgeFive commented 8 months ago

I have a feeling that I could test 100 more movies, and find 80 more "by" entries. Maybe we should step back and simplify this a bit. Instead of trying to grab every possible "by" out there, we focus more on the standard ones? Performed, written, produced, composed, etc. If we want the ids and name data for the "meat" of the credit, we can use creditSplit.... if we want every possible nugget of information, we use credits.

Might also make it a little easier to find the "X and X by" entries... ie, run a regex on "Written and Performed by", find "Written", give a writer credit, find "Performed", give a performed credit.... remove the "Written and Performed by" altogether.

Things like "Sung and Danced by" could be stripped entirely from creditSplit, nothing listed in the credits section even if there is a name id... but we could leave it in the main credits part of the array for completeness.

duck7000 commented 8 months ago

I don't completely understand what you mean but i suggest that i remove the whole function with all the different types.

Should i return the "X by" string as is, or with "by" stripped? What you do with this info is then up to you. I hate regex, this will complicate things even more.

GeorgeFive commented 8 months ago

So in the array, we keep the credits section as is. It has all the data, no ids, just a nice chunk of text with every possible thing in it. We don't need to scan it or play with it, we just get it straight from imdb and leave it as is now.

creditSplit will get the main data, but not all the niche data.

If it says "Performed by", "Performed and written by", "Performed and produced by", "Performed and dreamed by", "Performed and typoblah by".... anything like that.... we skim for the word "Performed" and give them a performed credit.

Same thing with "Written __", "Produced __", "Composed __", "Conducted __".... that may be enough?

So in this case, the creditType field would only have five possible values... if it doesn't match one of the above words, we ignore it in creditSplit (but leave it in the main credits part of the array).

I think this may be the direction to go, because we will never get all possible "by" values, and any regex trickery we come up with to try to solve this will inevitably break something else. I'd rather focus on the base data, and know that it is right, than worry about the niche case of "hummed by" popping up.

duck7000 commented 8 months ago

Ah okay i see what you mean now

credits is and will always remain like it is now and contains all data in text form

creditSplit only gets a creditType if it is one of those main types, and ignores the rest? (Or uses a default type?) So we focus on the first main word of the type.

I will try, maybe tonight.

GeorgeFive commented 8 months ago

Pretty much, yep. Not necessarily focus on only the first word though, we would pick keywords from the string itself.

Main credits gets everything and stays as is.

creditSplit example:

Performed by = performer credit
Performed and written by = performer credit and writer credit
Performed, written, and hummed by = performer credit, writer credit, ignore hummed
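The keyword matching in these examples could be sketched as follows. This is only an illustration: the map, the function name `mainCreditTypes` and the output labels other than "Performer" are assumptions, not the library's code:

```php
<?php
// Hypothetical keyword map; the five keywords follow the list proposed above.
const MAIN_CREDIT_TYPES = [
    'performed' => 'Performer',
    'written'   => 'Writer',
    'produced'  => 'Producer',
    'composed'  => 'Composer',
    'conducted' => 'Conductor',
];

// Return every main credit type whose keyword occurs in the "X by" label.
function mainCreditTypes(string $creditLabel): array
{
    $found = [];
    foreach (MAIN_CREDIT_TYPES as $keyword => $type) {
        // Whole-word, case-insensitive match; unknown words are ignored.
        if (preg_match('/\b' . $keyword . '\b/i', $creditLabel)) {
            $found[] = $type;
        }
    }
    return $found;
}
```

Types come back in map order, not in the order they appear in the label. An empty result means the credit would only show up in the plain `credits` text, and a result with two types means the same name would be added to `creditSplit` twice.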

duck7000 commented 8 months ago

> Performed and written by = performer credit and writer credit

This would mean that the credited person will be added 2 times (or more) to creditSplit? That means i have to split up the part before "by" and use it twice as creditType. This part will get more complicated (the other part was already a nightmare, but finally acceptable).