Wikidata / soweego

Link Wikidata items to large catalogs
https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
GNU General Public License v3.0

Investigate the intended meaning of the 'miscellaneous' profession in IMDb #161

Closed by marfox 5 years ago

marfox commented 5 years ago

miscellaneous appears more than 1 million times in the IMDb names dump. In 55.22% of those cases, it is the only profession listed.

Draw a sample and try to understand the intended meaning of that occupation.
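A minimal sketch of how the two figures above could be computed. It assumes the names dump is the tab-separated name.basics.tsv file with a comma-separated `primaryProfession` column; the function name `miscellaneous_share` and the inline sample rows are hypothetical stand-ins, not soweego code.

```python
import csv
import io

MISSING = r"\N"  # marker the IMDb TSV dumps use for empty values


def miscellaneous_share(tsv_file, target="miscellaneous"):
    """Count rows that list `target`, and rows that list nothing else."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = alone = 0
    for row in reader:
        professions = [
            p for p in row["primaryProfession"].split(",") if p and p != MISSING
        ]
        if target in professions:
            total += 1
            if professions == [target]:
                alone += 1
    return total, alone


# Hypothetical rows standing in for the real dump.
SAMPLE = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\tprimaryProfession\tknownForTitles\n"
    "nm1\tPerson A\t1950\t\\N\tmiscellaneous\ttt1\n"
    "nm2\tPerson B\t1960\t2000\tactor,miscellaneous\ttt1,tt2\n"
    "nm3\tPerson C\t\\N\t\\N\tmiscellaneous\t\\N\n"
    "nm4\tPerson D\t1940\t1990\tdirector\ttt2\n"
)

total, alone = miscellaneous_share(io.StringIO(SAMPLE))
print(f"{total} rows list miscellaneous; {100 * alone / total:.2f}% list it alone")
```

On the real dump, the same function could be fed a file opened with `gzip.open(path, "rt", encoding="utf-8", newline="")` instead of the in-memory sample.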

tupini07 commented 5 years ago

This page defines miscellaneous as, basically, all the jobs that don't fit in any other category. However, the end of the page says that IMDb intends to progressively split this list into concrete job categories, so in the future a smaller share of people may be listed only as miscellaneous.

The IMDb dump name.basics.tsv.gz contains at most the top 3 professions (or job categories) of each person. By browsing the IMDb site directly (the filmography section of a person's page), we can also see the individual jobs a person has held. Sadly, these jobs are not present in the name.basics dump. They are present for some people in the title.principals.tsv.gz dump, but that dump only covers the most important people for each movie (meaning that most of them are actors/actresses, directors, or producers), so it doesn't add much information to what name.basics already gives us.
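As an illustration of the check described above, here is a rough sketch that looks up the `job` values available in title.principals for a set of people. It assumes the dump has the columns `tconst`, `ordering`, `nconst`, `category`, `job`, and `characters`; the function name `jobs_from_principals` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

MISSING = r"\N"  # marker the IMDb TSV dumps use for empty values


def jobs_from_principals(principals_file, person_ids):
    """Map each requested nconst to the set of non-empty `job` values found."""
    reader = csv.DictReader(principals_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    jobs = defaultdict(set)
    for row in reader:
        if row["nconst"] in person_ids and row["job"] != MISSING:
            jobs[row["nconst"]].add(row["job"])
    return dict(jobs)


# Hypothetical rows standing in for title.principals.tsv.
PRINCIPALS = (
    "tconst\tordering\tnconst\tcategory\tjob\tcharacters\n"
    "tt1\t1\tnm1\tactor\t\\N\t[\"Self\"]\n"
    "tt1\t2\tnm2\tproducer\tassociate producer\t\\N\n"
    "tt2\t1\tnm3\tdirector\t\\N\t\\N\n"
)

wanted = {"nm1", "nm2", "nm9"}
found = jobs_from_principals(io.StringIO(PRINCIPALS), wanted)
print("people with at least one job:", sorted(found))
print("people without any job:", sorted(wanted - found.keys()))
```

People who never appear among a title's principals, or whose `job` field is empty, end up without any job, which is the gap the comment describes.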

I analyzed the jobs of 267 random people who have miscellaneous as a profession: 134 have it as their only profession, and the remaining 133 have both miscellaneous and something else. The jobs include Assistant, Programmer, Coordinator, Consultant, Supervisor, Advertiser, and Caterer, among others. See below for the full list.

Complete list of jobs obtained from the sample:

```
accountant accounts runner acting office actor intern additional research additional script supervisor additional voice adr director adr loop group advertising aerial photography pilot app programmer archive researcher archive source archive archivist artist production assistant assistant accountant assistant production coordinator assistant stage manager assistant to actor assistant to director assistant to executive producer assistant to line producer assistant to producer assistant associate producer associate background performer ballet arranger ballroom dance supervisor beta tester boat safety cast assistant cast manager caterer child acting coach child counsellor choreographer clearance coordinator colorist completed construction accountant consultant continuity contributor coordinator cost accountant craft service creative director crowd control dance arranger dance designer dance director dance ensemble dance sequence dances and ballet dances and ensembles dancing supervisor development support dialogue coach dialogue consultant distribution team dog wrangler double drone operator engineering manager executive story editor executive vice president accounting executive vice president extra extras coordinator extras wrangler field coordinator film researcher financial controller firearm coordinator first assistant accountant floor runner follow spot operator game design game designer game programmer game staff game supervisor game tester hardware head talent coordinator historical advisor horse wrangler intern interpreter interviewer kitchen specialist language testing legal consultant level planner local coordinator localization producer logger logistics manager marketing consultant mask maker medic multiplayer game designer narrator nurse other crew overseas director party band payroll accountant payroll clerk pilot prank consultant presenter producer attachment production accountant production assistant production coordinator production executive
production intern production office production runner production secretary production staff production team production trainee program consultant program printing program supervisor programmer projectionist psychology advisor puppet fabricator puppeteer qa tester quality assurance question adjudicator question researcher question setter question verification question writer questions researcher research and development engineer research intern researcher runner sales coordinator sales scientific consultant script continuity script coordinator script editor script supervisor assistant script supervisor security coordinator security manager security senior engineer senior researcher senior vice president accounting senior vice president of finance senior vice president set production assistant set support special consultant special video production stage crew coordinator staging stand-in story editor stylised sequence subtitles supplies talent coordinator talent executive tape logger technical advisor technical consultant technical manager technical operations supervisor tester titles traffic controller transcriber translator transmission travel assistant treasurer unspecified crew member utility vice president video production videotape research vmc tester voice director voice volunteer coordinator writers assistant
```


I also took a brief look at the people who have miscellaneous as a profession. The results are below; they are interesting to know, but may or may not be useful.

By looking at the birth year and death year columns:

For people with only miscellaneous as profession:

0.95% of the entries have a birth year
0.39% have a death year

The mean birth year is: 1948
Mean death year: 1993

For people with miscellaneous and something else as primary professions:

8.99% of the entries have a birth year
2.31% have a death year

The mean birth year is: 1957
Mean death year: 1992

For all records in the dataset (including those with miscellaneous):

5.19% of the entries have a birth year
1.80% have a death year

The mean birth year is: 1950
Mean death year: 1987
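For reference, a hedged sketch of how coverage percentages and mean years like the ones above could be derived from name.basics. The column names `birthYear`, `deathYear`, and `primaryProfession` are assumed to match the dump; the function `year_stats`, the group labels, and the sample rows are hypothetical and are not meant to reproduce the reported numbers.

```python
import csv
import io

MISSING = r"\N"  # marker the IMDb TSV dumps use for empty values


def year_stats(tsv_file, target="miscellaneous"):
    """Birth/death year coverage (%) and mean year per profession group."""
    groups = {
        name: {"rows": 0, "birth": [], "death": []}
        for name in ("only", "mixed", "all")
    }
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        professions = [
            p for p in row["primaryProfession"].split(",") if p and p != MISSING
        ]
        names = ["all"]  # every row counts toward the overall figures
        if professions == [target]:
            names.append("only")
        elif target in professions:
            names.append("mixed")
        for name in names:
            group = groups[name]
            group["rows"] += 1
            for column, key in (("birthYear", "birth"), ("deathYear", "death")):
                if row[column].isdigit():  # skips the missing-value marker
                    group[key].append(int(row[column]))

    def summarize(group):
        out = {}
        for key in ("birth", "death"):
            years = group[key]
            out[f"{key}_pct"] = 100 * len(years) / group["rows"] if group["rows"] else 0.0
            out[f"mean_{key}"] = sum(years) / len(years) if years else None
        return out

    return {name: summarize(group) for name, group in groups.items()}


# Hypothetical rows standing in for the real dump.
SAMPLE = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\tprimaryProfession\tknownForTitles\n"
    "nm1\tPerson A\t1950\t\\N\tmiscellaneous\ttt1\n"
    "nm2\tPerson B\t1960\t2000\tactor,miscellaneous\ttt1,tt2\n"
    "nm3\tPerson C\t\\N\t\\N\tmiscellaneous\t\\N\n"
    "nm4\tPerson D\t1940\t1990\tdirector\ttt2\n"
)

stats = year_stats(io.StringIO(SAMPLE))
for name, values in stats.items():
    print(name, values)
```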

Distribution of birth/death years:

[Figure figure_2: distribution of birth/death years]

We can see that older people tend to have only miscellaneous as their profession, while those who have miscellaneous and something else tend to be much younger.

We can also count the number of movies for which a person is known. In the graph below we can see a comparison between people with only miscellaneous and those with other professions (0 means that no record was present in the dump).

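[Figure figure_1: number of movies a person is known for, people with only miscellaneous vs. people with other professions]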

We can see that people with only miscellaneous as their profession tend to be known for fewer movies. This makes sense, because more movies mean more opportunities to work in different professions.
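The comparison above could be reproduced along these lines. This is only a sketch: it assumes the number of movies a person is known for comes from the comma-separated `knownForTitles` column of name.basics, and that "other professions" means every row that is not miscellaneous-only. The function name `known_for_histogram` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

MISSING = r"\N"  # marker the IMDb TSV dumps use for empty values


def known_for_histogram(tsv_file, target="miscellaneous"):
    """Histogram of known-for title counts, split by profession group."""
    histogram = {"only": Counter(), "other": Counter()}
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        professions = [
            p for p in row["primaryProfession"].split(",") if p and p != MISSING
        ]
        titles = [
            t for t in row["knownForTitles"].split(",") if t and t != MISSING
        ]
        group = "only" if professions == [target] else "other"
        histogram[group][len(titles)] += 1  # 0 means no title was recorded
    return histogram


# Hypothetical rows standing in for the real dump.
SAMPLE = (
    "nconst\tprimaryName\tbirthYear\tdeathYear\tprimaryProfession\tknownForTitles\n"
    "nm1\tPerson A\t1950\t\\N\tmiscellaneous\ttt1\n"
    "nm2\tPerson B\t1960\t2000\tactor,miscellaneous\ttt1,tt2\n"
    "nm3\tPerson C\t\\N\t\\N\tmiscellaneous\t\\N\n"
    "nm4\tPerson D\t1940\t1990\tdirector\ttt2\n"
)

histogram = known_for_histogram(io.StringIO(SAMPLE))
for group, counts in histogram.items():
    print(group, sorted(counts.items()))
```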

marfox commented 5 years ago

Great work, @tupini07 , thanks a lot!

> For all records in the dataset (including those with miscellaneous):
>
> 5.19% of the entries have a birth year
> 1.80% have a death year

OMG, very very few. We should really focus on occupations then.

marfox commented 5 years ago

@tupini07 , what's your final say on the 500k rows with miscellaneous only?

marfox commented 5 years ago

> by directly browsing the IMDB site (in the filmography section of a person page), we can see also the job for a person. Sadly these jobs are not present in the name.basics dump. They are present for some people in the title.principals.tsv.gz dump, but this dump only contains information on the most important people for a movie (meaning that most of them are either actor/actress, director, or producer). So it doesn't give us much more information with respect to the name.basics dump.

@tupini07 , do you think it is worth implementing some form of web scraping to extract those jobs?

tupini07 commented 5 years ago

Thanks @marfox , I think we could create a script to do it but we should first ask the IMDB licensing department for permission (since they don't allow scraping without explicit permission). I've created a new task #170 to track this.

tupini07 commented 5 years ago

I've received an answer from IMDb saying that scraping is not allowed. For now, we've decided to add everyone with miscellaneous as a profession to all the IMDb entities in the database.