comparing old and new movie data

hyunjimoon commented 1 year ago

issues

slightly different names

- spelling variants: y vs. i
- abbreviations: Dominique A. vs. Dominique Abel
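
One possible way to treat such near-identical names as the same person when comparing old and new is sketched below. The helper `names_match` is hypothetical (not part of the existing pipeline), and the y/i folding is a crude assumption that can produce false positives, so it sits behind a flag.

```python
def names_match(a: str, b: str, fold_y_i: bool = True) -> bool:
    """Loose name comparison (hypothetical helper, illustration only).

    Two names match if every token is equal, or if one token is a
    single-letter initial of the other ("A." vs "Abel").
    """
    def tokens(name: str) -> list[str]:
        cleaned = name.replace(".", "").lower()
        if fold_y_i:
            # Assumption: 'y' and 'i' are interchangeable spellings.
            cleaned = cleaned.replace("y", "i")
        return cleaned.split()

    ta, tb = tokens(a), tokens(b)
    if len(ta) != len(tb):
        return False
    for x, y in zip(ta, tb):
        if x == y:
            continue
        # A one-letter token is treated as an initial of the other token.
        if len(x) == 1 and y.startswith(x):
            continue
        if len(y) == 1 and x.startswith(y):
            continue
        return False
    return True


print(names_match("Dominique A.", "Dominique Abel"))
print(names_match("Dominique A.", "Dominique Blanc"))
```

Matching on initials is deliberately permissive; in practice it would probably need to be combined with the title and year so that two different people with the same initial are not merged.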

new is smaller than IMDb online

- Full credit info that is available online (e.g. Viva, 2001, TV series, 1 episode) is not included in new.tsv.
- For documentaries, the category is not actor or actress but self, e.g. Scrooge from Courier Culture (order = 1, but far from a star).

-- Should I include all categories? array(['self', 'director', 'cinematographer', 'composer', 'producer', 'editor', 'actor', 'actress', 'writer', 'production_designer', 'archive_footage', 'archive_sound'])

tt2236646       video   Courier Culture Courier Culture 0       2012    \N      9       Biography,Documentary,News
bash-3.2$ grep -w "Courier Culture" movie_principals.tsv 
tt7813156       10      nm9522476       self    \N      ["Self - The Courier Culture Editor"]
tt7813156       9       nm9522475       self    \N      ["Self - The Courier Culture Editor"]

statistics

- old: 16m (15870224)
- new: 20m (20517830)
- oldnew_left_merge: 16m (15876865). This can exceed old when the right table has duplicate [title_year, primaryName] rows: if the right table has two records that match one record in the left table, the merge returns two records.
- oldnew_inner_merge: 2.5m
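
The left merge has 15876865 - 15870224 = 6641 more rows than old, which is consistent with duplicate keys on the right side. A minimal pandas sketch of that effect follows; the rows are made up, and only the key column names `title_year` and `primaryName` come from the note above.

```python
import pandas as pd

# Toy stand-ins for the old and new tables (rows are invented).
old = pd.DataFrame({
    "title_year": ["Suits (2011)", "Viva (2001)"],
    "primaryName": ["A", "B"],
})
new = pd.DataFrame({
    "title_year": ["Suits (2011)", "Suits (2011)"],
    "primaryName": ["A", "A"],
    "category": ["actor", "self"],  # same person listed twice
})

keys = ["title_year", "primaryName"]

# Left merge: the duplicated key in `new` yields two rows for one old row.
left = old.merge(new, on=keys, how="left")
print(len(old), len(left))

# Number of rows in `new` that repeat an earlier key.
dup_rows = int(new.duplicated(subset=keys).sum())
print(dup_rows)

# Deduplicating the right side keeps the left merge at len(old);
# validate raises if the right keys are still not unique.
left_dedup = old.merge(
    new.drop_duplicates(subset=keys),
    on=keys,
    how="left",
    validate="many_to_one",
)
print(len(left_dedup))
```

Whether dropping duplicates is appropriate depends on why they exist (e.g. one person with several categories on the same title), so counting them first seems safer than silently removing them.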

Q. For new, isn't a title : title-person ratio of 1:3 too small? (Can titles with ~10 cast members be few enough, and titles with a single cast member, e.g. documentaries, numerous enough, to explain this?)

hyunjimoon commented 1 year ago

less urgent notes

O'Connell?


'Nico' Arcuri, Dominic  Our Final Slumber (2016)
                    Stuck in the Middle (2016)  [Happy Man]

'O'Connell, Colette Hello Au Revoir (2018) [Letty O]



- `Finbar 'Finchie''Coveney ` becomes  `Finbar 'Coveney` after processing

hyunjimoon commented 1 year ago

300k entries have the same parentTconst but different startYear

(-> treat them as separate titles, since that could cancel out more of old?)
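
A rough way to find these cases is to count distinct `startYear` values per `parentTconst`. The sketch below uses invented rows and assumes a table with `tconst`, `parentTconst` and `startYear` columns; the actual file and column layout may differ.

```python
import pandas as pd

# Hypothetical episode-level rows (values are invented).
episodes = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "parentTconst": ["ttA", "ttA", "ttB", "ttB"],
    "startYear": [2011, 2012, 2015, 2015],
})

# Distinct start years per parent title.
years_per_parent = episodes.groupby("parentTconst")["startYear"].nunique()

# Parents whose children span more than one start year.
multi_year_parents = years_per_parent[years_per_parent > 1].index

# Rows belonging to those parents.
multi_year_rows = episodes[episodes["parentTconst"].isin(multi_year_parents)]
print(list(multi_year_parents), len(multi_year_rows))
```

Comparing `len(multi_year_rows)` with the number of affected parents would also show whether the 300k figure refers to rows or to parent titles.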

hyunjimoon commented 1 year ago

Compared to the old data, which has at least a hundred person-title rows for Suits (2011), the new data (movie_principals.tsv) includes only 5 rows (for the entire series, not just one season).

suits.csv
