Data4DM / BayesSD

Data for Decision, Affordable Analytics for All

comparing old and new movie data #138

Closed hyunjimoon closed 8 months ago

hyunjimoon commented 1 year ago

issues

slightly different names

new is smaller than the IMDb online data

-- should I include all categories? array(['self', 'director', 'cinematographer', 'composer', 'producer', 'editor', 'actor', 'actress', 'writer', 'production_designer', 'archive_footage', 'archive_sound'])

tt2236646       video   Courier Culture Courier Culture 0       2012    \N      9       Biography,Documentary,News
bash-3.2$ grep -w "Courier Culture" movie_principals.tsv 
tt7813156       10      nm9522476       self    \N      ["Self - The Courier Culture Editor"]
tt7813156       9       nm9522475       self    \N      ["Self - The Courier Culture Editor"]

statistics

- old: 16m (15870224)
- new: 20m (20517830)
- oldnew_left_merge: 16m (15876865; can increase if right has duplicate [title_year, primaryName] rows: if the right table has two records that match one record in the left table, the merge returns two records)
- oldnew_inner_merge: 2.5m
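The left merge has 15876865 - 15870224 = 6641 more rows than old, which is what duplicate keys on the right would produce. A minimal sketch of that effect with pandas, using made-up rows; the key columns `title_year` and `primaryName` come from the note above, everything else (table contents, the `category` values chosen) is an assumption for illustration:

```python
import pandas as pd

# Toy stand-ins for the old and new tables; the rows are invented.
old = pd.DataFrame({
    "title_year": ["A (2001)", "B (2002)", "C (2003)"],
    "primaryName": ["Ann", "Bob", "Cy"],
})
new = pd.DataFrame({
    "title_year": ["A (2001)", "A (2001)", "D (2004)"],
    "primaryName": ["Ann", "Ann", "Dee"],
    "category": ["actress", "self", "actor"],
})
keys = ["title_year", "primaryName"]

# The (A, Ann) key appears twice in new, so the left merge emits it twice.
left = old.merge(new, on=keys, how="left")
inner = old.merge(new, on=keys, how="inner")

# Dropping duplicate keys on the right keeps the left merge at len(old) rows.
left_dedup = old.merge(new.drop_duplicates(subset=keys), on=keys, how="left")

print(len(old), len(left), len(inner), len(left_dedup))  # 3 4 2 3
```

If one row per old record is what we want, deduplicating the right side first (or passing `validate="many_to_one"` to `merge` so it raises on duplicates) might be worth trying before comparing counts.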

Q. For new, isn't a 1:3 ratio of titles to title-person rows too small? (Can titles with ~10 cast members be rare enough, and titles with 1 cast member, e.g. documentaries, common enough, to explain this?)
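One way to check this would be to look at the distribution of rows per title rather than only the overall ratio. A rough sketch, assuming the principals table is loaded with `tconst`, `nconst`, and `category` columns as in the grep output above; the rows here are made up:

```python
import pandas as pd

# Invented rows in the shape of movie_principals.tsv.
principals = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt1", "tt2", "tt3", "tt3"],
    "nconst": ["nm1", "nm2", "nm3", "nm4", "nm1", "nm5"],
    "category": ["actor", "actress", "director", "self", "actor", "writer"],
})

# Number of title-person rows for each title.
rows_per_title = principals.groupby("tconst").size()

# Average rows per title (the "1:N" ratio in the question).
mean_rows = rows_per_title.mean()

# How many titles have 1 row, 2 rows, 3 rows, and so on.
distribution = rows_per_title.value_counts().sort_index()

print(mean_rows)
print(distribution)
```

On the real table, a large share of titles with a single row would pull the average down toward 3 even if many titles have close to 10 rows.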

hyunjimoon commented 1 year ago

less urgent notes

'O'Connell, Colette Hello Au Revoir (2018) [Letty O]



- `Finbar 'Finchie''Coveney ` becomes  `Finbar 'Coveney` after processing

hyunjimoon commented 1 year ago

300k entries share a parentTconst but have different startYear values

(-> treat them as distinct, since that could cancel out more of the old rows?) image
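To see which parentTconst groups span more than one startYear, something like the following could work. It assumes a single table that already has `tconst`, `parentTconst`, and `startYear` together (in practice this may need a join first); the rows are invented:

```python
import pandas as pd

# Invented episode rows; tt1 spans two years, tt2 only one.
episodes = pd.DataFrame({
    "tconst": ["tt10", "tt11", "tt12", "tt20", "tt21"],
    "parentTconst": ["tt1", "tt1", "tt1", "tt2", "tt2"],
    "startYear": [2011, 2011, 2012, 2015, 2015],
})

# Count distinct start years under each parent title.
years_per_parent = episodes.groupby("parentTconst")["startYear"].nunique()

# Parents whose children do not all share one startYear.
multi_year_parents = years_per_parent[years_per_parent > 1].index

# All rows belonging to those parents.
mixed = episodes[episodes["parentTconst"].isin(multi_year_parents)]

print(list(multi_year_parents), len(mixed))
```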

hyunjimoon commented 1 year ago

Compared to the old data, which shows at least a hundred person-title rows for Suits (2011), the new data (movie_principals) includes only 5 rows (for the entire series, not just one season) image

suits.csv