cooperhewitt / the-pen-data

Open Data generated by Cooper Hewitt's Pen
http://www.cooperhewitt.org/open-source

mismatch with object IDs #3

Closed mdlincoln closed 8 years ago

mdlincoln commented 8 years ago

IIUC, refers_to_object_id should match up with the object IDs made available in https://github.com/cooperhewitt/collection

However, when I try to join the pen data to the objects.csv created by bin/generate-csv-objects.py, I find that the vast majority of refers_to_object_id values are not found in the id column of the collection data:

pen <- read.csv("data/pen-collected-items.csv")

# objects.csv is the file generated from bin/generate-csv-objects.py
ch <- read.csv("objects.csv")

# Unique IDs in the pen data
pen_ids <- unique(pen$refers_to_object_id)
length(pen_ids)
#> [1] 119971

# Unique IDs in all object data
ch_ids <- ch$id
length(ch_ids)
#> [1] 194316

# count of pen IDs found in object data
sum(pen_ids %in% ch_ids)
#> [1] 7138

# count of pen IDs _not_ found in object data
sum(!(pen_ids %in% ch_ids))
#> [1] 112833

Am I trying to associate with the wrong column?

micahwalter commented 8 years ago

Thanks Matthew. I'll have a look. Have you tried filtering for tool_id == 0?

mdlincoln commented 8 years ago

If I filter for tool_id == 0, the count of unique refers_to_object_id not found in the full collection object IDs drops to 325. I take this to mean that the non-pen-created IDs may not necessarily be for objects in the collection? What are they, then?
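For readers following along, the filtered comparison described above can be sketched in base R. The two toy data frames are made-up stand-ins for data/pen-collected-items.csv and objects.csv; only the column names tool_id, refers_to_object_id, and id come from the thread, and every value is invented for illustration.

```r
# Toy stand-ins for the real files; all values are invented.
pen <- data.frame(
  tool_id = c(0, 0, 0, 3, 5),
  refers_to_object_id = c(101, 102, 999, 1427987696, 1428002213)
)
ch <- data.frame(id = c(101, 102, 103))

# Keep only rows with tool_id == 0 (subset() also drops rows where tool_id is NA).
pen_tool0 <- subset(pen, tool_id == 0)

# Unique IDs among those rows that have no match in the objects table.
unmatched <- setdiff(unique(pen_tool0$refers_to_object_id), ch$id)
length(unmatched)
```

Against the real files, the same steps would be run on the data frames returned by read.csv() rather than on the toy ones.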

micahwalter commented 8 years ago

Matthew, rows with tool_id == 0 are things collected by the pen from our wall labels and interactive tables. These should all have a refers_to_object_id. All other tool_ids are applications that allow our visitors to "create" things, so they don't refer to a specific object. It looks like instead of a refers_to_object_id we are sticking in a timestamp, which is confusing, and I will look into changing that so it is just NULL instead.
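A minimal sketch of the cleanup proposed here, assuming the data has been loaded into an R data frame. This is not the maintainers' actual script; the toy values are invented, and NA is used as R's closest equivalent to a NULL cell.

```r
# Toy data; only the column names come from the thread.
pen <- data.frame(
  tool_id = c(0, 0, 2, 4),
  refers_to_object_id = c(18187423, 68764321, 1427987696, 1428002213)
)

# Rows created by visitor applications (tool_id != 0) do not point at a
# collection object, so blank out whatever value is stored there.
pen$refers_to_object_id[pen$tool_id != 0] <- NA

pen
```

Rows with tool_id == 0 keep their original IDs, so a later join against the objects table is unaffected.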

mdlincoln commented 8 years ago

Ahh, that makes sense, as does the proposed change. Though I guess there are still those 325 refers_to_object_id values (well, 324 if you don't count the NA/blank value) in the tool_id == 0 entries that don't match any object ID, so I'm still not sure what is happening there.

micahwalter commented 8 years ago

Ok, I've created a new branch with a cleaned up dataset. Please have a look at 4ed4ebbb268a1b168cbc66c10b511284c49840ff which sets the refers_to_object_id to 0 for all rows where tool_id != 0

Let me know what you think...

-m

mdlincoln commented 8 years ago

I checked the new dataset, and it's true that all refers_to_object_id values are 0 for rows where tool_id != 0.
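One way to run that kind of check, sketched in base R on invented toy data. The name pen_raw mirrors the data frame used in this comment; the values are assumptions for illustration only.

```r
# Toy stand-in for the cleaned dataset; values are invented.
pen_raw <- data.frame(
  tool_id = c(0, 0, 1, 6),
  refers_to_object_id = c(18187423, 68764321, 0, 0)
)

# Rows produced by the "create" applications.
created <- subset(pen_raw, tool_id != 0)

# TRUE only if every such row carries the placeholder value 0.
all(created$refers_to_object_id == 0)

# A tabulation makes any stray value (or NA) easy to spot.
table(created$refers_to_object_id, useNA = "ifany")
```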

However, there are now 631 refers_to_object_id values not found in the table of object IDs (again, this is all for tool_id == 0):

> setdiff(filter(pen_raw, tool_id == 0)$refers_to_object_id, ch$id)
  [1]   68268431   68764321   68764307   68764215   68764253   68764317   68775175   68775167   68764299     682460
 [11]   68268457   35520981   68774667   68775145   68764357   68245665   68764323   68764331   68764335   68764225
 [21]   68764285   68764199   68764337   68764287   68764397   68775189   68764271   68764289   68775143   68782481
 [31]   68764315   68268039   68782479   68774663   68764265   68775187   18187423   68764319   68764333   68268161
 [41]   68813715   18709775   35457409   35520983   68764309   51497589   68246009   68250509   18693285   68764305
 [51]   35520953 1429061025   68883101   68885261   69113423   69113425   68890011   68881899   68883035   68883325
 [61]   69166469   68764195   68883485   68883491   69129735   68833531   68833533   68833545   84995375   84995371
 [71]   84995373   68889901   68814023  102199991   18705229          0    6911349  691113423    8006405  850064599
 [81]   68743193    6268255  682682511   15460289   68745573    1875837  187558429  102391977  152749789  102391981
 [91]  102391985  135706757   69167759  404577581   68268453   68743489   68743501   69154985  136252037  136252039
[101]  136252041  136252043  136252045    8500465  855006457    1870931    1874073    6268299  354474659  682245681
[111]   13625243 1362552039 1362524889         NA    6825813  688250815  850064479    6828299  354744659    1874027
[121]    6824554    6826299  354746659   69155067  102335187  404529301   69129535    6820787  668250791   35350945
[131]   13625262   69155413   69155025   69155069   69155075   69154997   69154999   69155003   69155005   69155007
[141]   69155009   69155011   69155013   69155015   69155017   69155021   69155023   69155045   69155047   69155049
[151]   69155051   69155053   69155059   69155061   69155063   69155065   69155077   69155081   69155083   69155087
[161]   69155093   69155099   69155119   69155125   69155129   69155131   69155133   69155151   69155153   69155155
[171]   69155157   69155159   69155161   69155165   69155167   69155169   69155171   69155173   69155177   69155179
[181]   69155183   69155185   69155187   69155189   69155191   69155193   69155201   69155205   69155207   69155209
[191]   69155211   69155213   69155215   69155219   69155221   69155223   69155225   69155227   69155229   69155231
[201]   69155233   69155241   69155249   69155251   69155255   69155259   69155261   69155263   69155265   69155269
[211]   69155277   69155279   69155281   69155331   69155333   69155337   69155339   69155347   69155349   69155351
[221]   69155353   69155355   69155359   69155363   69155365   69155367   69155369   69155373   69155407   69172057
[231]   69172059   69172061   69172063   69172067   69172069   69172071   69172073   69172075   69172077   69172079
[241]   69172081   69172085   69172087   69172089   69172091   69172093   69172095   69172097   69172099   69172103
[251]   69172105   69172107   69172109   69172111   69192417   69192419   69192421   69192431   69192433   69192435
[261]   69192437   69192439   69192443   69192445   69192449   69192451   69192453   69192455   69192457   69192461
[271]   69192463   69192465   69192467   69192469   69192471   69192473   69192475   69192479   69192481   69192483
[281]   69192505   69192507   69192509   69192517   69192519   69192521   69192523   69192525   69192527   69192529
[291]   69192533   69192535   69192537   69193859   69193867   69193869   69193871   69193873   69193875   69193877
[301]   69193879   69193883   69193885   69193887   69193889   69193891   69193893   69193895   69193897   69193901
[311]   69193903   69193905   69193907   69193909   69193911   69193913   69193915   69193921   69193925   69193927
[321]   69193929   69193931   69193933  102199993  102199997  102335183  102335185  102335189  102335191  135918413
[331]  135918421  135918427  135918429  135918431  135918443  135918447  136300679  404529303  404529305  404529307
[341]  404529311  404529313  404529315  404529317  404529319  404529321  404529323  404529325  404529329  404529331
[351]  404529333  404529335  404529337  404529339  404529341  404529343  404529347  404529349  404529351  404529591
[361]  404584055  404584057  404584275  404584277  404584279  404584283  404584285  404584287  404584289  404584291
[371]  404584293  404584295  404584297  404584301  404584303  404584305  404584307  404584309  404584311  404584313
[381]  404584315  404584319  404584321  404584323  404584325  404584327  420560745  420565457  420565459  420565465
[391]  420565477  420565483  420565485  420565487  420565489  420565493  420565495  420565501  420565503  420565507
[401]  420565513   69155275   69155335   69155381   69155383   69155385   69155387   69155389   69155391   69155395
[411]   69155399   69155401   69155403   69155405   69155057   69155377   69192511   69192515   69155197   69155203
[421]   69155027   69155029   69155031   69155041  420565463   69192485   69192487   69192489   69192491   69192493
[431]   69192497   69192499  404734343  404734345  152749795 1355918421    6915506  691555059 1416270910 1416270945
[441] 1425682956 1427996726 1428000917 1427728880 1427997175 1427994804 1427994173 1427994785 1427994988 1427993898
[451] 1427994896 1427995766 1427992990 1427996760 1427993319 1427994457 1427994549 1427997049 1427992978 1427993219
[461] 1427991251 1427991707 1427987696 1427992099 1427987081 1427988011 1427987524 1427991672 1427987070 1427987125
[471] 1427987256 1427987343 1427987821 1427987548 1427996676 1427987011 1427987118 1427987243 1427987292 1427991485
[481] 1427988153 1427988391 1427988287 1427994271 1427987056 1427987093 1427988192 1427987772 1427987689 1427990243
[491] 1427992084 1427991961 1427992091 1427989879 1427990066 1427988952 1427988851 1427988898 1427992489 1427992405
[501] 1427992222 1427992340 1427994790 1427995492 1427995276 1427995611 1427992824 1427998034 1427997782 1427998021
[511] 1427998334 1427998476 1427998503 1427998518 1427998566 1427998627 1427998911 1427997590 1427997729 1427997744
[521] 1427997809 1427997817 1427997827 1427997671 1427997687 1427997804 1427998052 1427998320 1427998458 1427998484
[531] 1427998492 1427998585 1427998637 1427998846 1427998890 1427998908 1427998893 1427998901 1427996320 1427996489
[541] 1427996612 1427999818 1427996314 1427995033 1427995076 1427995208 1427997992 1427998163 1427998448 1428001333
[551] 1428002213 1428002326 1428002569 1428002021 1428002110 1428002263 1428002464 1428002522 1428002382 1428006103
[561] 1428005848 1428005993 1428000737 1428001747 1428002048 1428002133 1428001886 1428001965 1428002206 1427999593
[571] 1428001267 1428004099 1428003702 1428007764 1428007855 1428007912 1428004360 1428005652 1428005941 1428005902
[581] 1428004265 1428004821 1428005576 1428007103 1428006699 1428007645 1428007815 1428008878 1428010676 1428010974
[591] 1428008888 1428010163 1428010533 1428010350 1428011448 1428075511 1428075367 1428075726 1428075297 1428075971
[601] 1428074030 1428074293 1428077365 1428077264 1428075533 1428076227 1428076387 1428076741 1428077689 1428078030
[611] 1428079021 1428078196 1428081162 1428080727 1428081762 1428080039 1428078571 1428082909 1428079037 1428081409
[621] 1428081421 1428082072 1428080293 1428080265 1428082581 1428081516 1428081312 1428081322 1428081849 1428082845
[631] 1428082989

micahwalter commented 8 years ago

Ah, I see what's happening. Those 631 things (there are more now, since this is newer than the original release) are objects that are not currently set to public on our collections site. For example, the first object you have listed, https://collection.cooperhewitt.org/objects/68268431, should show you a "not authorized" page when you load it in a browser. If an object isn't public on the website, it probably doesn't get added to the collection data on GitHub.

There are some that still come back as not found. I'm not sure what's going on there, and then there are some like https://collection.cooperhewitt.org/objects/69192497 that do work, but likely haven't been updated in the GitHub repo as of yet.

-m
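A rough triage of the unmatched values can help separate these cases. The sketch below takes a handful of IDs from the listing above and buckets them by magnitude. The cutoff of 1.4e9 is purely an assumption: it treats very large values as possible leftover Unix timestamps, which the thread does not confirm for tool_id == 0 rows.

```r
# A few values copied from the setdiff() output above.
ids <- c(68268431, 69192497, 0, NA, 1429061025, 1427987696)

# Assumed cutoff: values at or above this are flagged as timestamp-like.
cutoff <- 1.4e9

# Bucket each value: blank/zero, possibly a timestamp, or a plausible object ID.
kind <- ifelse(is.na(ids) | ids == 0, "missing",
        ifelse(ids >= cutoff, "timestamp-like", "object-id-like"))

# Show the timestamp-like values as dates to judge whether they look plausible.
as_time <- as.POSIXct(ifelse(kind == "timestamp-like", ids, NA),
                      origin = "1970-01-01", tz = "UTC")

data.frame(id = ids, kind = kind, as_time = as_time)
```

Values in the object-id-like bucket are the ones worth checking against the collections site for the "not authorized" and "not found" cases described above.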

mdlincoln commented 8 years ago

Ahhhh, that makes sense! Depending on if/how you update the collection data repo to represent not-yet-authorized objects, it'd be great to have that documented in this repo's README as well - even if the answer is just "IDs not found in the objects table are just not public yet".

Good luck tracking down those other missing IDs - and thanks for checking all this out!

micahwalter commented 8 years ago

ok great. So I merged in the new dataset and updated the readme. Closing this now...