Mardak / profile


Refactored NYT cleansing #14

Closed mzhilyaev closed 10 years ago

mzhilyaev commented 10 years ago

Path cleansing:

Domain whitelisting: Closes #13

Mardak commented 10 years ago

Comments from before:

Max Zhilyaev
1) "path": "/2009/01/27/world/europe", http://www.nytimes.com/2014/01/24/world/europe/ukraine.html?ref=world&gwh=80FF9BE765A4061FC2D1FBC55E9571A9&gwt=pay
2) "host": "developer.nytimes.com" should not be included
3) "host": "prototype.nytimes.com" should not be included

Max Zhilyaev "path": "/_DATE/a-game-that-deals-in-personal-data" still shows the title, because its link is missing .html: http://bits.blogs.nytimes.com/2013/07/10/a-game-that-deals-in-personal-data/?_php=true&_type=blogs&_r=0

Max Zhilyaev This one will also contain the date after cleansing: http://www.nytimes.com/interactive/2013/12/10/world/europe/ukraine-timeline.html?gwh=56703AEAC86D3A6FF1249EB99FCA2E32&gwt=pay

Max Zhilyaev This link, http://www.nytimes.com/video/dining/100000002663579/the-women-in-the-kitchen.html, is the same as this link: http://www.nytimes.com/video/dining/100000002663579/. So 100000002663579 serves as the ID for a video and needs to be eliminated.
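For reference, the cases raised in these comments can be sketched as a single cleansing function. This is only an illustration under stated assumptions, not the code in this pull request: the function name `cleanse`, the `EXCLUDED_HOSTS` set, the use of a `_DATE` placeholder for `/YYYY/MM/DD`, the "last segment ends in .html or contains a hyphen" title heuristic, and the "six or more digits" ID rule are all hypothetical choices made to match the example URLs above.

```python
import re
from urllib.parse import urlsplit

# Hypothetical exclusion list: just the two hosts named in the comments.
EXCLUDED_HOSTS = {"developer.nytimes.com", "prototype.nytimes.com"}

# /YYYY/MM/DD followed by another segment or the end of the path.
DATE_RE = re.compile(r"/\d{4}/\d{2}/\d{2}(?=/|$)")
# Assumption: a segment of six or more digits is an opaque ID (e.g. a video ID).
ID_RE = re.compile(r"^\d{6,}$")


def cleanse(url):
    """Return {"host", "path"} for a URL, or None if the host is excluded."""
    parts = urlsplit(url)  # the query string (gwh, gwt, ref, ...) is dropped
    host = parts.hostname or ""
    if host in EXCLUDED_HOSTS:
        return None

    # Replace the date with a placeholder so it does not leak into the path.
    path = DATE_RE.sub("/_DATE", parts.path)
    segments = [s for s in path.split("/") if s]

    # Assumed title heuristic: drop a trailing "*.html" segment, or a trailing
    # hyphenated slug for links that are missing ".html".
    if segments and (segments[-1].endswith(".html") or "-" in segments[-1]):
        segments.pop()

    # Drop numeric IDs so both forms of a video link cleanse to the same path.
    segments = [s for s in segments if not ID_RE.match(s)]
    return {"host": host, "path": "/" + "/".join(segments)}
```

With these rules both video links above reduce to `/video/dining`, and the blog link without `.html` reduces to `/_DATE`. The hyphen heuristic is deliberately crude: it would also drop a hyphenated section name at the end of a path, so a real implementation would likely need a stricter test for titles.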