Closed jimpo closed 11 years ago
To make this more complicated, it looks like old site URLs were longer at some point. Someone posted a support message (on October 8, 2012) with this URL: http://www.dukechronicle.com/article/potti-letter-addresses-scandal-former-researcher-calls-rhodes-controversy-misunderstanding
which now redirects to: http://www.dukechronicle.com/article/potti-letter-addresses-scandal-former-researcher-c
This article probably had a lot of Facebook recommends, which are now lost because the URL was changed (I guess cutting them all was wrong).
I think the right solution might be to run an Open Graph lookup on every URL we have used, look at the recommends, and pick the one with the most as the URL we use with OG. Then, like we discussed, we save this URL to use with OG. The new URL format can be saved as canonical, and all old URLs (including the OG one) will redirect to the new /articles/ URL. If we put the OG URL in the og:url field, I think Facebook will keep identifying the page by that ID as long as we keep all the redirects. Switching to /articles will make this much easier, since we know every route to /article will need to be 301'd.
And in the future, we save the first URL as the OG URL and always put that in the OG tag, regardless of whether the actual URL has changed.
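As a rough sketch of that last idea (not the site's actual code), an article could carry two URLs: the current canonical one and the OG one that is frozen at first publish. The `Article` class, its field names, and `head_tags` are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Article:
    # Hypothetical model; the field names are assumptions, not the real schema.
    canonical_url: str  # current URL, may change if the slug changes
    og_url: str         # first published URL, kept fixed so Facebook counts stay attached


def head_tags(article: Article) -> str:
    """Render the canonical link and the og:url meta tag for an article page."""
    return "\n".join([
        f'<link rel="canonical" href="{escape(article.canonical_url, quote=True)}">',
        f'<meta property="og:url" content="{escape(article.og_url, quote=True)}">',
    ])
```

The point of keeping the two fields separate is that a slug change only ever touches `canonical_url`; `og_url` is written once and then left alone.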
URL schemes for popular news sites to contemplate:
NYT has no fallbacks: http://www.nytimes.com/2013/02/05/us/politics/solicitor-generals-predicament-on-gay-marriage.html
Washington Post: everything is bullshit but the hash at the end (visit in incognito; too many visits and they'll want you to pay) http://www.washingtonpost.com/world/national-security/kerry-takes-the-helm-at-state-signaling-early-push-on-peace-in-middle-east/2013/02/04/f7c074d2-6efd-11e2-8b8d-e0b59a1b8e2a_story.html http://www.washingtonpost.com/f7c074d2-6efd-11e2-8b8d-e0b59a1b8e2a_story.html
NPR: The ID and year are all you need, redirects to the original http://www.npr.org/2013/02/04/170482802/are-mini-reactors-the-future-of-nuclear-power http://www.npr.org/2013///170482802/
TechCrunch/TIME/WSJ Blogs: handle chewed-up URLs, redirecting to the original until they can't differentiate, then they just pick one (perhaps alphabetically?) http://techcrunch.com/2013/02/04/jawbone-acquires-mobile-health-startup-massive-health-for-tens-of-millions/ http://techcrunch.com/2013/02/04/jawbone-acquires- http://techcrunch.com/2013/02/04/jawb http://techcrunch.com/2013/02/04/ja --> jailbreaking-is-back-new-evasi0n-software-works-on-most-ios-6-06-1-devices-including-iphone-5
USA Today uses a prefix (/story/) and an ID at the end to find an article, but throws a 404 if you don't follow their crappy regex. Does not redirect. http://www.usatoday.com/story/gameon/2013/02/04/ravens-lost-super-bowl-trophy/1891285/ http://www.usatoday.com/story/lalapoopoo/0000/00/00/weeeeeeeeeeee/1891285/
LA Times dies if the URL doesn't match; the date and ID are at the end. Chicago Tribune uses the same CMS. http://www.latimes.com/business/technology/la-fi-tn-hp-chromebook-pavilion-official-20130204,0,6726984.story
Reuters http://www.reuters.com/article/2013/02/04/us-usa-guns-obama-idUSBRE9130KL20130204
CNN is a mixed bag. Their blogs behave like TC/WSJ/TIME (probably all WordPress) http://news.blogs.cnn.com/2013/02/04/ahmadinejad-jokes-hed-volunteer-to-go-to-space/ http://news.blogs.cnn.com/2013/02/04/ahmadinejad-
...but CNN front page content can't be chomped http://edition.cnn.com/2013/02/04/world/europe/spain-corruption-scandals/index.html?hpt=hp_t2 http://edition.cnn.com/2013/02/04/world/europe/spain-c DIES
Huffington Post just needs the ID at the end, redirects to the original http://www.huffingtonpost.com/2013/02/04/congress-jobs_n_2615210.html http://www.huffingtonpost.com/_n_2615210.html
Forbes gives their authors a namespace http://www.forbes.com/sites/halahtouryalai/2013/02/04/us-to-sue-sp-but-not-moodys-not-goldman-what-gives/
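Two of the behaviors in the list above are easy to sketch: looking an article up by an ID segment and ignoring the rest of the path (NPR/Huffington Post style), and matching a chewed-up slug by prefix (the TechCrunch-style blogs). This is only an illustration under assumptions: the `ARTICLES` index, the IDs, and the paths are made up, and the alphabetical tie-break is just the guess from the list, not confirmed behavior of any of these sites.

```python
from typing import Dict, List, Optional

# Hypothetical index: article ID -> canonical path (made-up data).
ARTICLES: Dict[str, str] = {
    "1001": "/2013/02/04/1001/jawbone-acquires-massive-health",
    "1002": "/2013/02/04/1002/jailbreaking-is-back",
}


def resolve_by_id(path: str) -> Optional[str]:
    """Return the canonical path for the first path segment that is a known ID."""
    for segment in path.strip("/").split("/"):
        if segment in ARTICLES:
            return ARTICLES[segment]
    return None


def resolve_by_slug_prefix(prefix: str, slugs: List[str]) -> Optional[str]:
    """Return the alphabetically first slug starting with `prefix`, or None."""
    candidates = sorted(s for s in slugs if s.startswith(prefix))
    return candidates[0] if candidates else None
```

With an ID in the URL, a request for something like `/2013///1001/` still resolves and can be redirected to the canonical path, so the slug is free to change without breaking old links.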
I think we should discuss it at the meeting Sunday.
We can figure out the number of shares by doing a call to Open Graph: /?ids=http://www.dukechronicle.com/article/potti-letter-addresses-scandal-former-researcher-calls-rhodes-controversy-misunderstanding
which returns something like:

    {
      "http://www.dukechronicle.com/article/potti-letter-addresses-scandal-former-researcher-calls-rhodes-controversy-misunderstanding": {
        "id": "http://www.dukechronicle.com/article/potti-letter-addresses-scandal-former-researcher-calls-rhodes-controversy-misunderstanding",
        "shares": 3
      }
    }
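A minimal sketch of building that lookup and reading the response, without making the HTTP request itself. The host in `GRAPH_BASE` is an assumption (the comment above only shows the `/?ids=` part), the function names are made up, and treating a missing `shares` field as 0 is also an assumption.

```python
import json
from typing import Dict
from urllib.parse import urlencode

# Assumption: the Graph API host; the thread only shows "/?ids=...".
GRAPH_BASE = "https://graph.facebook.com/"


def build_lookup_url(article_url: str) -> str:
    """Build the lookup URL for one article URL, percent-encoding it."""
    return GRAPH_BASE + "?" + urlencode({"ids": article_url})


def parse_share_counts(body: str) -> Dict[str, int]:
    """Map each URL in a response shaped like the one above to its share count."""
    data = json.loads(body)
    # Assumption: "shares" may be absent for a URL nobody shared; count it as 0.
    return {url: info.get("shares", 0) for url, info in data.items()}
```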
So once we go through and figure out what the new routing is going to be, we can write a script to find out which of the previous URLs should be the og:url.
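The selection step of such a script could look roughly like this, assuming we already have the share count for each URL an article has ever had. `choose_og_url` is a hypothetical name, and falling back to the oldest URL on a tie is an assumption chosen to match the "save the first URL" idea.

```python
from typing import Dict, List


def choose_og_url(urls_oldest_first: List[str], shares: Dict[str, int]) -> str:
    """Pick the URL with the most shares; on a tie, keep the oldest one."""
    if not urls_oldest_first:
        raise ValueError("article has no known URLs")
    # max() returns the first maximal item, so ties resolve to the oldest URL.
    return max(urls_oldest_first, key=lambda url: shares.get(url, 0))
```

For example, `choose_og_url(["/article/long-slug", "/article/short"], {"/article/long-slug": 3})` keeps the long slug, since the shortened URL has no shares recorded.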
It looks like Facebook is now able to follow redirects and should still use whatever is in the og:url tag for the ID, so I don't think we have to maintain an empty page at the old /article/ URL for OG, but we will need to double-check after implementation.
Instead of /article/--, the route should be /articles////. Also consider the issue of changing slugs.
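For the redirects and the changing-slug issue, one possible approach is a table of every slug an article has ever had under /article/, each pointing at its current path. This is a sketch only: the `REDIRECTS` table, its entries, and `redirect_for` are hypothetical, and the target path is a placeholder, not the real new route format.

```python
from typing import Dict, Optional, Tuple

# Hypothetical table: every old /article/ slug -> the article's current path.
# Several old slugs may point at the same article if its slug changed over time.
REDIRECTS: Dict[str, str] = {
    "old-long-slug": "/articles/placeholder-new-path",
    "old-long-s": "/articles/placeholder-new-path",
}


def redirect_for(path: str) -> Optional[Tuple[int, str]]:
    """Return (301, target) for a known old /article/ path, else None."""
    prefix = "/article/"
    if not path.startswith(prefix):
        return None  # already on the new scheme, or not an article at all
    target = REDIRECTS.get(path[len(prefix):].strip("/"))
    return (301, target) if target else None
```

Because "/articles/..." does not start with "/article/", paths on the new scheme fall through untouched, and an unknown old slug returns None so the caller can serve a 404.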