DocNow / diffengine

track changes to the news, where news is anything with an RSS feed
MIT License

Non-canonical URLs pointing at the same URL from multiple feeds result in duplicates #17

Closed: ryanfb closed this issue 7 years ago

ryanfb commented 7 years ago

For example: https://twitter.com/search?f=tweets&q=fox_diff%20Former%20President%20Bush%20intensive%20care&src=typd

All the Archive URLs are the same. However, in the logs I see both:

checking http://feedproxy.google.com/~r/foxnews/politics/~3/_eSUNInb95I/bush-41-cant-make-inauguration-tells-trump-sitting-outside-could-put-me-six-feet-under.html
checking http://feedproxy.google.com/~r/foxnews/most-popular/~3/_eSUNInb95I/bush-41-cant-make-inauguration-tells-trump-sitting-outside-could-put-me-six-feet-under.html

Which I'm guessing is how these duplicates are getting tweeted.

Maybe we need to do some URL de-referencing/canonicalization before storing/checking URLs from feeds? If I run curl -I on those feedproxy URLs, I get a 301 response with a semi-canonical URL in the Location header (its query parameters would still need to be stripped).
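A minimal sketch of what that de-referencing step might look like, using only the Python standard library. This is not diffengine's code: the function names strip_params, resolve, and canonicalize are hypothetical, and dropping the whole query string and fragment is an assumption that suits these feedproxy links but could be too aggressive for sites that identify articles by a query parameter.

```python
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def strip_params(url):
    """Drop the query string and fragment; keep scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def resolve(url, timeout=10):
    """Return the URL that the given URL finally redirects to.

    urlopen follows redirects (such as the 301 from feedproxy) on its own;
    geturl() reports the URL of the final response. The first request is
    sent as HEAD so that no body is downloaded for it.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()


def canonicalize(url, resolver=resolve):
    """Resolve redirects, then strip parameters (hypothetical helper).

    The resolver argument exists so the network step can be swapped out,
    for example in tests or when a cache of resolved URLs is available.
    """
    return strip_params(resolver(url))
```

With something like this in place, both feedproxy URLs above would be passed through canonicalize() before being stored or checked, so two feeds that point at the same article would end up with the same key, provided the redirects land on the same path.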

edsu commented 7 years ago

Thanks for this! I think that the canonical URL of the EntryVersion needs to be used to find the latest version, rather than the Entry. See this code.
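To illustrate the idea of looking up the latest version by canonical URL, here is a small in-memory sketch. It does not reflect diffengine's actual models or database queries: the function latest_by_canonical_url and the dictionary keys "feed_url", "url", and "created" are assumptions made for the example.

```python
def latest_by_canonical_url(versions):
    """Map each canonical URL to the newest version recorded for it.

    `versions` is an iterable of dicts with hypothetical keys:
      "feed_url": the URL as it appeared in the feed
      "url":      the canonical URL of the fetched page
      "created":  a sortable timestamp
    """
    latest = {}
    for version in versions:
        key = version["url"]  # group by canonical URL, not by feed URL
        if key not in latest or version["created"] > latest[key]["created"]:
            latest[key] = version
    return latest


# Two feed entries that resolve to the same article.
versions = [
    {"feed_url": "http://feeds.example/politics/story",
     "url": "http://www.example.com/story.html", "created": 1},
    {"feed_url": "http://feeds.example/most-popular/story",
     "url": "http://www.example.com/story.html", "created": 2},
]
latest = latest_by_canonical_url(versions)
```

Because the lookup key is the canonical URL, the two feed entries collapse into a single tracked article, and a new version would be compared against the most recent one regardless of which feed it arrived through.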