bitsoffreedom / newspeak

Newspeak of the Dutch government.
https://rejo.zenger.nl/inzicht/newspeak-van-de-nederlandse-overheid
BSD 3-Clause "New" or "Revised" License

Infrastructure for removing duplicates #2

Open dokterbob opened 11 years ago

dokterbob commented 11 years ago

Sometimes, feed items are duplicated across streams. To avoid sending new notifications for information that has already been seen, some infrastructure for deduplication needs to be worked out.

@rejozenger Examples, please. ^^

rejozenger commented 11 years ago

Unfortunately, I can't attach anything other than images here, so here is a link to three different representations of what is basically the same information: https://rejo.zenger.nl/files/newspeak-example-duplication.zip. At least one of the two representations from officielebekendmakingen.nl appeared in the feed https://zoek.officielebekendmakingen.nl/kamervragen_aanhangsel/rss; the publication at rijksoverheid.nl appeared in http://feeds.rijksoverheid.nl/kamerstukken.rss.

dokterbob commented 11 years ago

@rejozenger As RSS feeds change over time, please attach the actual RSS resource (or a dump of it) in the future. The current state of the RSS feed no longer contains any reference to the mentioned documents.

Note about the documents: the versions on 'officiëlebekendmakingen' have a corresponding 'Aanhangselnummer'. This could be a lead for preliminary deduplication.
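A rough sketch of what deduplicating on the 'Aanhangselnummer' could look like, in Python. Everything here is an assumption for illustration: the item layout (plain dicts with `title` and `summary` keys), the helper names `dedup_key` and `deduplicate`, and the regular expression, which assumes the number appears in the text as the word "Aanhangselnummer" followed by digits. Items without a recognisable number are kept, since nothing can be said about them, so this would only catch the officiëlebekendmakingen duplicates.

```python
import re

# Assumed format: "Aanhangselnummer", optional colon/whitespace, then digits.
# The real feeds may expose the number differently.
AANHANGSEL_RE = re.compile(r"aanhangselnummer\s*:?\s*(\d+)", re.IGNORECASE)


def dedup_key(item):
    """Return the Aanhangselnummer found in a feed item, or None.

    `item` is a hypothetical dict with optional 'title' and 'summary' keys.
    """
    text = " ".join([item.get("title", ""), item.get("summary", "")])
    match = AANHANGSEL_RE.search(text)
    return match.group(1) if match else None


def deduplicate(items):
    """Keep the first item per Aanhangselnummer; keep all items without one."""
    seen = set()
    unique = []
    for item in items:
        key = dedup_key(item)
        if key is None:
            # No number found: we cannot tell whether it is a duplicate.
            unique.append(item)
        elif key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

In a real setup the `seen` set would have to be persisted between feed fetches (for example in the database), otherwise duplicates arriving in a later fetch would still trigger a notification.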

rejozenger commented 11 years ago

I have uploaded a compressed directory to https://rejo.zenger.nl/files/newspeak.zip. It contains three directories, each with a dump of the RSS feed in which the item appeared, a saved version of the HTML file the RSS feed was pointing to, and a saved copy of the PDF file the HTML file was pointing to. It also includes "items-seen.txt", which shows all of the occurrences over time. I haven't investigated, but I am pretty sure that not all nine versions appeared in the RSS feed (using the legacy code).