bitsoffreedom / newspeak

Newspeak of the Dutch government.
https://rejo.zenger.nl/inzicht/newspeak-van-de-nederlandse-overheid
BSD 3-Clause "New" or "Revised" License

Automatically detect and embed attachments #3

Closed: dokterbob closed this issue 11 years ago

dokterbob commented 11 years ago

Many of the feeds' pages actually contain just a PDF file. The aim is to include these as an 'attachment' in the final feed so they can be automatically included in feed readers.

TODO

  1. Find out whether including PDF files in feeds actually results in inline PDFs in feed readers.
  2. If so, implement a smart way to find linked PDF files at the destination URL (if not already present).
  3. Make sure the PDF links are included in the output feed.

rejozenger commented 11 years ago

As discussed, it would be interesting to include in the feed the contents of the article that each item points to. Sometimes this is the content of the page itself (http://www.rijksoverheid.nl/nieuws/2012/11/28/nieuwe-maatregel-in-strijd-tegen-kinderpornografie.html), sometimes it is the content of the PDF that is linked on that page (http://www.rijksoverheid.nl/documenten-en-publicaties/kamerstukken/2012/10/03/voortgang-aanpak-kinderpornografie.html).

The content should be like this:

[Image: Screen Shot 2012-12-19 at 17 04 57]

Alternatively, the PDF (or its first page) could be embedded, or the contents of the PDF itself could be included, whichever is easier to implement.

We are using Newspeak with a variety of clients, including:

dokterbob commented 11 years ago

The approach for this will be:

  1. Follow the item's link and use a feed-specific XPath expression to locate the PDF URL.
  2. Determine the PDF file size, as required by the feed spec.
  3. Add the PDF to the item as a feed enclosure.
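
The three steps above could be sketched roughly as follows. This is a minimal illustration, not the code that ended up in Newspeak: the function names `find_pdf_url`, `get_remote_size` and `add_enclosure` are hypothetical, the page is assumed to be well-formed XHTML so that the standard library's limited XPath support is enough, and the server is assumed to answer a HEAD request with a `Content-Length` header. The `url`, `length` and `type` attributes are the ones RSS 2.0 requires on an enclosure.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def find_pdf_url(page_markup, xpath, base_url):
    """Return the absolute URL of the first PDF link matched by xpath, or None.

    Assumption: page_markup is well-formed XHTML. Real-world HTML usually
    needs a more forgiving parser.
    """
    root = ET.fromstring(page_markup)
    for link in root.findall(xpath):
        href = link.get("href", "")
        if href.lower().endswith(".pdf"):
            # Resolve relative links against the page they were found on.
            return urljoin(base_url, href)
    return None


def get_remote_size(url):
    """Ask the server for the file size with a HEAD request (no body download).

    Assumption: the server reports Content-Length; 0 is returned otherwise.
    """
    with urlopen(Request(url, method="HEAD")) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else 0


def add_enclosure(item, url, length, mime_type="application/pdf"):
    """Append an RSS 2.0 <enclosure> element to an <item> element."""
    return ET.SubElement(
        item, "enclosure", url=url, length=str(length), type=mime_type
    )
```

A feed-specific expression such as `.//div[@class='download']/a` (a made-up example) would be stored per feed and passed in as `xpath`; the result of `get_remote_size` then supplies the `length` argument of `add_enclosure`.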

A similar approach will be taken when no textual content is supplied in the description/summary. Perhaps we need an optional switch that allows overriding existing description/summary fields? Alternatively, we could place the crawled textual data in an optional 'content' field for inline display in the feed reader.
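
One way the two options might fit together is sketched below. It is only an illustration of the idea: the function `fill_content`, the `override` flag and the dictionary keys `summary` and `content` are assumptions made for this example, not Newspeak's actual model or field names.

```python
def fill_content(entry, crawled_text, override=False):
    """Return a copy of entry with the crawled text added.

    - Empty summary, or override=True: the crawled text becomes the summary.
    - Otherwise: the summary is kept and the text goes into 'content'.
    """
    result = dict(entry)  # leave the caller's entry untouched
    has_summary = bool((result.get("summary") or "").strip())
    if override or not has_summary:
        result["summary"] = crawled_text
    else:
        result["content"] = crawled_text
    return result
```

With this shape, the switch only matters for entries that already carry a summary; entries without one are always filled from the crawled page.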

dokterbob commented 11 years ago

The functionality described above is implemented and tested for PDF files in government documents. The output feed also works.