A possible candidate for following is the "card", when that is present.
We'd need:
A sane timeout, to avoid hanging when the host of the vacancy is unavailable or blocking.
TXT/HTML checking. PDF support can come later. Anything else should be discarded.
Length check. Anything longer than X bytes should be chopped off. 500 kB? The timeout will catch many of these too, but a very fast host might still serve us megabytes on which we then choke.
A sanitizer or semantic text analyzer, so we can parse HTML in a somewhat sane way and remove things like menus, footers and sidebars. What FLOSS options are there for this?
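
A rough sketch of how the four requirements above could fit together, using only the Python standard library. Everything named here is an assumption for illustration: the function names, the 10 second timeout, the 500,000 byte cap (taken from the "500kb?" guess) and the set of tags treated as menus/footers/sidebars. Note that the timeout in urlopen applies per blocking socket operation, not to the whole download, so a host that trickles data slowly would need an extra overall deadline.

```python
import urllib.request
from html.parser import HTMLParser

MAX_BYTES = 500_000        # assumed cap, from the "500kb?" guess
TIMEOUT_SECONDS = 10       # assumed value, tune as needed
ACCEPTED_TYPES = {"text/plain", "text/html"}
# Assumed stand-ins for "menus, footers, sidebars"; real pages often use divs.
SKIPPED_TAGS = {"nav", "footer", "aside", "header", "script", "style"}


def is_accepted(content_type):
    """True when the Content-Type header names plain text or HTML."""
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    return media_type in ACCEPTED_TYPES


def fetch_vacancy(url):
    """Return at most MAX_BYTES of the body, or None for unwanted types."""
    with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
        if not is_accepted(response.headers.get("Content-Type")):
            return None
        # Reading with a limit chops off anything beyond the cap.
        return response.read(MAX_BYTES)


class BoilerplateStripper(HTMLParser):
    """Collects text, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_boilerplate(html):
    """Return the visible text of html without the skipped sections."""
    parser = BoilerplateStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

This tag-based stripping is deliberately crude and will miss menus built from plain divs. Dedicated FLOSS extractors such as jusText, trafilatura or readability-lxml exist and would probably do better; whether any of them fits here is still to be evaluated.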