aflin / rampart

Old school JavaScript with fast C full text search, sql, lmdb, redis, https, websockets, curl, crypt and more.
https://rampart.dev/

2023-11: Raspberry Pi News Demo fixup #1

Open paulwratt opened 7 months ago

paulwratt commented 7 months ago

There are 3 URLs that are used to gather the news items for the Raspberry Pi News demo.

A few months ago (3-6 maybe?) the MakeUseOf website dropped their raspberry%20pi section (a site redesign maybe? I can't remember, but no content was being gathered). Today their website is a non-functioning black hole of 503 gateway errors, and by that I mean the web page never times out. The update aggregator script basically hangs waiting for the server to finish responding, and of course it never does.
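For anyone hitting the same hang, here is a minimal sketch of putting a hard deadline around a slow operation. It is plain JavaScript with Promises, not the aggregator's actual fetch code; `withTimeout` and its `label` argument are hypothetical names made up for this illustration, and it only stops waiting, it does not cancel the underlying request.

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms` milliseconds.
// Note: this only stops *waiting*; the underlying request is not aborted.
function withTimeout(promise, ms, label) {
    var timer;
    var deadline = new Promise(function (resolve, reject) {
        timer = setTimeout(function () {
            reject(new Error(label + " timed out after " + ms + " ms"));
        }, ms);
    });
    // Whichever settles first wins; always clear the timer afterwards.
    return Promise.race([promise, deadline]).finally(function () {
        clearTimeout(timer);
    });
}

// Usage: a promise that never settles stands in for a server that never responds.
var neverResponds = new Promise(function () {});
withTimeout(neverResponds, 20, "makeuseof").catch(function (err) {
    console.log("skipping source: " + err.message);
});
```

If the HTTP client in use has its own maximum-time option, that is the better fix, since it also tears down the connection.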

I can post a PR, but in case there is no response for some time (this repo was last updated 3 months ago), I'll post the raw changes here, for anyone else wanting to fix their own demo version.

As a side effect, I finally figured out how to add another content collection URL: https://www.cnx-software.com/news/raspberry-pi/ which I have been wanting to do ever since I first set up this Rampart demo a couple of years ago, so here is that as well.

Note that someone who knows the aggregator better than I do needs to look over the actual HTML source. Although it is a WordPress site, it is a bit quirky: it contains inline (not really) "thumbnails" and a whole bunch of <figure> tags, so I am not really sure I got everything into contentRemoveClass that should be there. (I compared it to the RaspberryPi.Org News HTML source to get a better understanding of exactly how the class items work.)

pi_news_aggregator.js:

.... <snip-line-70>
    },
    "cnxsoftware":
    {
        name: "cnxsoftware",
        url: "https://www.cnx-software.com/news/raspberry-pi/",
        urlNextFmt: "https://www.cnx-software.com/news/raspberry-pi/page/%d/",
        initialPages: 8,
        entryClass: "entry-title",
        entryImgClass: "attachment-post-thumbnail",
        contentClass: "entry-content",
        contentRemoveClass: ["saboxplugin-wrap", "wp-caption", "youtube-player"]
    }
/* 2023-11-15 : currently broken, 503 gateway error with _no_ timeout
    "makeuseof":
    {
        name: "makeuseof",
        url: "https://www.makeuseof.com/search/raspberry%20pi/",
        urlNextFmt: "https://www.makeuseof.com/search/raspberry%%20pi/%d/",
        initialPages: 3,
        entryClass: "bc-title-link",
        entryImgClass: "bc-img",
        contentClass: "article",
        contentRemoveClass: ["sidebar", "next-btn", "sharing", "letter-from"]
    }
*/
}
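A note on the `urlNextFmt` values: the doubled `%%` in the MakeUseOf entry suggests the string is expanded printf-style, so a literal `%20` has to be written as `%%20` while `%d` receives the page number. The sketch below shows that expansion under that assumption; `expandPageUrl` is a hypothetical stand-in, not the function the aggregator actually calls.

```javascript
// Hypothetical stand-in for a printf-style expansion of urlNextFmt.
// Handles only the two sequences used in the config: "%%" and "%d".
function expandPageUrl(fmt, page) {
    return fmt.replace(/%(%|d)/g, function (match, spec) {
        return spec === "%" ? "%" : String(page);
    });
}

var cnxFmt = "https://www.cnx-software.com/news/raspberry-pi/page/%d/";
var muoFmt = "https://www.makeuseof.com/search/raspberry%%20pi/%d/";

console.log(expandPageUrl(cnxFmt, 3));
// https://www.cnx-software.com/news/raspberry-pi/page/3/
console.log(expandPageUrl(muoFmt, 2));
// https://www.makeuseof.com/search/raspberry%20pi/2/
```

This is why the plain `url` field can keep a single `%20` while the format string cannot.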

I left the MakeUseOf section in, just in case they come back online again (or recover a backup of their hacked WordPress site maybe ???), as it might be adapted to Archive.Org for anyone that needs it.

That brings me to one final question: how do I extend the date range for inclusion of content? I set initialPages to 8, but the date the aggregator goes back to is only on about the 3rd-5th page (I think), roughly 2 weeks back to the 30th of last month (it being the 15th today).

BTW: for anyone that does not know, CNX-Software covers RaspberryPi-a-like SBCs and other small devices (like those based on the Espressif ESP SOCs & SOMs).

Cheers

Paul

paulwratt commented 7 months ago

It is possible that I changed something on the default search.html URL, as I (today) bumped the output from 12 items to 48 items, because I use that "landing page" in a similar way to a regular news feed on a phone (the search just allows me to look back over older content, unlike a news feed - yay for more useless applications :).

Whether that was part of one of the changes from when I was last here, I cannot remember, and I am missing the 2 most recent branches in my fork, so it might have been a year or 2 since I last checked up on Rampart.

paulwratt commented 7 months ago

[Screenshot: rampart-pi-news-pw]

There are 3 items above these that don't show their pictures, but this screenshot shows that some of the CNX-Software images do show up. Without looking directly into the database I can't confirm whether they are inline src images or URLs.

FWIW: yes, that is a scrot screenshot of Chromium 92 on 32bit Buster on RPi4, cropped with the menu version of imagemagick

aflin commented 7 months ago

This was meant to be a quick demo and a part of a tutorial. Since it is scraping other sites, it was bound to break eventually. To be production quality, it would need checks for broken sources and daily monitoring. But thanks for letting me know. I'll have a look at it and update it and the tutorial. I'll also check out the new source. Might take a week or two.
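As a rough illustration of what a check for broken sources could look like, here is a minimal sketch that splits a source table into healthy and broken entries before scraping. The `partitionSources` function and the `probe` callback are hypothetical: `probe(url)` is assumed to return an HTTP status code or throw on failure or timeout, and it is stubbed out here rather than tied to any real HTTP client.

```javascript
// Hypothetical check: split sources into those that respond normally and those that do not.
// `probe(url)` is an assumed callback returning an HTTP status code, or throwing on failure.
function partitionSources(sources, probe) {
    var healthy = {};
    var broken = {};
    Object.keys(sources).forEach(function (key) {
        var status;
        try {
            status = probe(sources[key].url);
        } catch (err) {
            status = 0; // a thrown error (e.g. a timeout) counts as unreachable
        }
        if (status >= 200 && status < 400) {
            healthy[key] = sources[key];
        } else {
            broken[key] = status;
        }
    });
    return { healthy: healthy, broken: broken };
}

// Usage with a stubbed probe that mimics the 503 reported above.
var sources = {
    cnxsoftware: { url: "https://www.cnx-software.com/news/raspberry-pi/" },
    makeuseof: { url: "https://www.makeuseof.com/search/raspberry%20pi/" }
};
function fakeProbe(url) {
    return url.indexOf("makeuseof") !== -1 ? 503 : 200;
}
var result = partitionSources(sources, fakeProbe);
console.log("scrape:", Object.keys(result.healthy), "skip:", Object.keys(result.broken));
```

Skipping a broken source at run time would avoid having to comment its entry out of the config by hand.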