scraper post-process script for RSSGuard ( https://github.com/martinrotter/rssguard )
1) item
2) item title (optional, else the link's text is used as the title)
3) item description (optional, else all the text from the item is used as the description)
4) item link (optional, else the 1st link found in the item is used (or the item itself if it's a link))
5) item title 2nd part (optional, or auto-enabled if the static main title or multilink option is used; else just the title is used), e.g. title is "Batman" and 2nd part is "chapter 94"
6) item date (optional, else it'd all be "just now") - aim this selector either at text nodes (e.g. `span`) or at elements (`a`, `img`) with a `title` or `alt` containing the date (e.g. "New!" flashing image badges that show the date when you hover over them)
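
The fallbacks in the argument list above can be illustrated with a small dependency-free sketch. It is not the script's own code: the script works with Beautiful Soup and CSS selectors, while this example uses the standard library's `xml.etree.ElementTree` on a made-up, well-formed snippet, and the names `default_fields` and `date_text` are hypothetical.

```python
import xml.etree.ElementTree as ET


def default_fields(item):
    """Apply the documented fallbacks to one already-found item element."""
    # Link: the item itself if it is a link, else the 1st link inside it.
    link = item if item.tag == "a" else item.find(".//a")
    # Title: the link's text.
    title = "".join(link.itertext()).strip() if link is not None else ""
    # Description: all the text inside the item, with whitespace collapsed.
    description = " ".join("".join(item.itertext()).split())
    return {
        "link": link.get("href") if link is not None else None,
        "title": title,
        "description": description,
    }


def date_text(node):
    """Date source: the node's own text, else its title or alt attribute."""
    text = "".join(node.itertext()).strip()
    return text or node.get("title") or node.get("alt") or ""


# Made-up markup: one item with a link, a text node and an image badge.
snippet = ET.fromstring(
    '<div class="item">'
    '<a href="/batman/94">Batman</a> '
    '<span class="desc">chapter 94</span>'
    '<img class="badge" alt="2024-01-31" />'
    '</div>'
)
fields = default_fields(snippet)
badge_date = date_text(snippet.find(".//img"))
print(fields, badge_date)
```

Here only the item is given, so the title, description and link all come from the fallbacks, and the date is read from the badge's `alt` attribute because the image has no text of its own.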
`1) item` - `@` at start - enables searching for multiple links inside the found item, e.g. there is one `div` item with multiple `a` links inside it and you want them as separate feed items
`1) item` - `~` as the whole argument - lets the script decide what to do (the default action), e.g. use the 1st link found inside the item, use the whole text inside the item as the description, etc. (not actually an option, but rather a format for the argument line), e.g. `python css2rss.py div.itemclass ~ span.description` (here the link's inner text will be used as the title by the default action (2nd argument), while the description is looked up by its selector (3rd argument))
`2) title`, `5) item title 2nd part` and `3) item description` - `!` at start - makes it a static specified value (whatever follows the `!`), e.g. `"!my title"`; if you make the 1st part of the title fixed, the 2nd part title addon gets auto-enabled and uses the text inside the found link as the 2nd part (unless you specify what to use manually as the 5th argument)
`2) title`, `5) item title 2nd part` - `$` at start - executes a Python code expression instead of using CSS selectors; it uses the found item link as a starting point and takes `text` from it: `eval("tLink."+your_inputted_argument).text`, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/ for things you can do with it, e.g. go one level up (to the parent element) or to the next element, or select elements CSS selectors can't select, see example below
`6) date` - `?` at start - tells the parser that you're expecting the American date format, "Month/Day/Year"
`1) item` is searched in the whole document and the rest is searched inside the `item` document node (but you can make the `item` point right at the `a` hyperlink - it will be used by default)
use a space ` ` as the separator for arguments if they contain no spaces themselves, else (if they do) also enclose such arguments in double quotation marks `"`, e.g. `python css2rss.py div.class "div.subclass > h1.title" span.description` (btw, you can also enclose arguments without any spaces in quotation marks if you'd like)
**Warning**: starting from RSSGuard v4.5.2, which supports single quotation marks `'` as well, you have to either use single quotation marks `'` instead to enclose arguments and pass them as is, or escape backslashes and double quotes with backslashes, e.g. `python css2rss.py "\:argument starting with\:"` or `python css2rss.py '\:argument starting with\:'`
if no item is found, a feed item is generated with the HTML dump of the whole page so you can see what went wrong (e.g. a Cloudflare block page)
content you need to log-in first to see is available
only what is in the page source (right click -> view page source) would get scraped.
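
The option prefixes described earlier (`@`, `~`, `!`, `$`, `?`) can be read as a small classifier over the positional arguments. The following is a rough sketch of that reading, not the script's actual parsing code: the function name `parse_argument` and the returned labels are hypothetical, and restricting each prefix to the argument positions listed above is an assumption (the real script may be more lenient).

```python
def parse_argument(position, raw):
    """Classify one command-line argument by its option prefix.

    position is 1-based and matches the numbered argument list
    (1 item, 2 title, 3 description, 4 link, 5 title 2nd part, 6 date).
    """
    if raw == "~":
        # Whole argument is "~": let the default action decide.
        return ("default", None)
    if raw.startswith("@") and position == 1:
        # Multilink: every link inside the item becomes a feed item.
        return ("multilink", raw[1:])
    if raw.startswith("!") and position in (2, 3, 5):
        # Static value: use the text after "!" as is.
        return ("static", raw[1:])
    if raw.startswith("$") and position in (2, 5):
        # Python expression evaluated relative to the found link.
        return ("expression", raw[1:])
    if raw.startswith("?") and position == 6:
        # Date is expected in Month/Day/Year order.
        return ("american_date", raw[1:])
    # Anything else is treated as a plain CSS selector.
    return ("selector", raw)


# Arguments taken from one of the examples in this document.
example = [".eph-num > a", "!Sokushi Cheat", ".chapterdate", "~", ".chapternum"]
parsed = [parse_argument(i, raw) for i, raw in enumerate(example, start=1)]
for kind, value in parsed:
    print(kind, value)
```

In this example the title is static, the link is left to the default action, and the remaining arguments are ordinary selectors.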
1) Have Python 3 or newer ( https://www.python.org/downloads/ ) installed (and added to PATH during install)
1.2. Have Beautiful Soup ( https://www.crummy.com/software/BeautifulSoup/ ) installed (Win+R -> cmd -> enter -> `pip install beautifulsoup4`)
1.3. (optional) If you'd like to parse dates for articles, have Maya ( https://github.com/timofurrer/maya/ ) installed (right click the Start menu -> run PowerShell as administrator -> cmd -> `pip install maya`)
3) Put css2rss.py into your `data4` folder (so you can call the script with just `python css2rss.py`, else you'd need to specify the full path to the `.py` file)
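
To confirm that the packages from the steps above are visible to the Python that will run the script, a quick check along these lines may help. It is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and it assumes the usual import names `bs4` (for beautifulsoup4) and `maya`.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# bs4 is the import name of the beautifulsoup4 package; maya is optional.
required = missing_modules(["bs4"])
optional = missing_modules(["maya"])

if required:
    print("missing required:", ", ".join(required), "- try: pip install beautifulsoup4")
if optional:
    print("missing optional (date parsing):", ", ".join(optional), "- try: pip install maya")
if not required and not optional:
    print("all dependencies found")
```

Run it with the same `python` command you use for css2rss.py, since a package installed for a different interpreter will not be found.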
url: https://www.foxnews.com/media
script: python css2rss.py ".title > a"
(link `a` right inside an element with the `title` class)
url: https://kumascans.com/manga/sokushi-cheat-ga-saikyou-sugite-isekai-no-yatsura-ga-marude-aite-ni-naranai-n-desu-ga/
script: python css2rss.py ".eph-num > a" "!Sokushi Cheat" ".chapterdate" ~ ".chapternum"
url: https://www.asurascans.com/
script: python css2rss.py "@.uta" "h4" img "li > a" "li > a"
url: https://reaperscans.com/
script: python css2rss.py "@div.space-y-4:first-of-type div.relative.bg-white" "p.font-medium" "img" "a.border" "$contents[0]"
url: https://reader.kireicake.com/
script: python css2rss.py @.group a[href*='/series/'] .meta_r ".element > .title a" ".element > .title a"
the date selector in the next example looks for an `a` element (the "New!" badge) with the date inside its tooltip (`title` or `alt`) OR for a `span` element without any child nodes (both these elements are of class `.post-on`)
url: https://drakescans.com/
script: python css2rss.py "@.page-item-detail" ".post-title a" "img" "span.chapter > a" ~ ".post-on > a,.post-on:not(:has(*))"
url: https://manhuaus.com/?s=Wo+Wei+Xie+Di&post_type=wp-manga&post_type=wp-manga
script: python css2rss.py ".latest-chap a" "!I'm an Evil God"