Owyn / CSS2RSS

Scraper script for RSSGuard that builds an RSS feed for any website using CSS selectors

CSS2RSS

Scraper post-process script for RSSGuard ( https://github.com/martinrotter/rssguard )

Arguments - each is a CSS selector ( https://www.w3schools.com/cssref/css_selectors.asp ):

1) item
2) item title (optional; defaults to the link's text)
3) item description (optional; defaults to all the text in the item)
4) item link (optional; defaults to the first link found in the item, or the item itself if it is a link)
5) item title 2nd part (optional, or used when the static main title \ multilink option is enabled; otherwise only the title is used, e.g. the title is "Batman" and the 2nd part is "chapter 94")
6) item date (optional; otherwise every item is dated "just now"). Aim this selector either at elements whose text is the date (e.g. span) or at elements (a, img) whose title or alt attribute contains the date (e.g. flashing "New!" image badges that show the date when you hover over them)
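The fallbacks for arguments 2 to 4 can be illustrated with a small standard-library sketch. This is not the code from css2rss.py: it uses `xml.etree.ElementTree` on a well-formed snippet instead of Beautiful Soup and CSS selectors, and the function name `item_defaults` and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

def item_defaults(item):
    """Apply the documented fallbacks when only the item selector is given."""
    # Link: the item itself if it is a link, otherwise the first link inside it.
    link_el = item if item.tag == "a" else item.find(".//a")
    link = link_el.get("href") if link_el is not None else None
    # Title: the text of that link.
    title = "".join(link_el.itertext()).strip() if link_el is not None else ""
    # Description: all the text in the item, with whitespace collapsed.
    description = " ".join("".join(item.itertext()).split())
    return {"title": title, "description": description, "link": link}

# Hypothetical markup for one matched item.
snippet = '<li class="entry"><a href="/batman/94">Batman</a> <span>chapter 94</span></li>'
result = item_defaults(ET.fromstring(snippet))
print(result)
```

Passing a title, description, or link selector would replace the corresponding fallback with the text or href of whatever that selector matches inside the item.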

Options for arguments:

Notes:

Limitations:

Installation

1) Have Python 3 or newer ( https://www.python.org/downloads/ ) installed (and added to PATH during installation)

1.2. Have Beautiful Soup ( https://www.crummy.com/software/BeautifulSoup/ ) installed (Win+R -> cmd -> Enter -> `pip install beautifulsoup4`)  
1.3. (optional) To parse dates for articles, have Maya ( https://github.com/timofurrer/maya/ ) installed (right-click the Start menu -> run PowerShell as administrator -> cmd -> `pip install maya`)  

3) Put css2rss.py into your data4 folder (so you can call the script with just `python css2rss.py`; otherwise you'd need to specify the full path to the .py file)
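To confirm the requirements above are in place, a short check like the following can be run with the same Python that RSSGuard will call. It is only a sketch: the helper name `check_requirements` is hypothetical, and it only tests that the packages can be found under their import names (`bs4` for beautifulsoup4, `maya` for Maya), not that they work.

```python
import importlib.util
import sys

def check_requirements():
    """Report whether Python 3 and the two packages are available."""
    return {
        "python3": sys.version_info >= (3,),
        # beautifulsoup4 is imported under the module name "bs4".
        "beautifulsoup4": importlib.util.find_spec("bs4") is not None,
        # Maya is optional; it is only needed for parsing item dates.
        "maya": importlib.util.find_spec("maya") is not None,
    }

for name, found in check_requirements().items():
    print(f"{name}: {'ok' if found else 'missing'}")
```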


Examples

*

url: https://www.foxnews.com/media
script: python css2rss.py ".title > a" (a link directly inside an element with the title class)

*

url: https://kumascans.com/manga/sokushi-cheat-ga-saikyou-sugite-isekai-no-yatsura-ga-marude-aite-ni-naranai-n-desu-ga/
script: python css2rss.py ".eph-num > a" "!Sokushi Cheat" ".chapterdate" ~ ".chapternum"

*

url: https://www.asurascans.com/
script: python css2rss.py "@.uta" "h4" img "li > a" "li > a"

*

url: https://reaperscans.com/
script: python css2rss.py "@div.space-y-4:first-of-type div.relative.bg-white" "p.font-medium" "img" "a.border" "$contents[0]"


*

url: https://reader.kireicake.com/
script: python css2rss.py @.group a[href*='/series/'] .meta_r ".element > .title a" ".element > .title a"

*

url: https://drakescans.com/
script: python css2rss.py "@.page-item-detail" ".post-title a" "img" "span.chapter > a" ~ ".post-on > a,.post-on:not(:has(*))"

*

url: https://manhuaus.com/?s=Wo+Wei+Xie+Di&post_type=wp-manga&post_type=wp-manga
script: python css2rss.py ".latest-chap a" "!I'm an Evil God"
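Because the arguments are positional, it can help to see which role each selector in a script line falls into. The sketch below splits the kumascans example with Python's `shlex` and pairs the pieces with the argument order from the Arguments section. It assumes shell-style quoting, which may not match exactly how RSSGuard tokenizes the line, and `label_arguments` is a hypothetical helper, not part of css2rss.py.

```python
import shlex

# Argument roles in the order the Arguments section lists them.
ROLES = ["item", "item title", "item description", "item link",
         "item title 2nd part", "item date"]

def label_arguments(script_line):
    """Split a script line and pair each argument with its positional role."""
    tokens = shlex.split(script_line)
    selectors = tokens[2:]  # drop "python" and "css2rss.py"
    return list(zip(ROLES, selectors))

line = 'python css2rss.py ".eph-num > a" "!Sokushi Cheat" ".chapterdate" ~ ".chapternum"'
for role, selector in label_arguments(line):
    print(f"{role}: {selector}")
```

The `!` prefix and the bare `~` are passed through unchanged here; their meaning belongs to the options for arguments, and this sketch does not interpret them.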