Closed JpEncausse closed 3 weeks ago
Can you share your code? It works for me when I disable normalization.
For more complicated use cases, you can use getExtraEntryFields
to extract exactly what you need.
My TheOldReader feed is here : https://theoldreader.com/profile/JpEncausse.rss
The code :
msg.overwatch.feed.options = {
  'descriptionMaxLen': 0,
  'normalization': true,
  'xmlParserOptions': {
    processEntities: false,
    htmlEntities: false,
    // UGLY HACK ------------------------------------------
    tagValueProcessor: (tagName, tagValue, jPath, hasAttributes, isLeafNode) => {
      if (tagName == 'description') { return tagValue.replace(/&/g, '§') }
      return tagValue;
    }
  }
}
if (source.ExtraField) {
  let extrafields = source.ExtraField.split(',')
  msg.overwatch.feed.options.getExtraEntryFields = (entry) => {
    let extraContent = {}
    for (let extra of extrafields) {
      extra = extra.trim();
      extraContent[extra] = entry[extra]
    }
    return extraContent
  }
}
try {
  msg.payload = await FeedExtractor.extract(source.FeedURL, msg.overwatch.feed.options, {
    'headers': { 'User-Agent': 'Mozilla/5.0 (compatible; Node-RED; AI)' }
  })
  // UGLY HACK ------------------------------------------
  if (msg.payload.entries) {
    for (let e of msg.payload.entries) {
      if (e.description) {
        e.description = he.decode(e.description.replace(/§/g, '&'))
      }
    }
  }
} catch (ex) {
  msg.overwatch.feed.lastError = "Feed Parsing Exception: " + ex
  node.warn({ msg: "Feed Parsing Exception", ex })
  return [msg, undefined];
}
Sorry that I was offline for a few days.
This feed's content is quite standard; with normalization
disabled, the full decoded description is returned as well.
However, because this feed behaves differently from the others in the list of 100 feeds you are working with, you still need some way to choose the right function to process it:
if (isTheOldReader(feedUrl)) {
  doSpecialFeedHandler(feedUrl)
} else {
  doRegularFeedHandler(feedUrl)
}
Here is my tested solution, which you can adapt to your workflow:
import { extract } from '@extractus/feed-extractor'
const isTheOldReader = (rss) => {
  return rss.startsWith('https://theoldreader.com/')
}

const doRegularFeedHandler = (rss) => {
  return extract(rss)
}

const doSpecialFeedHandler = async (rss) => {
  const result = await extract(rss, {
    getExtraEntryFields: (feedEntry) => {
      const { description } = feedEntry
      return {
        description,
      }
    },
  })
  return result
}

const runParse = async (rss) => {
  return isTheOldReader(rss) ? doSpecialFeedHandler(rss) : doRegularFeedHandler(rss)
}
const feedUrl = 'https://theoldreader.com/profile/JpEncausse.rss'
const data = await runParse(feedUrl)
console.log(data)
Hello, I'm parsing RSS feeds from a TheOldReader shared feed. It seems they rewrite the RSS entries and rebuild the RSS items, encoding the HTML content stored in the
<description>
tags as HTML entities. An item from ArsTechnica:
When I decode the RSS item with FeedParser, all the HTML entities in the description field are stripped. What option should I set to avoid stripping the HTML entities?
I tried
normalization = false
but it fails and parses nothing. Thanks