extractus / feed-extractor

Simplest way to read & normalize RSS/ATOM/JSON feed data
https://extractor-demos.pages.dev/feed-extractor
MIT License

parserOption to keep HTML Entities #140

Closed: JpEncausse closed this issue 3 weeks ago

JpEncausse commented 1 month ago

Hello, I'm parsing RSS feeds from a TheOldReader shared feed. It seems they rewrite the RSS entries and rebuild the RSS items, but the HTML content is stored as encoded text inside the <description> tags:

An item from Ars Technica:

<description>&lt;div&gt;
&lt;figure&gt;
  &lt;img src="https://cdn.arstechnica.net/wp-content/uploads/2024/04/image-2-800x533.jpeg" alt="Image of a chip with a device on it that is shaped like two triangles connected by a bar."&gt;
      &lt;p&gt;&lt;a href="https://cdn.arstechnica.net/wp-content/uploads/2024/04/image-2-scaled.jpeg"&gt;Enlarge&lt;/a&gt; / Quantinuum's H2 "racetrack" quantum processor. (credit: Quantinuum)&lt;/p&gt;  &lt;/figure&gt;
&lt;div&gt;&lt;a&gt;&lt;/a&gt;&lt;/div&gt;
&lt;p&gt;On Tuesday, Microsoft made a series of announcements related to its Azure Quantum Cloud service. Among them was a demonstration of logical operations using the largest number of error-corrected qubits yet.&lt;/p&gt;
&lt;p&gt;"&lt;a href="https://arstechnica.com/science/2024/04/quantum-error-correction-used-to-actually-correct-errors/"&gt;Since April&lt;/a&gt;, we've tripled the number of logical qubits here," said Microsoft Technical Fellow Krysta Svore. "So we are accelerating toward that hundred-logical-qubit capability." The company has also lined up a new partner in the form of Atom Computing, which uses neutral atoms to hold qubits and has already demonstrated hardware with over 1,000 hardware qubits.&lt;/p&gt;
&lt;p&gt;Collectively, the announcements are the latest sign that quantum computing has emerged from its infancy and is rapidly progressing toward the development of systems that can reliably perform calculations that would be impractical or impossible to run on classical hardware. We talked with people at Microsoft and some of its hardware partners to get a sense of what's coming next to bring us closer to useful quantum computing.&lt;/p&gt;
&lt;/div&gt;&lt;p&gt;&lt;a href="https://arstechnica.com/?p=2048754#p3"&gt;Read 20 remaining paragraphs&lt;/a&gt; | &lt;a href="https://arstechnica.com/?p=2048754&amp;amp;comments=1"&gt;Comments&lt;/a&gt;&lt;/p&gt;</description>

When I decode the RSS item with FeedExtractor, all the HTML entities in the description field are stripped. What option should I set to avoid stripping the HTML entities?

I tried normalization = false, but it fails and parses nothing. Thanks
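
For context, those entities decode back to plain HTML. A minimal sketch with the he library (the same one used in the workaround below); the sample string is just a fragment of the description above:

import he from 'he'

// A fragment of the encoded <description> content shown above
const encoded = '&lt;p&gt;On Tuesday, Microsoft made a series of announcements related to its Azure Quantum Cloud service.&lt;/p&gt;'

// he.decode() turns the HTML entities back into markup
console.log(he.decode(encoded))
// -> <p>On Tuesday, Microsoft made a series of announcements related to its Azure Quantum Cloud service.</p>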

ndaidong commented 1 month ago

Can you share your code? It works for me when I disable normalization. For more complicated use cases, you can use getExtraEntryFields to get exactly what you need.

[Screenshot from 2024-09-11 23-20-04]
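
For reference, a minimal sketch of the normalization switch; the shape of the un-normalized result mirrors the feed's own XML structure, so any property path into it is an assumption:

import { extract } from '@extractus/feed-extractor'

// With normalization disabled, extract() returns the feed data as parsed,
// keeping the raw fields (including the full decoded description)
const raw = await extract('https://theoldreader.com/profile/JpEncausse.rss', {
  normalization: false,
})
console.log(raw)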

JpEncausse commented 1 month ago

My TheOldReader feed is here: https://theoldreader.com/profile/JpEncausse.rss

The code:

msg.overwatch.feed.options = { 
    'descriptionMaxLen': 0,
    'normalization': true,
    'xmlParserOptions': { 
        processEntities: false, 
        htmlEntities: false,

        // UGLY HACK: mask '&' in <description> so the XML parser cannot
        // touch the entities; they are restored and decoded after parsing
        tagValueProcessor: (tagName, tagValue, jPath, hasAttributes, isLeafNode) => {
            if (tagName == 'description') { return tagValue.replace(/&/g, '§') }
            return tagValue;
        }
    }
}

if (source.ExtraField){
    let extrafields = source.ExtraField.split(',')
    msg.overwatch.feed.options.getExtraEntryFields = (entry) => {
        let extraContent = {}
        for (let extra of extrafields){
            extra = extra.trim();
            extraContent[extra] = entry[extra]
        }
        return extraContent
    }
}

try {
    msg.payload = await FeedExtractor.extract(source.FeedURL, msg.overwatch.feed.options,  {
        'headers': { 'User-Agent': 'Mozilla/5.0 (compatible; Node-RED; AI)' }
    })

    // UGLY HACK: undo the '§' masking, then decode the entities with he
    if (msg.payload.entries){
        for (let e of msg.payload.entries){
            if (e.description) { 
                e.description = he.decode(e.description.replace(/§/g,'&'))
            }
        }
    }

} catch(ex){
    msg.overwatch.feed.lastError = "Feed Parsing Exception: " + ex
    node.warn({msg : "Feed Parsing Exception", ex})
    return [msg, undefined];
}

ndaidong commented 1 month ago

Sorry, I was offline for a few days.

This feed's content is quite standard: with normalization disabled, the full decoded description is returned as well. However, because this feed behaves differently from the rest of the 100 feeds you are working with, you still need some way to choose the right function to process it:

if (isTheOldReader(feedUrl)) {
  doSpecialFeedHandler(feedUrl)
} else {
  doRegularFeedHandler(feedUrl)
}

Here is my tested solution that you can adapt to your workflow:

import { extract } from '@extractus/feed-extractor'

const isTheOldReader = (rss) => {
  return rss.startsWith('https://theoldreader.com/')
}

const doRegularFeedHandler = (rss) => {
  return extract(rss)
}

const doSpecialFeedHandler = async (rss) => {
  // getExtraEntryFields receives each raw feed entry, so the decoded
  // description can be carried over into the normalized result
  const result = await extract(rss, {
    getExtraEntryFields: (feedEntry) => {
      const { description } = feedEntry
      return {
        description,
      }
    },
  })
  return result
}

const runParse = async (rss) => {
  return isTheOldReader(rss) ? doSpecialFeedHandler(rss) : doRegularFeedHandler(rss)
}

const feedUrl = 'https://theoldreader.com/profile/JpEncausse.rss'

const data = await runParse(feedUrl)
console.log(data)

[Screenshot from 2024-09-16 15-03-20]
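
For the Node-RED function node shown earlier, the same approach replaces the § hack entirely. A sketch, assuming the msg.overwatch.feed.options object, FeedExtractor, and source.FeedURL from that snippet:

msg.overwatch.feed.options = {
    'descriptionMaxLen': 0,
    // Carry the raw decoded description through normalization,
    // instead of masking '&' and re-decoding with he afterwards
    'getExtraEntryFields': (feedEntry) => {
        return { description: feedEntry.description }
    }
}

msg.payload = await FeedExtractor.extract(source.FeedURL, msg.overwatch.feed.options, {
    'headers': { 'User-Agent': 'Mozilla/5.0 (compatible; Node-RED; AI)' }
})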