Flutter-Bounty-Hunters / dart-rss

A Dart package for parsing RSS & Atom feeds
MIT License

Fix RSS 1.0 data modeling #47

Open matthew-carroll opened 10 months ago

matthew-carroll commented 10 months ago

While working on RSS 1.0 serialization, I discovered that the data model seems to be wrong and incomplete. We should fix the data model so that it captures all information from an RSS 1.0 document.

Here's a copy of what I found while working on serialization:

I think the existing RSS 1.0 data model is incorrect. Here's an RSS 1.0 basic example from the test directory:

<?xml version="1.0"?>

<rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns="http://purl.org/rss/1.0/"
>

    <channel rdf:about="http://www.xml.com/xml/news.rss">
        <title>XML.com</title>
        <link>http://xml.com/pub</link>
        <description>XML.com features a rich mix of information and services for the XML community.</description>

        <image rdf:resource="http://xml.com/universal/images/xml_tiny.gif"/>

        <items>
            <rdf:Seq>
                <rdf:li resource="http://xml.com/pub/2000/08/09/xslt/xslt.html"/>
                <rdf:li resource="http://xml.com/pub/2000/08/09/rdfdb/index.html"/>
            </rdf:Seq>
        </items>

        <textinput rdf:resource="http://search.xml.com"/>

    </channel>

    <image rdf:about="http://xml.com/universal/images/xml_tiny.gif">
        <title>XML.com</title>
        <link>http://www.xml.com</link>
        <url>http://xml.com/universal/images/xml_tiny.gif</url>
    </image>

    <item rdf:about="http://xml.com/pub/2000/08/09/xslt/xslt.html">
        <title>Processing Inclusions with XSLT</title>
        <link>http://xml.com/pub/2000/08/09/xslt/xslt.html</link>
        <description>Processing document inclusions with general XML tools can be problematic. This article proposes a way of preserving inclusion information through SAX-based processing.</description>
    </item>

    <item rdf:about="http://xml.com/pub/2000/08/09/rdfdb/index.html">
        <title>Putting RDF to Work</title>
        <link>http://xml.com/pub/2000/08/09/rdfdb/index.html</link>
        <description>
            Tool and API support for the Resource Description Framework
            is slowly coming of age. Edd Dumbill takes a look at RDFDB,
            one of the most exciting new RDF toolkits.
        </description>
    </item>

    <textinput rdf:about="http://search.xml.com">
        <title>Search XML.com</title>
        <description>Search XML.com's XML collection</description>
        <name>s</name>
        <link>http://search.xml.com</link>
    </textinput>

</rdf:RDF>

Here's the spec for RSS 1.0: https://validator.w3.org/feed/docs/rss1.html#s5.5

Yet, here's the property list from rss1_feed.dart:

  final String? title;
  final String? description;
  final String? link;
  final String? image;
  final List<Rss1Item> items;
  final UpdatePeriod? updatePeriod;
  final int? updateFrequency;
  final DateTime? updateBase;
  final DublinCore? dc;

The parsing behavior is as follows:

    final document = XmlDocument.parse(xmlString);
    XmlElement rdfElement;
    try {
      rdfElement = document.findAllElements('rdf:RDF').first;
    } on StateError {
      throw ArgumentError('channel not found');
    }

    final channel = rdfElement.findElements('channel');
    return Rss1Feed(
      title: findElementOrNull(rdfElement, 'title')?.innerText,
      link: findElementOrNull(rdfElement, 'link')?.innerText,
      description: findElementOrNull(rdfElement, 'description')?.innerText,
      items: rdfElement.findElements('item').map((element) => Rss1Item.parse(element)).toList(),
      image: findElementOrNull(rdfElement, 'image')?.getAttribute('rdf:resource'),
      updatePeriod: _parseUpdatePeriod(
        findElementOrNull(rdfElement, 'sy:updatePeriod')?.innerText,
      ),
      updateFrequency: parseInt(
        findElementOrNull(rdfElement, 'sy:updateFrequency')?.innerText,
      ),
      updateBase: parseDateTime(
        findElementOrNull(rdfElement, 'sy:updateBase')?.innerText,
      ),
      dc: channel.isEmpty ? null : DublinCore.parse(rdfElement.findElements('channel').first),
    );

We can see that this object parses the whole document, so it should capture enough information to recover the document, but it doesn't.

We can see that the parser pulls the title, description and link from the top-level RDF element, as it should.

We can see that the parser collects and parses all the top-level items within the RDF element, as it should.

However, the top-level image is reduced to a single attribute, even though the image can contain a title, link, and url, so we seem to be losing information. Based on a quick check of the spec, it looks like this parser might be confusing two different image elements. There's an image element directly under the RDF element, which carries the title, link, and url, and which is the one we want. Then there's an image element under the channel element, which only points at the top-level image through its rdf:resource attribute. This parser reads the image as if it were the channel-level reference, but it should read the element that sits directly under the RDF element.
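
To make the distinction concrete, here's a rough sketch of how the top-level image could be modeled and parsed. The `Rss1Image` class and `parseTopLevelImage` function are hypothetical names for illustration, not something that exists in the package today. The sketch assumes `package:xml` and relies on `findElements`, which only looks at direct children, so the `<image>` reference inside `<channel>` is skipped.

```dart
import 'package:xml/xml.dart';

/// Hypothetical model for the top-level <image> element (not the package's API).
class Rss1Image {
  const Rss1Image({this.about, this.title, this.link, this.url});

  final String? about;
  final String? title;
  final String? link;
  final String? url;

  /// Reads the rdf:about attribute and the title, link, and url children.
  static Rss1Image parse(XmlElement element) {
    String? childText(String name) {
      final matches = element.findElements(name);
      return matches.isEmpty ? null : matches.first.innerText;
    }

    return Rss1Image(
      about: element.getAttribute('rdf:about'),
      title: childText('title'),
      link: childText('link'),
      url: childText('url'),
    );
  }
}

/// Returns the <image> that is a direct child of <rdf:RDF>, or null.
/// The <image rdf:resource="..."/> reference under <channel> is not a
/// direct child of the RDF element, so findElements does not match it.
Rss1Image? parseTopLevelImage(XmlElement rdfElement) {
  final images = rdfElement.findElements('image');
  return images.isEmpty ? null : Rss1Image.parse(images.first);
}

void main() {
  const xml = '''
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.xml.com/xml/news.rss">
    <image rdf:resource="http://xml.com/universal/images/xml_tiny.gif"/>
  </channel>
  <image rdf:about="http://xml.com/universal/images/xml_tiny.gif">
    <title>XML.com</title>
    <link>http://www.xml.com</link>
    <url>http://xml.com/universal/images/xml_tiny.gif</url>
  </image>
</rdf:RDF>''';

  final rdfElement = XmlDocument.parse(xml).rootElement;
  final image = parseTopLevelImage(rdfElement);
  print('${image?.title} ${image?.link} ${image?.url}');
}
```

With something like this, the feed could keep both pieces: the channel's rdf:resource reference and the full top-level image it points to.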

Also, the textinput top-level element isn't parsed at all, despite being a part of the specification.
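
For textinput, a minimal sketch of what capturing it might look like is below. `Rss1TextInput` is a hypothetical class name, and the field list is taken from the example document above (title, description, name, link, plus the rdf:about attribute). It assumes `package:xml`.

```dart
import 'package:xml/xml.dart';

/// Hypothetical model for the top-level <textinput> element.
class Rss1TextInput {
  const Rss1TextInput({
    this.about,
    this.title,
    this.description,
    this.name,
    this.link,
  });

  final String? about;
  final String? title;
  final String? description;
  final String? name;
  final String? link;

  /// Finds the <textinput> that is a direct child of <rdf:RDF> and reads
  /// its fields. Returns null when the document has no such element.
  static Rss1TextInput? parse(XmlElement rdfElement) {
    final inputs = rdfElement.findElements('textinput');
    if (inputs.isEmpty) return null;
    final element = inputs.first;

    String? childText(String childName) {
      final matches = element.findElements(childName);
      return matches.isEmpty ? null : matches.first.innerText;
    }

    return Rss1TextInput(
      about: element.getAttribute('rdf:about'),
      title: childText('title'),
      description: childText('description'),
      name: childText('name'),
      link: childText('link'),
    );
  }
}

void main() {
  const xml = '''
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.xml.com/xml/news.rss">
    <textinput rdf:resource="http://search.xml.com"/>
  </channel>
  <textinput rdf:about="http://search.xml.com">
    <title>Search XML.com</title>
    <description>Search XML.com's XML collection</description>
    <name>s</name>
    <link>http://search.xml.com</link>
  </textinput>
</rdf:RDF>''';

  final rdfElement = XmlDocument.parse(xml).rootElement;
  final textInput = Rss1TextInput.parse(rdfElement);
  print('${textInput?.title} (${textInput?.name}) -> ${textInput?.link}');
}
```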

toseefkhan403 commented 8 months ago

Hi @matthew-carroll, can I take up this issue?

matthew-carroll commented 8 months ago

@toseefkhan403 I'm already working on it. If you also need this work to be done, please be sure to describe the situation you're facing and why this change would be useful for you.