Flutter-Bounty-Hunters / dart-rss

A dart package for parsing RSS & Atom feed
MIT License

Serialize RSS 1.0 documents #46

Open matthew-carroll opened 5 months ago

matthew-carroll commented 5 months ago

This package currently includes many data structures that are parsed from RSS (and RSS extension) XML. However, those structures only support parsing; there's no way to write them back out as XML.

Add serialization for RSS 1.0 documents (not RSS 2.0 or Atom).
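
For a rough idea of the shape this could take, here's a minimal sketch that writes just a channel using XmlBuilder from package:xml. ChannelSketch and serializeChannel are hypothetical names made up for illustration and aren't part of this package; the unprefixed resource attribute on rdf:li simply mirrors the example file in the test directory.

```dart
import 'package:xml/xml.dart';

/// Hypothetical minimal channel model, for illustration only.
class ChannelSketch {
  const ChannelSketch({
    required this.about,
    required this.title,
    required this.link,
    required this.description,
    this.itemUrls = const [],
  });

  final String about;
  final String title;
  final String link;
  final String description;
  final List<String> itemUrls;
}

/// Builds an RSS 1.0 document that contains only the <channel> element.
String serializeChannel(ChannelSketch channel) {
  final builder = XmlBuilder();
  builder.processing('xml', 'version="1.0"');
  builder.element(
    'rdf:RDF',
    attributes: {
      'xmlns:rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
      'xmlns': 'http://purl.org/rss/1.0/',
    },
    nest: () {
      builder.element(
        'channel',
        attributes: {'rdf:about': channel.about},
        nest: () {
          // Passing a String as `nest` writes it as escaped text.
          builder.element('title', nest: channel.title);
          builder.element('link', nest: channel.link);
          builder.element('description', nest: channel.description);
          // Item URLs are listed inside <items><rdf:Seq>.
          builder.element('items', nest: () {
            builder.element('rdf:Seq', nest: () {
              for (final url in channel.itemUrls) {
                builder.element('rdf:li', attributes: {'resource': url});
              }
            });
          });
        },
      );
    },
  );
  return builder.buildDocument().toXmlString(pretty: true);
}
```

A full serializer would also need to write the top-level image, item, and textinput elements, which depends on the data model capturing them in the first place.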

matthew-carroll commented 5 months ago

I think the existing RSS 1.0 data model is incorrect. Here's an RSS 1.0 basic example from the test directory:

<?xml version="1.0"?>

<rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns="http://purl.org/rss/1.0/"
>

    <channel rdf:about="http://www.xml.com/xml/news.rss">
        <title>XML.com</title>
        <link>http://xml.com/pub</link>
        <description>XML.com features a rich mix of information and services for the XML community.</description>

        <image rdf:resource="http://xml.com/universal/images/xml_tiny.gif"/>

        <items>
            <rdf:Seq>
                <rdf:li resource="http://xml.com/pub/2000/08/09/xslt/xslt.html"/>
                <rdf:li resource="http://xml.com/pub/2000/08/09/rdfdb/index.html"/>
            </rdf:Seq>
        </items>

        <textinput rdf:resource="http://search.xml.com"/>

    </channel>

    <image rdf:about="http://xml.com/universal/images/xml_tiny.gif">
        <title>XML.com</title>
        <link>http://www.xml.com</link>
        <url>http://xml.com/universal/images/xml_tiny.gif</url>
    </image>

    <item rdf:about="http://xml.com/pub/2000/08/09/xslt/xslt.html">
        <title>Processing Inclusions with XSLT</title>
        <link>http://xml.com/pub/2000/08/09/xslt/xslt.html</link>
        <description>Processing document inclusions with general XML tools can be problematic. This article proposes a way of preserving inclusion information through SAX-based processing.</description>
    </item>

    <item rdf:about="http://xml.com/pub/2000/08/09/rdfdb/index.html">
        <title>Putting RDF to Work</title>
        <link>http://xml.com/pub/2000/08/09/rdfdb/index.html</link>
        <description>
            Tool and API support for the Resource Description Framework
            is slowly coming of age. Edd Dumbill takes a look at RDFDB,
            one of the most exciting new RDF toolkits.
        </description>
    </item>

    <textinput rdf:about="http://search.xml.com">
        <title>Search XML.com</title>
        <description>Search XML.com's XML collection</description>
        <name>s</name>
        <link>http://search.xml.com</link>
    </textinput>

</rdf:RDF>

Here's the spec for RSS 1.0: https://validator.w3.org/feed/docs/rss1.html#s5.5

Yet, here's the property list from rss1_feed.dart:

  final String? title;
  final String? description;
  final String? link;
  final String? image;
  final List<Rss1Item> items;
  final UpdatePeriod? updatePeriod;
  final int? updateFrequency;
  final DateTime? updateBase;
  final DublinCore? dc;

The parsing behavior is as follows:

final document = XmlDocument.parse(xmlString);
XmlElement rdfElement;
try {
  rdfElement = document.findAllElements('rdf:RDF').first;
} on StateError {
  throw ArgumentError('channel not found');
}

final channel = rdfElement.findElements('channel');
return Rss1Feed(
  title: findElementOrNull(rdfElement, 'title')?.innerText,
  link: findElementOrNull(rdfElement, 'link')?.innerText,
  description: findElementOrNull(rdfElement, 'description')?.innerText,
  items: rdfElement.findElements('item').map((element) => Rss1Item.parse(element)).toList(),
  image: findElementOrNull(rdfElement, 'image')?.getAttribute('rdf:resource'),
  updatePeriod: _parseUpdatePeriod(
    findElementOrNull(rdfElement, 'sy:updatePeriod')?.innerText,
  ),
  updateFrequency: parseInt(
    findElementOrNull(rdfElement, 'sy:updateFrequency')?.innerText,
  ),
  updateBase: parseDateTime(
    findElementOrNull(rdfElement, 'sy:updateBase')?.innerText,
  ),
  dc: channel.isEmpty ? null : DublinCore.parse(rdfElement.findElements('channel').first),
);

We can see that this object parses the whole document, so it should capture enough information to recover the document, but it doesn't.

We can see that the parser pulls the title, description and link from the top-level RDF element, as it should.

We can see that the parser collects and parses all the top-level items within the RDF element, as it should.

However, the top-level image is reduced to a single attribute, even though that image can contain a title, link, and url, so we're losing information. Based on a quick check of the spec, it looks like this parser might be confusing two different image elements. The image element directly under the RDF element carries the title, link, and url, and that's the one we want. The image element under the channel element only points at it through an rdf:resource attribute. This parser reads the channel version, but it should read the one under the RDF element.
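
To make the distinction concrete, here's a minimal sketch of reading the top-level image instead of the channel's reference. Rss1ImageSketch and parseTopLevel are hypothetical names, not existing API in this package. The sketch relies on findElements from package:xml only looking at direct children, so the image nested inside channel is skipped.

```dart
import 'package:xml/xml.dart';

/// Hypothetical model for the top-level <image> element.
class Rss1ImageSketch {
  const Rss1ImageSketch({this.about, this.title, this.link, this.url});

  final String? about;
  final String? title;
  final String? link;
  final String? url;

  /// Reads the <image> that is a direct child of <rdf:RDF>, or returns
  /// null when the document has none.
  static Rss1ImageSketch? parseTopLevel(XmlElement rdfElement) {
    // Direct children only: the <image rdf:resource="..."/> inside
    // <channel> is not matched here.
    final images = rdfElement.findElements('image');
    if (images.isEmpty) return null;
    final image = images.first;

    // Returns the text of the first direct child with the given name.
    String? childText(String name) {
      final matches = image.findElements(name);
      return matches.isEmpty ? null : matches.first.innerText;
    }

    return Rss1ImageSketch(
      about: image.getAttribute('rdf:about'),
      title: childText('title'),
      link: childText('link'),
      url: childText('url'),
    );
  }
}
```

The channel could then keep only the rdf:resource URL as a reference, while the feed holds the full image object.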

Also, the textinput top-level element isn't parsed at all, despite being a part of the specification.
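
The same approach would cover textinput. Here's a minimal sketch, again with hypothetical names (Rss1TextInputSketch, parseTopLevel) that don't exist in this package, with fields taken from the child elements shown in the example above.

```dart
import 'package:xml/xml.dart';

/// Hypothetical model for the top-level <textinput> element.
class Rss1TextInputSketch {
  const Rss1TextInputSketch({
    this.about,
    this.title,
    this.description,
    this.name,
    this.link,
  });

  final String? about;
  final String? title;
  final String? description;
  final String? name;
  final String? link;

  /// Reads the <textinput> that is a direct child of <rdf:RDF>, or
  /// returns null when the document has none.
  static Rss1TextInputSketch? parseTopLevel(XmlElement rdfElement) {
    // Direct children only, so the reference inside <channel> is skipped.
    final inputs = rdfElement.findElements('textinput');
    if (inputs.isEmpty) return null;
    final input = inputs.first;

    // Returns the text of the first direct child with the given name.
    String? childText(String name) {
      final matches = input.findElements(name);
      return matches.isEmpty ? null : matches.first.innerText;
    }

    return Rss1TextInputSketch(
      about: input.getAttribute('rdf:about'),
      title: childText('title'),
      description: childText('description'),
      name: childText('name'),
      link: childText('link'),
    );
  }
}
```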

matthew-carroll commented 5 months ago

We need to fix the RSS 1.0 data model before serializing it. Blocked on #47