alvra / django-spotnet

A Django app to manage and download posts from Spotnet.
GNU General Public License v3.0

Skipped spots because of invalid xml data #4

Open hagst opened 11 years ago

hagst commented 11 years ago

Spots with a bare & between <Description> and </Description> in the header, not wrapped in <![CDATA[ ]]>, are skipped.

message-ID: NNTP-Posting-Date: Thu, 15 Nov 2012 08:47:33 UTC

Skipped invalid post <>: Post has invalid XML data for header X-XML: not well-formed (invalid token): line 1, column 238. The problem is &.

Another spot with the same problem: message-ID: Date: 17 Nov 2012 16:41:49 GMT

Skipped invalid post <>: Post has invalid XML data for header X-XML: not well-formed (invalid token): line 1, column 102. The problem is &.
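For anyone trying to reproduce this, here is a minimal sketch of how the reported column can be turned into a snippet of the offending header. It assumes the header is parsed with the standard library's `xml.etree.ElementTree`, that the X-XML value is a single line of ASCII text, and `describe_xml_error` is a hypothetical helper name, not something from django-spotnet.

```python
import xml.etree.ElementTree as ET


def describe_xml_error(xml_text, context=15):
    """Return (error message, text around the reported column), or None if
    the text parses cleanly. Hypothetical helper, for debugging only."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple reported by the parser.
        # Assumption: the header is one line, so only the column matters.
        _line, column = err.position
        start = max(column - context, 0)
        return str(err), xml_text[start:column + context]
    return None


sample = (
    "<Spotnet><Posting><Description>Tom & Jerry</Description>"
    "</Posting></Spotnet>"
)
result = describe_xml_error(sample)
if result is not None:
    message, snippet = result
    print(message)
    print(snippet)
```

The snippet is a window around the reported column rather than a single character, since the parser may point at the character just after the `&` instead of at the `&` itself.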

alvra commented 11 years ago

Thanks for the report, not sure if this is a bug in spotnet itself (as opposed to the nntp post xml) but it seems fixable so I'll have a look one of these days.

Feel free to report on any fixable errors in updating, even if they're not strictly spotnet errors. I'd like to be able to parse as many posts as possible!

hagst commented 11 years ago

573d53b: Allow some extra posts that contain formally invalid xml to be parsed.

This fix works and I haven't experienced any side effects. Thanks for fixing.
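For readers wondering what a tolerant parse of such headers might look like, here is a rough sketch. It is not necessarily what commit 573d53b does: it assumes the header is parsed with `xml.etree.ElementTree`, and `escape_bare_ampersands` and `parse_header` are hypothetical names. The idea is to retry with bare `&` characters escaped, leaving CDATA sections untouched.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does not start a predefined or numeric entity reference.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")
# Capturing group so that re.split keeps the CDATA sections in the result.
CDATA = re.compile(r"(<!\[CDATA\[.*?\]\]>)", re.DOTALL)


def escape_bare_ampersands(xml_text):
    """Escape stray '&' characters outside CDATA sections (hypothetical helper)."""
    parts = CDATA.split(xml_text)
    # With one capturing group, odd indexes hold the CDATA sections verbatim.
    return "".join(
        part if index % 2 else BARE_AMP.sub("&amp;", part)
        for index, part in enumerate(parts)
    )


def parse_header(xml_text):
    """Parse strictly first; fall back to the escaped text only on failure."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError:
        return ET.fromstring(escape_bare_ampersands(xml_text))


header = (
    "<Spotnet><Posting><Description>Tom & Jerry</Description>"
    "<Image><![CDATA[http://x/?a=1&b=2]]></Image></Posting></Spotnet>"
)
root = parse_header(header)
print(root.find("Posting/Description").text)
print(root.find("Posting/Image").text)
```

Trying the strict parse first means well-formed headers are never rewritten; the fallback only applies to posts that would otherwise be skipped.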

hagst commented 11 years ago

Spots with a bare & between <Image> and </Image> in the header, not wrapped in <![CDATA[ ]]>, are skipped.

Post has invalid XML data for header X-XML: not well-formed (invalid token): line 1, column 832 =& message-id Date: 18 Nov 2012 18:05:26 GMT

Post has invalid XML data for header X-XML: not well-formed (invalid token): line 1, column 819 =& message-id Date: 18 Nov 2012 16:37:03 GMT

Post has invalid XML data for header X-XML: not well-formed (invalid token): line 1, column 1402 =& message-id Date: 13 Nov 2012 18:05:32 GMT

Spots with a bare & between <Tag> and </Tag> in the header, not wrapped in <![CDATA[ ]]>, are skipped.

Post has invalid XML data for header X-XML: not well-formed (invalid token): line 1, column 109 =& message-id Date: 30 Oct 2012 13:22:11 GMT

Spots with a bare & between <Website> and </Website> in the header, not wrapped in <![CDATA[ ]]>, are skipped.

Post has invalid XML data for header X-XML: not well-formed (invalid token): line 1, column 913 =& message-id Date: 26 Oct 2012 09:30:29 GMT

Spots with between <Description> </Description> in the header are skipped

Post has invalid XML data for header X-XML: not well-formed (invalid token): line 1, column 203 message-id Date: 08 Nov 2012 11:34:10 GMT

Skipped spot because the 2 X-XML lines are split over multiple lines

Post has invalid XML data for header X-XML: unclosed CDATA section: line 1, column 316 message-id Date: 17 Sep 2012 08:11:23 GMT

Fix: Add X-XML: to the split lines, or merge everything from one X-XML: up to the next X-XML:, resulting in 2 X-XML: lines.
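The second option (merging stray lines into the preceding X-XML: line) could look roughly like the sketch below. It assumes the raw header block is available as a list of lines and that a real header field always has a space or tab after the colon; `merge_split_xml_headers` is a hypothetical name, not an existing django-spotnet function.

```python
import re

# "Name: value", requiring whitespace after the colon so that XML content
# such as "http://..." at the start of a line is not mistaken for a header.
HEADER_LINE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):[ \t](.*)$")


def merge_split_xml_headers(lines):
    """Return the X-XML values, gluing stray continuation lines onto the
    X-XML value that precedes them (hypothetical helper)."""
    values = []
    in_xml = False
    for line in lines:
        match = HEADER_LINE.match(line)
        if match:
            name, value = match.groups()
            in_xml = name.lower() == "x-xml"
            if in_xml:
                values.append(value)
        elif in_xml and line.strip():
            # Not a header field: treat it as the rest of the X-XML value.
            values[-1] += line.strip()
    return values


raw = [
    "Subject: example",
    "X-XML: <Spotnet><Posting><Description><![CDATA[some",
    "text]]></Description>",
    "X-XML: </Posting></Spotnet>",
    "Date: 17 Sep 2012 08:11:23 GMT",
]
print(merge_split_xml_headers(raw))
```

A continuation line that happens to start with something like `Word: ` would still be misread as a new header, so this is only a heuristic.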

alvra commented 11 years ago

Except for the one with the error in parsing the Description, these should now be fixed.

hagst commented 11 years ago

Test results after updating Spotnet to version 4e4fe6a:

The spots mentioned above are now valid, including the spot with the error in parsing the Description.

Spot with <![CATA[ in the header

Post has invalid XML data for header X-XML: not well-formed (invalid token): line 1, column 1853 messageid Date: Thu, 19 Jul 2012 18:22:00 GMT

Header shows <Image> <![CATA[http://x.x.x.jpg]]> </Image>

Spot with % in the header

Post has invalid XML data for header X-XML: not well-formed (invalid token): line 1, column 64 =% messageid Date: Fri, 13 Jul 2012 08:53:03 GMT

Header shows <Post%r>name</Poster> instead of <Poster>name</Poster>

Some spots with part of the header in the body

Header shows <Description> and no </Description>. Result: incomplete description, missing <Website>, <Image>, <Category>, <NZB>, </Posting>, </Spotnet>. The missing part of the header is listed first in the body.

Post has invalid XML data for header X-XML: unclosed CDATA section: line 1, column 188 messageid Date: Mon, 09 Jul 2012 21:10:33 +0200 Organization: Newsgrabber

Post has invalid XML data for header X-XML: unclosed CDATA section: line 1, column 197 messageid Organization: Newsgrabber Date: Tue, 11 Sep 2012 22:28:26 +0200

Same spotter and same problem.

Post has invalid XML data for header X-XML: unclosed CDATA section: line 1, column 339 messageid Date: Wed, 01 Aug 2012 16:23:48 -0500

Spot placed using Supernews with SSL (known Supernews problem).

Some spots with an incomplete header

Header shows <Description> and no </Description>. Result: incomplete description, missing <Website>, <Image>, <Category>, <NZB>, </Posting>, </Spotnet>. The missing part of the header is not listed in the body either.

Post has invalid XML data for header X-XML: unclosed CDATA section: line 1, column 1766 messageid Date: Tue, 17 Jul 2012 13:00:16 +0200 Organization: Newsxs (Secured through NewsXS SSL)

Post has invalid XML data for header X-XML: unclosed CDATA section: line 1, column 1766 messageid Date: Tue, 17 Jul 2012 13:01:55 +0200 Organization: Newsxs (Secured through NewsXS SSL)