canonical / ubuntu.com

The official website for the Ubuntu operating system
https://ubuntu.com

parse error: https://ubuntu.com/blog/feed #13401

Open nobuto-m opened 9 months ago

nobuto-m commented 9 months ago

Summary

https://ubuntu.com/blog/feed intermittently fails to parse. Public feed validators report the same failure:

https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Fubuntu.com%2Fblog%2Ffeed

This feed does not validate.

'utf-8' codec can't decode byte 0x83 in position 0: invalid start byte (maybe a high-bit character?) [help]

line 1, column 0: XML parsing error: <unknown>:1:0: not well-formed (invalid token) [help]

    ?????ʝD?1?=o?"?æh?ϛ???^u?

Source: https://ubuntu.com/blog/feed
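
The first validator message can be reproduced locally. In UTF-8, 0x83 is a continuation byte, so it can never start a character, and a payload that begins with it fails to decode at position 0. A minimal sketch in Python; the `bad` payload is a made-up two-byte stand-in for illustration, not the actual response body:

```python
def try_decode_utf8(payload: bytes) -> str:
    """Return 'ok' if payload decodes as UTF-8, else the decoder's error text."""
    try:
        payload.decode("utf-8")
        return "ok"
    except UnicodeDecodeError as exc:
        return str(exc)


# Hypothetical payloads for illustration only; not the real feed bytes.
good = '<?xml version="1.0" encoding="UTF-8"?><rss/>'.encode("utf-8")
bad = b"\x83w"

print(try_decode_utf8(good))
print(try_decode_utf8(bad))
```

The second call reports the same "invalid start byte" failure at position 0 that the validator shows, which suggests the body is not text at all rather than mis-encoded XML.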

Process

Read the feed with any feed reader or parser.

Current and expected result

Current: The feed intermittently cannot be parsed by the client.

Expected: No parse error, and the content of the feed is visible in the client.

Screenshot

image

Browser details

NodeOperationError: Non-whitespace before first tag.
Line: 0
Column: 1
Char: �
    at Object.execute (/usr/local/lib/node_modules/n8n/node_modules/n8n-nodes-base/nodes/RssFeedRead/RssFeedRead.node.ts:75:11)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at Workflow.runNode (/usr/local/lib/node_modules/n8n/node_modules/n8n-workflow/src/Workflow.ts:1284:8)
    at /usr/local/lib/node_modules/n8n/node_modules/n8n-core/src/WorkflowExecute.ts:1018:29


nobuto-m commented 7 months ago

Ah, it looks like it's intermittently reproducible.

$ for _ in {1..10}; do wget https://ubuntu.com/blog/feed; done
--2024-02-07 11:13:05--  https://ubuntu.com/blog/feed
Resolving ubuntu.com (ubuntu.com)... 2620:2d:4000:1::26, 2620:2d:4000:1::28, 2620:2d:4000:1::27, ...
Connecting to ubuntu.com (ubuntu.com)|2620:2d:4000:1::26|:443... connected.
HTTP request sent, awaiting response... 200 
Length: 116522 (114K) [application/rss+xml]
Saving to: ‘feed’

2024-02-07 11:13:07 (174 KB/s) - ‘feed’ saved [116522/116522]

--2024-02-07 11:13:07--  https://ubuntu.com/blog/feed
Resolving ubuntu.com (ubuntu.com)... 2620:2d:4000:1::28, 2620:2d:4000:1::26, 2620:2d:4000:1::27, ...
Connecting to ubuntu.com (ubuntu.com)|2620:2d:4000:1::28|:443... connected.
HTTP request sent, awaiting response... 200 
Length: unspecified [application/rss+xml]
Saving to: ‘feed.1’

2024-02-07 11:13:09 (137 KB/s) - ‘feed.1’ saved [34371]

...
$ file feed*
feed:   XML 1.0 document, Unicode text, UTF-8 text, with very long lines (1643)
feed.1: data
feed.2: data
feed.3: data
feed.4: data
feed.5: XML 1.0 document, Unicode text, UTF-8 text, with very long lines (1643)
feed.6: XML 1.0 document, Unicode text, UTF-8 text, with very long lines (1643)
feed.7: XML 1.0 document, Unicode text, UTF-8 text, with very long lines (1643)
feed.8: XML 1.0 document, Unicode text, UTF-8 text, with very long lines (1643)
feed.9: XML 1.0 document, Unicode text, UTF-8 text, with very long lines (1643)

$ du -h feed*
116K    feed
36K     feed.1
36K     feed.2
36K     feed.3
36K     feed.4
116K    feed.5
116K    feed.6
116K    feed.7
116K    feed.8
116K    feed.9

$ xmllint --noout feed; echo $?
0

$ xmllint --noout feed.1; echo $?
feed.1:1: parser error : Start tag expected, '<' not found
�w
^
1

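feed and feed.{5..9} are good, but feed.{1..4} are bad: file reports them as plain data, they are about 36K instead of 116K, and xmllint rejects them at the first byte.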

feed_good.xml.gz feed_bad.xml.gz
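
The same check could be scripted so that each attempt records the response headers alongside the parse result. This is only a sketch under assumptions: `is_well_formed_xml` and `sample_feed` are hypothetical helper names, the well-formedness test uses Python's standard-library parser rather than xmllint, and the choice of headers to record (Content-Length, Content-Encoding) is a guess at what might differ between good and bad responses.

```python
import urllib.request
import xml.etree.ElementTree as ET


def is_well_formed_xml(data: bytes) -> bool:
    """Return True if data parses as XML (roughly what `xmllint --noout` checks)."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


def sample_feed(url: str, attempts: int = 10) -> list[dict]:
    """Fetch url repeatedly; record body size, two headers, and the parse result."""
    results = []
    for _ in range(attempts):
        with urllib.request.urlopen(url, timeout=30) as resp:
            body = resp.read()
            results.append({
                "bytes": len(body),
                "content_length": resp.headers.get("Content-Length"),
                "content_encoding": resp.headers.get("Content-Encoding"),
                "well_formed": is_well_formed_xml(body),
            })
    return results
```

Calling sample_feed("https://ubuntu.com/blog/feed") and comparing the rows where well_formed is False against the others may show whether the bad responses carry different headers, which could help narrow down where the bad data comes from.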

nobuto-m commented 7 months ago

I also filed this internally, since I'm not sure whether it's a content-generation issue or an environment/infra issue. https://portal.admin.canonical.com/C161991/