KosmosisDire / obsidian-webpage-export

Export html from single files, canvas pages, or whole vaults. Direct access to the exported HTML files allows you to publish your digital garden anywhere. Focuses on flexibility, features, and style parity.
https://docs.obsidianweb.net/
MIT License

Certain html tags are excluded from RSS feed's description #437

Open TohidN opened 6 months ago

TohidN commented 6 months ago

Description: In the RSS file's <description> tags, anchors are included, but elements such as toggle lists, numbered lists, and horizontal rules are stripped. Open the lib/rss.xml file and check the content of the <description> tags, or follow the RSS feed. You will see that certain elements, such as lists, are missing and that the content of math blocks is completely removed.

Describe the solution you'd like: option for:

KosmosisDire commented 6 months ago

Many RSS readers do not support HTML at all and don't parse HTML content. Thus I want the description to be as close to plain text as possible. I did a lot of experimentation to make sure the RSS description was as readable as possible in multiple environments.

Including the full HTML for the file isn't really possible; again, RSS feeds do not support many HTML features, and if the full HTML were included it would look completely broken in almost every RSS reader.

The first option you described is basically what I am doing right now, but I am keeping the most common tags: those that are almost always supported, those that don't add much clutter if they are not supported, and those that are particularly important for RSS feeds that do support them.
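
For illustration only, the "keep a few common tags, flatten everything else to text" approach described above could be sketched as follows. This is not the plugin's actual code: the `ALLOWED_TAGS` set, the `AllowlistFilter` class, and the `filter_html` function are hypothetical names chosen for the sketch, and the real set of retained tags is not stated in this thread.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist; the tags the plugin really keeps are not listed here.
ALLOWED_TAGS = {"a", "p", "br", "strong", "em"}


class AllowlistFilter(HTMLParser):
    """Keep allowlisted tags, drop all other tags but keep their text."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            # Only the href attribute of anchors is preserved.
            kept = "".join(
                f' {k}="{escape(v or "")}"'
                for k, v in attrs
                if tag == "a" and k == "href"
            )
            self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        # <br> is a void element, so it gets no closing tag.
        if tag in self.allowed and tag != "br":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape text so the result is still valid markup.
        self.parts.append(escape(data, quote=False))


def filter_html(html, allowed=ALLOWED_TAGS):
    parser = AllowlistFilter(allowed)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

With this sketch, list markup such as `<ul><li>one</li></ul>` is reduced to the bare text `one`, which is the kind of flattening the issue describes.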

TohidN commented 5 months ago

Thanks for your response @KosmosisDire, here are a few comments I would like to add:

> Many RSS readers do not support html at all and don't parse html content. Thus I want the description to be as close to plain text as possible. I did a lot of experimentation to make sure the RSS description was as readable as possible in multiple environments.

You are right that some of the HTML used in web pages can't be used in RSS, but many supported and useful tags are either stripped of their formatting or removed. I suggest using a Markdown parser and an HTML converter module, with a few changes to the conversion process, to get much better results.


> Including the full html for the file isn't really possible, again RSS feeds do not support many html features and if the full html was included it would look completely broken in almost every RSS reader.

I suggest parsing the note (getting blocks) and pre-processing (cleaning) the blocks before converting them to RSS-specific HTML:

  1. If the Markdown note is not already parsed, parse it to get content blocks (each block can be a single text line, or multiple text lines for a specific component such as a code block or blockquote).
  2. Embeds: Find and replace ![[*]] with [[*]] so that embeds no longer show up as a plain-text file name.
  3. Callouts: Find and replace > [!*] and > [!*]- with >. If the resulting first line is exactly >, remove that line from the text. This preserves callout titles while removing callout formatting, turning the callout into a normal blockquote.
  4. Code blocks: Depending on the Markdown-to-HTML module you are using, you might want to remove certain code block descriptors (e.g. ```mermaid -> ```) before processing. These descriptors may include mermaid, which is natively supported, or others added by plugins. The Markdown-to-HTML module should also convert code blocks to <pre><code></code></pre> HTML for compatibility.
  5. Math blocks: Depending on your Markdown-to-HTML module, changing math blocks ($$expression$$) into code blocks (replacing $$ with ```) might be needed.
  6. Blocks that map to supported HTML tags, such as lists and horizontal dividers, should be converted; all other blocks should be displayed as their raw Markdown content, which is more readable than the current state.
    • I suggest a parser because some blocks can start with tabs, representing sub-blocks in a nested list. Also, some operations are better performed on specific blocks instead of running search and replace over all content. E.g. the following block can break the HTML output because of --- (I remember this issue happening in web page generation as well, where the following code prevents all content after --- from rendering):
      ```mermaid
      ---
      title: Node with text
      ---
      flowchart LR
      id1[This is the text in the box]
      ```
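
As a rough sketch of steps 1 to 5 above, the pre-processing could look like the following. It assumes a deliberately simple block splitter that only distinguishes backtick-fenced code from everything else, so that a `---` inside a fence is never treated as a divider. The function names (`split_blocks`, `clean_text_line`, `preprocess`) and the regular expressions are hypothetical, not part of the plugin or of any Markdown library, and nested callouts and `~~~` fences are not handled.

```python
import re
from typing import List, Optional, Tuple

FENCE = re.compile(r"^\s*```")                    # opening or closing fence line
INFO = re.compile(r"^(\s*```+).*$")               # fence plus optional descriptor
EMBED = re.compile(r"!\[\[([^\]]+)\]\]")          # ![[target]]
CALLOUT = re.compile(r"^(\s*>)\s*\[![^\]]*\][+-]?\s*")  # > [!type] or > [!type]-


def split_blocks(markdown: str) -> List[Tuple[str, List[str]]]:
    """Step 1 (simplified): split into ("code", lines) and ("text", lines)."""
    blocks, current, in_code = [], [], False
    for line in markdown.splitlines():
        if FENCE.match(line):
            if in_code:                 # closing fence ends the code block
                current.append(line)
                blocks.append(("code", current))
                current, in_code = [], False
            else:                       # opening fence starts a code block
                if current:
                    blocks.append(("text", current))
                current, in_code = [line], True
        else:
            current.append(line)
    if current:
        blocks.append(("code" if in_code else "text", current))
    return blocks


def clean_text_line(line: str) -> Optional[str]:
    """Steps 2, 3 and 5 for one non-code line; None means drop the line."""
    line = EMBED.sub(r"[[\1]]", line)            # embeds become plain links
    line, hits = CALLOUT.subn(r"\1 ", line)      # callout header to blockquote
    if hits:
        line = line.rstrip()
        if line.strip() == ">":                  # untitled callout header
            return None
    return line.replace("$$", "```")             # math block to code block


def preprocess(markdown: str) -> str:
    out: List[str] = []
    for kind, lines in split_blocks(markdown):
        if kind == "code":
            out.append(INFO.sub(r"\1", lines[0]))  # step 4: drop the descriptor
            out.extend(lines[1:])                  # code content is left untouched
        else:
            for line in lines:
                cleaned = clean_text_line(line)
                if cleaned is not None:
                    out.append(cleaned)
    return "\n".join(out)
```

The result is still Markdown; it would then be passed to whichever Markdown-to-HTML module is in use. Doing the replacements per block rather than over the whole note is what keeps the `---` lines inside the mermaid example from being rewritten.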

> The first option you described is basically what I am doing right now but I am keeping the most common tags that are either almost always supported or don't add much clutter if they are not supported, or are particularly important for RSS feeds that do support it

  • My request for a "Page summary" option mainly focuses on adding only the first paragraph (or an option for the first $n$ words). It helps keep the size of the XML file down, as some lightweight RSS readers or devices can't properly handle large files. In this case, in the RSS reader's interface, most notes will look like a plain-text excerpt of a blog post with a "continue reading" link. A "Full Page" option can have the full rendered HTML content with the fixes mentioned above.
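
A minimal sketch of such a summary option, assuming the note text has already been reduced to plain text: the function name `summarize`, the `max_words` parameter, and the trailing "..." marker are hypothetical choices for illustration, not existing plugin settings.

```python
import re
from typing import Optional


def summarize(text: str, max_words: Optional[int] = None) -> str:
    """Return the first paragraph, or the first max_words words if given."""
    if max_words is None:
        # Paragraphs are separated by one or more blank lines.
        for paragraph in re.split(r"\n\s*\n", text.strip()):
            if paragraph.strip():
                return " ".join(paragraph.split())
        return ""
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    # Mark the cut so a reader knows there is more to read.
    return " ".join(words[:max_words]) + "..."
```

For example, `summarize(note_text)` would give the first-paragraph excerpt, while `summarize(note_text, max_words=50)` would cap the description at 50 words; a "continue reading" link would be added by whatever code builds the feed item.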

KosmosisDire commented 4 months ago

Thanks, there is absolutely a lot that could be done to improve the RSS generation. I am focusing on other features right now, but I am happy to accept pull requests and I might take another swing at it further in the future. Thanks again!