apostrophecms / sanitize-html

Clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance.
MIT License

Add a space where two tags met #49

Open danschumann opened 9 years ago

danschumann commented 9 years ago

I'm wondering if there is a way to create a space wherever tags were.

Picture this:

<div>Some sentence.</div><div>Some other Sentence</div>

It converts to Some sentence.Some other Sentence when I run

text = sanitizeHtml(text, {allowedTags: [], allowedAttributes: {}});

Is there an option to add whitespace so the output reads better, e.g. Some sentence. Some other Sentence?

dgrad commented 8 years ago

I think to do this properly you'll need a list of block tags (or a list of inline tags). You want to add a space wherever a block tag ends (actually it should probably be a newline character and let the browser convert it to space), but not where an inline tag ends (e.g. <div>foo</div><div>bar</div> should convert to foo bar, but <span>foo</span><span>bar</span> should convert to foobar).
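
The rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: `stripTagsBlockAware` and the short default tag list are hypothetical names, not part of sanitize-html, and a regex stands in for the real parser (htmlparser2), so it will misbehave on attributes containing `>`, on comments, and on similar edge cases. It also breaks on block tags that open, not only ones that close, so that text sitting directly before a `<div>` is separated too.

```javascript
// Hypothetical helper for illustration; not an API of sanitize-html.
// A deliberately short default list; callers can pass their own set.
const EXAMPLE_BLOCK_TAGS = new Set(['div', 'p', 'li', 'ul', 'ol', 'blockquote']);

function stripTagsBlockAware(html, blockTags = EXAMPLE_BLOCK_TAGS) {
  // Matches opening and closing tags and captures the tag name.
  const tagPattern = /<\/?([a-zA-Z][a-zA-Z0-9-]*)[^>]*>/g;

  // Block tags become a newline; every other tag is simply dropped.
  const withBreaks = html.replace(tagPattern, (match, name) =>
    blockTags.has(name.toLowerCase()) ? '\n' : ''
  );

  // Collapse whitespace runs to one space, roughly as a browser would render them.
  return withBreaks.replace(/\s+/g, ' ').trim();
}

console.log(stripTagsBlockAware('<div>foo</div><div>bar</div>'));       // foo bar
console.log(stripTagsBlockAware('<span>foo</span><span>bar</span>'));   // foobar
```

The final whitespace collapse is a convenience for plain-text output; if the result is going back into HTML, the newline could be left in place and the browser would treat it as a space.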

boutell commented 8 years ago

I agree, and there should be an option to override that list.

I'd take a pull request for this one.

On Fri, Nov 27, 2015 at 12:42 PM, Daniel Grad notifications@github.com wrote:

I think to do this properly you'll need a list of block tags (or a list of inline tags). You want to add a space wherever a block tag ends (actually it should probably be a newline character and let the browser convert it to space), but not where an inline tag ends (e.g. <div>foo</div><div>bar</div> should convert to foo bar, but <span>foo</span><span>bar</span> should convert to foobar).


SystemDisc commented 7 years ago

Until this gets implemented, a messy hack would be:

text = text.replace(/>/g, '> ');
text = sanitizeHtml(text, {allowedTags:[]});
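
For anyone trying the hack, here is a rough sketch of what it does and where it falls short. `stripAllTags` is a hypothetical regex stand-in for `sanitizeHtml(text, {allowedTags: []})` so that the snippet runs without the library; it does not parse or escape the way the real thing does. `stripWithSpaceHack` is likewise just an illustrative name, and the whitespace collapse at the end is an addition, not part of the hack as posted.

```javascript
// Assumption: a crude stand-in for sanitizeHtml(text, { allowedTags: [] }).
function stripAllTags(html) {
  return html.replace(/<[^>]*>/g, '');
}

function stripWithSpaceHack(html) {
  // The hack itself: put a space after every '>' before stripping tags.
  const spaced = html.replace(/>/g, '> ');
  // Added cleanup: collapse the doubled spaces and trim the trailing one.
  return stripAllTags(spaced).replace(/\s+/g, ' ').trim();
}

console.log(stripWithSpaceHack('<div>Some sentence.</div><div>Some other Sentence</div>'));
// Some sentence. Some other Sentence

console.log(stripWithSpaceHack('<span>foo</span><span>bar</span>'));
// foo bar
```

The second call shows the cost: because every tag gets a space, adjacent inline tags are split as well, which is exactly the case a block/inline distinction is meant to avoid.
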

rafacustodio commented 7 years ago

up!!

Is this still on?

boutell commented 7 years ago

@r-custodio As mentioned, I'd take a PR for this. Unfortunately as maintainer I can't necessarily implement every feature.

greghub commented 3 years ago

@abea this is labeled as seeking contributions but closed. Is it still something you'd accept a PR for?

abea commented 3 years ago

@greghub Sure. It had been sitting idle for years, so there didn't seem much reason to keep it open. I'll reopen it if you want to work on it for 2.x and let the stalebot close it if nothing happens.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

rusakovic commented 1 year ago

Until this gets implemented, a messy hack would be:

text = text.replace(/>/g, '> ');
text = sanitizeHtml(text, {allowedTags:[]});

it's 2023 now. thank you for your solution =)

boutell commented 1 year ago

@rusakovic Contributions are welcome!

adorum commented 7 months ago

it's 2024 now. :)

boutell commented 7 months ago

Yes, it's 2024 now, and as always, community contributions are welcome. 😄 This isn't a feature that matters for our use cases, although I appreciate it would be nice for developers reading the resulting markup.

boutell commented 7 months ago

Reopening for potential community PRs.

SystemDisc commented 3 months ago

Is anyone aware of a reliable way to get a list of block-tags? If so, this shouldn't be terribly difficult to implement, right? I'm not sure.

BoDonkey commented 3 months ago

Somewhat dubious source, but ChatGPT says: "Sure! Here is a list of HTML block-level tags:"

  1. <address>
  2. <article>
  3. <aside>
  4. <blockquote>
  5. <canvas>
  6. <dd>
  7. <div>
  8. <dl>
  9. <dt>
  10. <fieldset>
  11. <figcaption>
  12. <figure>
  13. <footer>
  14. <form>
  15. <h1> to <h6>
  16. <header>
  17. <hr>
  18. <li>
  19. <main>
  20. <nav>
  21. <ol>
  22. <p>
  23. <pre>
  24. <section>
  25. <table>
  26. <ul>

These tags are generally used to structure the main content of an HTML document.
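
If a list like this were used as a default, it could be held in a `Set` for quick lookups. The sketch below simply transcribes the list above, expanding "h1 to h6" into six entries; `BLOCK_TAGS` and `isBlockTag` are hypothetical names, and the list's accuracy is only as good as its source.

```javascript
// Expand "h1 to h6" into individual tag names.
const HEADINGS = Array.from({ length: 6 }, (_, i) => `h${i + 1}`);

// Transcribed from the list above; not verified against the HTML spec.
const BLOCK_TAGS = new Set([
  'address', 'article', 'aside', 'blockquote', 'canvas', 'dd', 'div', 'dl',
  'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form', ...HEADINGS,
  'header', 'hr', 'li', 'main', 'nav', 'ol', 'p', 'pre', 'section', 'table',
  'ul'
]);

// Tag names in HTML are case-insensitive, so normalize before the lookup.
function isBlockTag(name) {
  return BLOCK_TAGS.has(String(name).toLowerCase());
}

console.log(isBlockTag('DIV'), isBlockTag('h4'), isBlockTag('span'));
// true true false
```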

abea commented 3 months ago

It might be easier to exclude inline elements. There are fewer(?) and they're generally easier to identify. The list of phrasing content elements minus inputs, media, br, and super randos (e.g., ruby) looks to be a pretty good start.
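
A minimal sketch of that inverse approach, assuming a hand-picked subset of phrasing content: anything not in the inline set is treated as a text separator. `INLINE_TAGS` and `separatesText` are hypothetical names, and the exact membership of the set is a judgment call, not something taken from sanitize-html.

```javascript
// Phrasing-content elements that should NOT introduce a space when removed.
// Inputs, media, <br>, and rarer elements such as <ruby> are left out on purpose.
const INLINE_TAGS = new Set([
  'a', 'abbr', 'b', 'bdi', 'bdo', 'cite', 'code', 'data', 'dfn', 'em', 'i',
  'kbd', 'mark', 'q', 's', 'samp', 'small', 'span', 'strong', 'sub', 'sup',
  'time', 'u', 'var', 'wbr'
]);

// Everything outside the inline set counts as a separator.
function separatesText(tagName) {
  return !INLINE_TAGS.has(tagName.toLowerCase());
}

console.log(separatesText('div'));                // true
console.log(separatesText('span'));               // false
console.log(separatesText('br'));                 // true
console.log(separatesText('my-custom-element'));  // true
```

One consequence of this design is that unknown and custom tags become separators by default, whereas browsers render unknown elements inline. Whether that is the better default depends on the input being cleaned.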

SystemDisc commented 3 months ago

This is what I got from ChatGPT:

Here is a comprehensive list of block-level HTML tags/elements available in the latest implementation(s) of HTML, including both standard and experimental/non-standard elements:

Standard Block-Level Elements

  1. <address> - Represents contact information for the author/owner of a document.
  2. <article> - Represents a self-contained composition in a document.
  3. <aside> - Represents a section of content indirectly related to the content around it.
  4. <blockquote> - Represents a section that is quoted from another source.
  5. <canvas> - Used for rendering graphics via scripting (usually JavaScript).
  6. <dd> - Describes a term in a description list.
  7. <div> - A generic container for flow content.
  8. <dl> - Represents a description list.
  9. <dt> - Represents a term/name in a description list.
  10. <fieldset> - Groups related elements in a form.
  11. <figcaption> - Represents a caption or legend for the content of its parent <figure>.
  12. <figure> - Represents self-contained content, like illustrations, diagrams, photos, etc.
  13. <footer> - Represents a footer for its nearest sectioning content or sectioning root element.
  14. <form> - Represents a document section containing interactive controls for submitting information.
  15. <h1> to <h6> - Represent headings, with <h1> being the highest (or most important) level and <h6> the lowest.
  16. <header> - Represents introductory content, typically a group of introductory or navigational aids.
  17. <hgroup> - Groups a heading (<h1>-<h6>) with related content such as a subheading or tagline.
  18. <hr> - Represents a thematic break between paragraph-level elements.
  19. <li> - Represents a list item in an ordered or unordered list.
  20. <main> - Represents the main content of the <body>.
  21. <nav> - Represents a section of a page that links to other pages or to parts within the page.
  22. <ol> - Represents an ordered list of items.
  23. <p> - Represents a paragraph.
  24. <pre> - Represents preformatted text.
  25. <section> - Represents a standalone section — which doesn't have a more specific semantic element to represent it.
  26. <table> - Represents tabular data.
  27. <ul> - Represents an unordered list of items.

Deprecated Block-Level Elements

  1. <center> - Used to center-align text (use CSS instead).
  2. <dir> - Used for directory lists (use <ul> instead).

Other Standard Block-Level Elements (previously experimental or repurposed)

  1. <menu> - Initially intended for context menus, lists of commands, or form controls. It has been repurposed as a semantic alternative to <ul> and is less common in modern usage.
  2. <dialog> - Represents a dialog box or other interactive component, such as an inspector or window.
  3. <details> - Represents a disclosure widget from which the user can obtain additional information or controls.
  4. <summary> - Represents a summary, caption, or legend for a <details> element's disclosure box.

These elements are considered block-level because they typically start on a new line and take up the full width available (unless otherwise styled with CSS).

For the most up-to-date list, always refer to the latest HTML specification and browser documentation, as new elements and updates can be introduced.


Here is a comprehensive list of inline HTML tags/elements available in the latest implementation(s) of HTML, including both standard and experimental/non-standard elements:

Standard Inline Elements

  1. <a> - Defines a hyperlink.
  2. <abbr> - Represents an abbreviation or acronym.
  3. <b> - Represents a span of text stylistically different from normal text, without conveying any extra importance or emphasis.
  4. <bdi> - Isolates a span of text that might be formatted in a different direction from other text outside it.
  5. <bdo> - Overrides the current text direction.
  6. <br> - Produces a line break in text.
  7. <cite> - Represents the title of a work.
  8. <code> - Displays a fragment of computer code.
  9. <data> - Links a given content with a machine-readable translation.
  10. <dfn> - Indicates the term being defined within the context of a definition phrase or sentence.
  11. <em> - Marks text that has stress emphasis.
  12. <i> - Represents a span of text in an alternate voice or mood, or otherwise offset from the normal prose in a manner indicating a different quality of text.
  13. <img> - Embeds an image into the document.
  14. <input> - Allows the user to enter data.
  15. <kbd> - Represents user input from a keyboard, voice input, or any other text entry device.
  16. <label> - Represents a caption for an item in a user interface.
  17. <mark> - Represents text that has been highlighted for reference or notation purposes.
  18. <meter> - Represents either a scalar value within a known range or a fractional value.
  19. <noscript> - Defines a section of text to be displayed if a script type on the page is unsupported or if scripting is currently turned off in the browser.
  20. <object> - Represents an external resource, which can be treated as an image, a nested browsing context, or a resource to be handled by a plugin.
  21. <output> - Represents the result of a calculation or user action.
  22. <picture> - Contains zero or more <source> elements and one <img> element to offer alternative versions of an image for different display/device scenarios.
  23. <progress> - Represents the completion progress of a task.
  24. <q> - Indicates that the enclosed text is a short inline quotation.
  25. <s> - Represents text that is no longer accurate or relevant.
  26. <samp> - Represents sample output from a program or computing system.
  27. <script> - Contains scripting statements, or points to an external script file through the src attribute.
  28. <select> - Represents a control that provides a menu of options.
  29. <small> - Represents side comments and small print, such as copyright or legal text; rendered in a smaller font by default.
  30. <span> - Generic inline container for phrasing content, which does not inherently represent anything.
  31. <strong> - Indicates that its contents have strong importance, seriousness, or urgency.
  32. <sub> - Specifies inline text which should be displayed as subscript.
  33. <sup> - Specifies inline text which should be displayed as superscript.
  34. <template> - Holds client-side content that will not be rendered when the page loads but can be instantiated later using JavaScript.
  35. <textarea> - Represents a multi-line plain-text editing control.
  36. <time> - Represents either a time on a 24-hour clock or a precise date in the Gregorian calendar.
  37. <u> - Represents a span of inline text which should be rendered in a way that indicates that it has a non-textual annotation.
  38. <var> - Represents the name of a variable in a mathematical expression or a programming context.
  39. <wbr> - Represents a word break opportunity.

Deprecated Inline Elements

  1. <acronym> - Represents an acronym; use <abbr> instead.
  2. <big> - Makes the text font size one size larger.
  3. <tt> - Represents text in a fixed-pitch font; use CSS instead.
  4. <font> - Defines font, color, and size for text; use CSS instead.

Non-Standard/Experimental Inline Elements

  1. <slot> - Part of the Web Components technology suite, it is a placeholder inside a web component that you can fill with your own markup, similar to a content placeholder in other templating systems.

These elements are considered inline because they do not start on a new line and only take up as much width as necessary. For the most accurate and up-to-date list, always refer to the latest HTML specification and browser documentation.


In HTML, custom or undefined tags are treated as inline elements by default. This means that if you define a custom tag that is not recognized by the HTML specification, it will behave like an inline element unless you explicitly style it using CSS.

For example, if you create a custom tag <my-custom-element>, it will be treated as an inline element:

<my-custom-element>This is a custom element.</my-custom-element>

To change its behavior to a block-level element, you need to use CSS:

my-custom-element {
    display: block;
}

This CSS rule will make the custom element behave as a block-level element:

<my-custom-element>This is a custom element.</my-custom-element>

With the CSS applied, <my-custom-element> will now start on a new line and take up the full width available, like standard block-level elements.
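
A sanitizer only sees the markup, not the stylesheet, so it cannot know that `<my-custom-element>` has been given `display: block`. That is one reason the override option requested earlier in the thread matters. A minimal sketch of how such an override might be resolved, assuming hypothetical `add` and `remove` option names and a deliberately short default list (none of this is an existing sanitize-html option):

```javascript
// Illustrative defaults only; a real default list would be longer.
const DEFAULT_BLOCK_TAGS = ['address', 'blockquote', 'div', 'li', 'p', 'pre', 'section'];

// Hypothetical option shape: { add: [...], remove: [...] }.
function resolveBlockTags({ add = [], remove = [] } = {}) {
  const tags = new Set(DEFAULT_BLOCK_TAGS);
  for (const name of add) tags.add(name.toLowerCase());
  for (const name of remove) tags.delete(name.toLowerCase());
  return tags;
}

// Treat a custom element as block-level and stop treating <pre> as one.
const blockTags = resolveBlockTags({ add: ['my-custom-element'], remove: ['pre'] });

console.log(blockTags.has('my-custom-element'));  // true
console.log(blockTags.has('pre'));                // false
console.log(blockTags.has('div'));                // true
```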