andrejewski / himalaya

JavaScript HTML to JSON Parser
http://andrejewski.github.io/himalaya
ISC License
921 stars 129 forks source link

Don't allow block level elements inside an unclosed <p> tag #88

Open DAreRodz opened 6 years ago

DAreRodz commented 6 years ago

It would be nice that Himalaya handles weird cases like browsers do, e.g. the following code:

<body>
  <p>
  <div></div>
</body>

In Himalaya, the resulting JSON is an element <p> wrapping an element <div>, like in this case:

<body>
  <p>
    <div></div>
  </p>
</body>

But browsers close the <p> tag before opening the <div>.

<body>
  <p></p>
  <div></div>
</body>

This is because a <p> element’s end tag may be omitted if the element is immediately followed by some tags, as indicated in the HTML5 spec.

andrejewski commented 6 years ago

Himalaya does account for some of these cases with closingTagAncestorBreakers, where we could add { p: [...blockLevelTags] }.

I think my hesitation to include all possible rules stems from not wanting to keep updating a list of different HTML tag categories. I will reconsider. The issues on my mind are:

It may be worth unblocking your work with the above defaults change. I am working on a few projects right now so this is not a high priority and will take time to find a good solution.

If anyone has thoughts, feel free to share them here.

DAreRodz commented 6 years ago

I'm wondering if you would accept a pull request for solving this particular issue, and maybe in the future for other possible cases that may come up.

The whole list of rules could be long but it's something that doesn't seem to change a lot (especially considering that the HTML specification maintains rules for historical reasons to keep running in older browsers), so I think it would be a good idea to handle most of the possible cases according to the specs.

We can help you with PR's anyway :)

andrejewski commented 6 years ago

@DAreRodz I think the fix to your issue is a few lines of code on your end:

import {parseDefaults} from 'himalaya'
parseDefaults.closingTagAncestorBreakers.p = ['div']

This is preferable to me because I don't want to make many patch releases as these types of issues are found. I would much rather put in the work upfront to make a one-shot release containing:

This will take time, when I can find it. Let me know if the above workaround does not work.

DAreRodz commented 6 years ago

@andrejewski I tried to make it work using the workaround, but it's still not working.

Looking at the code, it seems like the parsing algorithm automatically closes tags that are included in closingTags but only when an opening tag of the same type appears.

Moreover, closingTagAncestorBreakers is a way to prevent the auto-close behaviour mentioned above, so if I use the following line

parseDefaults.closingTagAncestorBreakers.p = ['div']

what I'm actually doing is allowing nested <p> tags when a <div> tag appears between them, and that's the opposite of what I'm trying to do!

I would suggest to replace closingTags array by an object and thus define a list of tags that could close an specific tag.

Anyway, this issue is not critical for us and we will continue using this library.