fb55 / htmlparser2

The fast & forgiving HTML and XML parser
https://feedic.com/htmlparser2
MIT License
4.42k stars 373 forks source link

Brackets break attributes #14

Closed fb55 closed 11 years ago

fb55 commented 12 years ago

Just a note for anyone with a lot of time:

<div id=">"> foo </div>

Expected behavior: Return a div with an id of >, followed by the string foo. Result: A div with an id of ", followed by the string "> foo.

The problem is that there isn't a possibility for a look-ahead. Besides, the markup is clearly broken. The current result should be acceptable in most cases.

Edit: Apparently, that bug is well-known.

scriby commented 12 years ago

I just ran across this with embedded script.

For example, <button onclick="alert(x > 5);">test</button>. That's a simplified example of some actual script I have embedded.

It works to replace the > with >, but that makes it harder to read.

scriby commented 12 years ago

A similar issue exists with < as well.

matthewmueller commented 11 years ago

hey man, any chance you could have a look at this one when you have some time? I'm getting a lot of complaints from angular / knockout folks

fb55 commented 11 years ago

This would require pretty much a complete rewrite of the parser. When I get the FSM compiler I'm working on going (planning to spend some time on it next weekend), we'll have a full-featured HTML5 parser. That should fix this issue.

sindresorhus commented 11 years ago

FSM compiler?

Will it be spec compliant?

fb55 commented 11 years ago

Finite state machine compiler. As the HTML5 parser spec uses a FSM to describe the parser, I intend to use that as the foundation of the parser. My hopes are great speed, while reaching full compliance :)

sindresorhus commented 11 years ago

That sounds really great!

Is there any timeline for that work?

fb55 commented 11 years ago

This weekend, I want to finish a working prototype of the FSM compiler, the HTML5 parser will follow soon. But I don't promise anything: This is just a personal side-project, I'm not getting any profit out of it, so there is no timeline.

I spend time on it when I feel like it, with the intention to learn some things while doing so. This includes reading related academic material, some benchmark-driven development and trying things out, sometimes not even on the computer, but on a sheet of paper (state machines are great for that).

sindresorhus commented 11 years ago

@fb55 Thanks, exactly what I wanted to know :)

Good luck with the project and have fun.

Would be really interesting to read a blog post about the development of the parser

matthewmueller commented 11 years ago

hey man, sorry to keep bugging you about this. I'm starting to want this myself for computed properties in http://github.com/components/reactive.

Has there been any progress on this? Is there anything I can do to help? Also, I'm pretty sure lookaheads are supported in JS (by capturing groups). Is there any temporary regex fix we can implement in the meantime?

Also, I was looking around at some of the other parsers and it looks like there's a bunch of open tickets for this and none of them have really solved it. Would be sweet to be the first one :-D

Thanks man.

fb55 commented 11 years ago

The original tautologistics/node-htmlparser fixed that bug in a complete rewrite. It still has a number of bugs I fixed a while ago, so I wouldn't recommend a switch, but at least it shows how much changes are required to get a universal solution.

JS regexps support look-aheads, but that doesn't help here, as parsing depends on content inside brackets; further content might not be available. You probably want to use a FSM to solve this, and you could use the official HTML5 tokenizer algorithm as soon as you are there.

For now, I lack the time to continue working on this. A quick-and-dirty solution would be to replace everything that looks like a bracket inside an attribute with the entity, but this leads to new bugs. The regexp in Parser.js#L39 should be a good starting point for this.

fb55 commented 11 years ago

A quick update:

Today, I wrote a tokenizer for this project (5e6fcb3bdc8806c6a36743c14681862da212a952), which creates the same* tokens as the old parser, but fixes this specific bug.

Support for special tags (<script> and <style>) is currently missing, and segments that aren't completed are currently simply ignored (the old parser emits partial sections at the end). As soon as that's fixed, the parser will switch to the tokenizer.

*It won't split textblocks, comments & cdata into sections anymore (which wasn't nice, anyway).

PS: The FSM compiler was a great idea, but you can push it as far as you'd like, and that's what I currently don't have the time for. I'll probably pick it up for my bachelor thesis.

matthewmueller commented 11 years ago

Awesome man, thanks for the ongoing great work. Just to be clear - do you anticipate this tokenizer landing in node-htmlparser2 or do you see this commit as more of an exploration for now?

fb55 commented 11 years ago

I plan to use the tokenizer inside the parser. Parser.js should implement a subset of the HTML treebuilder algorithm, and nothing more.

fb55 commented 11 years ago

Fixed with 3.0.0.