Text after '<' character is lost

akhoury / bbcode-to-markdown

node module to convert bbcode to markdown

MIT License


Text after '<' character is lost #3

Closed: ghost closed this issue 8 years ago

ghost commented 9 years ago

Minimal example:

A mathematical range breaks document <3, 5]
This text will be lost

akhoury commented 9 years ago

hmm-- that's really the HTML parser's issue.

here's a simple test case.


var jsdom = require('jsdom-nogyp');
var doc = jsdom.jsdom(null, null, {
    features: {
        FetchExternalResources: false
    },
    url: "file://" + (process.cwd())
});
var win = doc.parentWindow;
var container = win.document.createElement('div');

var html = 'A mathematical range breaks document <3, 5] This text will be lost';
container.innerHTML = html;

console.log(container.innerHTML);
// will output
// A mathematical range breaks document

maybe we should ask here: https://github.com/dexteryy/jsdom-nogyp

akhoury commented 9 years ago

jsdom-nogyp is using https://github.com/fb55/htmlparser2

go to http://demos.forbeslindesay.co.uk/htmlparser2/

paste: A mathematical range breaks document <3, 5] This text will be lost

[
  {
    data: 'A mathematical range breaks document '
    type: 'text'
    next: null
    prev: null
    parent: null
  }
]

that's probably where we should ask

akhoury commented 9 years ago

so yea, a real browser seems to be able to figure that out correctly https://jsfiddle.net/fyx6fy0c/
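A likely reason the browser copes: under the HTML tokenizer rules, a '<' only starts a tag, comment or other markup when it is followed by an ASCII letter, '/', '!' or '?'. Anything else, such as the '3' here, is kept as literal text. Below is a minimal sketch of that rule as a pre-processing step. The helper name escapeStrayLessThan is hypothetical and is not part of bbcode-to-markdown, jsdom-nogyp or htmlparser2.

```javascript
// Hypothetical helper (an assumption for illustration, not a library API).
// Escapes every '<' that cannot start markup, i.e. any '<' that is NOT
// followed by an ASCII letter, '/', '!' or '?'.
function escapeStrayLessThan(input) {
  return input.replace(/<(?![A-Za-z\/!?])/g, '&lt;');
}

var html = 'A mathematical range breaks document <3, 5] This text will be lost';
console.log(escapeStrayLessThan(html));
// A mathematical range breaks document &lt;3, 5] This text will be lost

// Real tags are left alone; only the stray '<' is escaped.
console.log(escapeStrayLessThan('<p> <3, 5] </p>'));
// <p> &lt;3, 5] </p>
```

This is only a heuristic: text such as 'a <b' still looks like the start of a tag, so it does not replace a proper parser.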

ghost commented 9 years ago

It looks like htmlparser2 accepts any kind of tag, so as a simple workaround we can replace every occurrence of '<' with '&lt;' before parsing the input.

akhoury commented 9 years ago

no - that will break any real HTML tag, e.g. <p> <3, 5] </p>. You need a way to tell real tags apart from a stray '<' and parse them correctly, which is what htmlparser2 is supposed to do. A regex won't work, not with infinite nesting.

akhoury commented 9 years ago

instead of '&lt;' you can also put a space between the < and the 3 - i.e. < 3, 5]

akhoury commented 9 years ago

but the parser problem remains - you can't do that programmatically without an HTML parser.

UNLESS - you know for a fact that there is NO HTML in your text.
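For that no-HTML case, a minimal sketch of the blanket escape could look like the following. The helper name escapeAllLessThan is hypothetical, and the code assumes the input really contains no HTML that should survive.

```javascript
// Hypothetical helper (an assumption for illustration, not a library API).
// Escapes every '<', so it is only safe when the text contains no real HTML.
function escapeAllLessThan(input) {
  return input.replace(/</g, '&lt;');
}

console.log(escapeAllLessThan('A mathematical range breaks document <3, 5] This text will be lost'));
// A mathematical range breaks document &lt;3, 5] This text will be lost

// With real HTML in the input, the tags get escaped too:
console.log(escapeAllLessThan('<p> <3, 5] </p>'));
// &lt;p> &lt;3, 5] &lt;/p>
```

The second call shows the failure mode described above: real tags are turned into plain text along with the stray '<'.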

ghost commented 9 years ago

A BBCode post typically doesn't contain HTML, but if it has any, it should be escaped anyway imho. However, I agree with you: htmlparser2 should do its job, which is parsing HTML, not XML-like tags.

akhoury commented 8 years ago

fixed when i switched to using to-markdown