erusev / parsedown

Better Markdown Parser in PHP
https://parsedown.org
MIT License
14.73k stars 1.12k forks source link

Parsedown does not treat emphasis tags (*, _, etc.) as literal when spaced out #531

Open richjeffery opened 6 years ago

richjeffery commented 6 years ago

Daring Fireball's documentation states:

Markdown treats asterisks (*) and underscores (_) as indicators of emphasis. (...) But if you surround an * or _ with spaces, it’ll be treated as a literal asterisk or underscore. ( https://daringfireball.net/projects/markdown/syntax#em )

So, in most Markdown parsers (including GitHub): *This is italic* produces: This is italic

and: * This is not italic * produces: This is not italic

However, the current version of Parsedown treats both as italic. This is confirmed with the current version of the http://parsedown.org/demo site, which shows Markdown PHP working as expected, and Parsedown tripping over.

richjeffery commented 5 years ago

Happy birthday, bug! Any further movement on this getting resolved, as it's been sitting as a bug within my application for a year now.

alec-w commented 5 years ago

I can replicate this on release 1.7.1 and directly on master - but not on the demo site http://parsedown.org/demo. There it has the same behaviour as Markdown PHP 1.3 and the results of Guthub preview. Any ideas on what version is being used on the demo site?

alec-w commented 5 years ago

Looks like this can be fixed by amending the $EmRegex to exclude matches that start/end with whitespace. The current regex is

protected $EmRegex = array(
    '*' => '/^[*]((?:\\\\\*|[^*]|[*][*][^*]+?[*][*])+?)[*](?![*])/s',
    '_' => '/^_((?:\\\\_|[^_]|__[^_]*__)+?)_(?!_)\b/us',
);

changing this to

protected $EmRegex = array(
    '*' => '/^[*]([^\s](?:\\\\\*|[^*]|[*][*][^*]+?[*][*])+?[^\s])[*](?![*])/s',
    '_' => '/^_([^\s](?:\\\\_|[^_]|__[^_]*__)+?[^\s])_(?!_)\b/us',
);

sorts it. Similarly, the $StrongRegex ought to be updated too, currently

protected $StrongRegex = array(
    '*' => '/^[*]{2}((?:\\\\\*|[^*]|[*][^*]*+[*])+?)[*]{2}(?![*])/s',
    '_' => '/^__((?:\\\\_|[^_]|_[^_]*+_)+?)__(?!_)/us',
);

and following the same pattern as above this becomes

protected $StrongRegex = array(
    '*' => '/^[*]([^\s](?:\\\\\*|[^*]|[*][*][^*]+?[*][*])+?[^\s])[*](?![*])/s',
    '_' => '/^__(([^\s]?:\\\\_|[^_]|_[^_]*+_)+?[^\s])__(?!_)/us',
);

Testing this with

$parsedown = new Parsedown();
echo $parsedown->line('*italic* * not italic* *also not italic * *italic again*');
echo '</br>';
echo $parsedown->line('_italic_ _ not italic_ _also not italic _ _italic again_');

previously gave italic not italic also not italic italic again italic not italic also not italic italic again but now gives italic not italic also not italic italic again italic not italic also not italic italic again gives the correct output now. @erusev if you agree with the above I can make a PR for it - or am I off the mark here?

aidantwoods commented 5 years ago

I was actually just in the process of exploring a fix around moving more towards the CommonMark spec for defining the opening and closing delimiter runs: since this'll help avoid some other discrepancies. I'm hoping that it might be possible to do so in a way that doesn't require leaning on intersection rules since resolving precedence rules about intersecting inlines is currently out of scope (see https://github.com/erusev/parsedown/issues/588#issuecomment-377221179). If we're pulling apart the emphasis parsing, then it'll also be an opportunity to address some long-standing more widespread escaping issues (#434)—if we can tackle emphasis then it should be fairly easy to address the simpler inlines.

If I can't make anything nice work around the CommonMark definitions, then the solution you've outlined above would also seem address the issue raised here fairly directly. (Note that it doesn't tackle part (b) in the CommonMark definition's of left/right flanking though: https://spec.commonmark.org/0.28/#left-flanking-delimiter-run, IMO we should prefer a solution that incorporates the CommonMark definition if possible).

richjeffery commented 5 years ago

alec-w's response works well, but there is a minor mistake in the strong regex - I believe it should be:

protected $StrongRegex = array(
        '*' => '/^[*]{2}([^\s](?:\\\\\*|[^*]|[*][*][^*]+?[*][*])+?[^\s])[*]{2}(?![*])/s',
        '_' => '/^__(([^\s]?:\\\\_|[^_]|_[^_]*+_)+?[^\s])__(?!_)/us',
    );
xserrat commented 5 years ago

Hi guys!

Any updates related to this issue?

Thanks!

richjeffery commented 5 years ago

Quick update, turns out alec-w's fix breaks if you have just two characters between your tags.

i.e.: this _is_ italic will not work correctly, nor will this **is** bold.

richjeffery commented 5 years ago

Figured out a correction that seems to work without breaking with only two or one character between your emphasis tags:

    protected $StrongRegex = array(
        '*' => '/^[*]{2}(?:([^\*\s]|[^\W](?:\\\\\*|[^*]|[*][^*]*+[*])*?[^\W]))[*]{2}(?![*])/s',
        '_' => '/^[_]{2}([^\W_]|(?:[^\W](?:\\\\_|[^_]|_[^_]*+_)*?[^\s]))[_]{2}(?!_)/s',
    );

    protected $EmRegex = array(
        '*' => '/^[*](?:([^\*\s]|[^\W](?:\\\\\*|[^*]|[*][^*]*+[*])*?[^\W]))[*](?![*])/s',
        '_' => '/^[_]([^\W_]|(?:[^\W_](?:\\\\_|[^_]|_[^_]*+_)*?[^\s_]))[_](?!_)/s',
    );
erusev commented 5 years ago

Can't we fix this with a lookaround?

Here's how it'd look for $EmRegex['_']:

- /^_((?:\\\\_|[^_]|__[^_]*__)+?)_(?!_)\b/us
+ /^_(?!\s)((?:\\\\_|[^_]|__[^_]*__)+?)(?<!\s)_(?!_)\b/us

Basically, the modification adds (?!\s) after the opening _ and a (?<!\s) before the closing _

Tests seem to pass.

Here's a quick demo: https://regex101.com/r/cBxRKo/1

richjeffery commented 5 years ago

@erusev - that's definitely a good solution. Whilst it requires more steps to complete on short phrases (e.g. _hello_) it definitely is a significantly superior regex in terms of completion steps for longer sentences. Thanks!

richjeffery commented 2 years ago

Tests seem to pass.

Here's a quick demo: https://regex101.com/r/cBxRKo/1

Looks like the testing may be lacking some nuance - just realised today that a bug I found with ***Test*** (which should be bold-italic) renders as Bold with an asterisk on each side when using this change.