erusev / parsedown

Better Markdown Parser in PHP
https://parsedown.org
MIT License

Parsing of "__foo__" error #364

Open doiftrue opened 9 years ago

doiftrue commented 9 years ago

One: if we parse these lines:

post__in foo
bar post__in

we get

<p>post<strong>in foo
bar post</strong>in</p>

The __ is replaced with <strong>, which is an error.

Two: on GitHub, for example, even post__in__ is not parsed as bold. In Parsedown we get post<strong>in</strong>. I'm not sure which behavior is right here; for now the second example causes me no trouble.

doiftrue commented 9 years ago

Found a solution to the problem. Maybe it will be helpful. The current patterns are:

    protected $StrongRegex = array(
        '*' => '/^[*]{2}((?:\\\\\*|[^*]|[*][^*]*[*])+?)[*]{2}(?![*])/s',
        '_' => '/^__((?:\\\\_|[^_]|_[^_]*_)+?)__(?!_)/us',
    );

Add \n to the negated character class in the _ pattern:

'*' => '/^[*]{2}((?:\\\\\*|[^*]|[*][^*]*[*])+?)[*]{2}(?![*])/s',
'_' => '/^__((?:\\\\_|[^_\n]|_[^_]*_)+?)__(?!_)/us',

One more question: why do we need the s modifier? <strong> and <em> are inline elements, aren't they?
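The effect of the change can be checked outside Parsedown with plain preg_match. This is a minimal sketch, not Parsedown code: the helper matchesStrong and the variable names are hypothetical, and the excerpt assumes the inline parser hands the pattern the text starting at the first "__".

```php
<?php
// The `_` patterns from the comment above, before and after adding \n.
$original = '/^__((?:\\\\_|[^_]|_[^_]*_)+?)__(?!_)/us';
$patched  = '/^__((?:\\\\_|[^_\n]|_[^_]*_)+?)__(?!_)/us';
// Same as $original, but without the s modifier.
$originalNoS = '/^__((?:\\\\_|[^_]|_[^_]*_)+?)__(?!_)/u';

// Hypothetical helper: true when the pattern matches at the start of $excerpt.
function matchesStrong(string $pattern, string $excerpt): bool
{
    return preg_match($pattern, $excerpt) === 1;
}

// Assumed excerpt: the input from the first "__" onward.
$excerpt = "__in foo\nbar post__in";

var_dump(matchesStrong($original, $excerpt));     // bool(true): spans the line break
var_dump(matchesStrong($patched, $excerpt));      // bool(false): stops at the \n
var_dump(matchesStrong($originalNoS, $excerpt));  // bool(true): same result without s
var_dump(matchesStrong($patched, '__foo__ bar')); // bool(true): single-line bold still works
```

The s modifier only changes what "." matches, and these patterns contain no ".", so dropping it appears to make no difference here. Note that the patched pattern can still cross a line break through the _[^_]*_ alternative, for example in "__a _b\nc_ d__".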

doiftrue commented 9 years ago

A bit more research, based on the spec.commonmark.org parser: _ and * are parsed differently.

Ex:

post__in foo
bar post__in

becomes

<p>post__in foo
bar post__in</p>

but if we try ** instead, we get this:

post**in foo
bar post**in

becomes

<p>post<strong>in foo
bar post</strong>in</p>

The conclusion is that _ can be part of a word, so if it appears directly next to a word it should not be parsed. But * can't be part of a word, so it is parsed the way Parsedown does it now.

So for now I use these regexes:

$this->StrongRegex['_'] = '/^__((?:\\\\_|[^_\n]|_[^_]*_)+?)__(?!_)\b/us';
$this->EmRegex['_'] = '/^_((?:\\\\_|[^_\n]|__[^_]*__)+?)_(?!_)\b/us';

I also need a \b at the beginning, but the pattern starts with ^. It seems I need to use the context in the parse function, but I don't know how to do that correctly.
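One possible way to get the left-hand check, sketched here under assumptions: if the handler can see the whole line and the offset of the marker, preg_match can be given that offset, and a lookbehind can then inspect the character before the marker. The function matchStrongUnderscore and its parameters are hypothetical and not part of Parsedown; \G replaces ^ so the match is anchored at the offset.

```php
<?php
// Hypothetical helper, not Parsedown API.
// $line   : the full line (the "context")
// $offset : byte position of the opening "__" within $line
// Returns the bold content, or null when "__" should stay literal.
function matchStrongUnderscore(string $line, int $offset): ?string
{
    // \G anchors at $offset; (?<!\w) rejects a marker glued to a word on the left;
    // the trailing \b (from the regex above) rejects one glued to a word on the right.
    $pattern = '/\G(?<!\w)__((?:\\\\_|[^_\n]|_[^_]*_)+?)__(?!_)\b/us';

    // Passing $offset (instead of substr) lets the lookbehind see earlier characters.
    if (preg_match($pattern, $line, $m, 0, $offset) === 1) {
        return $m[1];
    }

    return null;
}

var_dump(matchStrongUnderscore("post__in foo\nbar post__in", 4)); // NULL
var_dump(matchStrongUnderscore('see __foo__ here', 4));           // string(3) "foo"
var_dump(matchStrongUnderscore('__foo__bar', 0));                 // NULL
```

This only covers ASCII word characters in the cases shown; how it should behave next to non-ASCII letters or punctuation is left open.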

erusev commented 9 years ago

Thanks, I appreciate it.

Sigma-90 commented 6 years ago

I've noticed a similar problem with underscores, so I figured it is probably related and thus decided to post it here instead of opening up another issue. It seems that single underscores break the detection of bold marker pairs:

The following Markdown is from a README on GitHub:

* __callback_params__  
  __Expects:__ Array / Void  
  __Default Value:__ null  
  __Description:__  
...

GitHub works as expected and renders:

<li>
<p><strong>callback_params</strong><br>
<strong>Expects:</strong> Array / Void<br>
<strong>Default Value:</strong> null<br>
<strong>Description:</strong><br>
...

Running it through Parsedown, however, yields quite different results:

<li>
<p>__callback_params<strong><br />
</strong>Expects:<strong> Array &#x2F; Void<br />
</strong>Default Value:<strong> null<br />
</strong>Description:__<br />
...

This is a severe issue, considering Markdown is quite popular for documenting code and variable names are often written with underscores (not everyone camelCases everything).
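The first line of that README can be reproduced with the `_` strong pattern quoted earlier in this thread, without running Parsedown itself. A minimal sketch; the helper strongContent is hypothetical and only wraps preg_match.

```php
<?php
// The `_` strong pattern as quoted earlier in this thread.
$strong = '/^__((?:\\\\_|[^_]|_[^_]*_)+?)__(?!_)/us';

// Hypothetical helper: returns the captured bold content, or null on no match.
function strongContent(string $pattern, string $excerpt): ?string
{
    return preg_match($pattern, $excerpt, $m) === 1 ? $m[1] : null;
}

var_dump(strongContent($strong, '__Expects:__ Array / Void')); // string(8) "Expects:"
var_dump(strongContent($strong, '__callback_params__'));       // NULL
```

The likely cause: the single _ inside the name can only be consumed by the _[^_]*_ alternative, which then swallows one of the two closing underscores, so the closing __ is never found. With the opening marker left unmatched, the closing __ of the first line would pair with the opening __ of the next line, which is consistent with the shifted <strong> tags in the output above.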

domsson commented 5 years ago

This seems to be related to #198 and #703