"Remove empty bbcode" regex sometimes removes entire message

Oldiesmann commented 8 years ago

Ran into this issue on the seniors site I host after upgrading it to 2.0.12, but it's broken on 2.1 as well.

This code (on line 42 of Sources/Subs-Post.php in 2.1 - not sure about 2.0 offhand) causes the entire message to get removed.

    $message = preg_replace('~\[([^\]=\s]+)[^\]]*\](?' . '>\s|(?R))*?\[/\1\]\s?~i', '', $message);

This is the message that triggered the issue (from a forum I host). The first [center]...[/center] block works fine. If I try adding the second center block, the entire message disappears. I haven't tried pulling it apart beyond that. This is also broken in 2.0.12.

[center][color=navy][font=Verdana][size=14pt][b]Here's our

81st  Challenge for you all 
suggested by FlaJean who won the last Challenge.[/b][/size][/font][/color][/center]

[center][font=Verdana][color=blue][size=30pt][b]
[i]"C L O C K S"[/i][/b][/size][/color][/font][/center]

[font=Verdana][color=navy][size=10pt][b]Details:[/b]   [/size][/color][/font]

[list][li] [font=Verdana][color=green][size=10pt][b]From FlaJean:  "Give us Clocks. Any type of clocks (wall clocks, mounted clocks such as on a building, etc., but not in paintings)."[/b][/size][/color][/font]   [/li][/list]   

[list][li] [font=Verdana][color=navy][size=10pt]Use the attachment feature to show your photos in your post making sure that your image is no larger than 640 pixels wide or link to them from your gallery or an offsite location that allows linking using the image codes.  If you have trouble, please contact a moderator (name in blue) or an administrator (name in red) and we'll help you post your photo.  If you have your photo in your gallery, please don't use the attachment feature as that puts the photo on our server twice but rather link to it. [/size][/color][/font]   [/li][/list]

[list][li] [font=Verdana][color=navy][size=10pt]Please upload up to[b][color=red][size=14pt] 4[/size][/color][/b] entries per member for this challenge. [/size][/color][/font]   [/li][/list]

[list][li] [font=Verdana][color=navy][size=10pt]If you possibly can, please tell us the camera you used to take your photo and when you took it.  But if you don't know, please post the picture anyway.  We want to see your photos.  :)  Please only enter photos taken by you. [/size][/color][/font]   [/li][/list]

[list][li] [font=Verdana][color=navy][size=10pt]The challenge is open now and will be locked to uploads on [b]Saturday, October 22nd, 2016 [/b] and then we'll vote on them. [/size][/color][/font]   [/li][/list]

Here are a bunch of commits I found relating to the code... @08ff4bf @640cf11 @893de14 @0a1eafd @d5cda1c @8628bb6 @410a925

CC @Sesquipedalian

jdarwood007 commented 8 years ago

Any reason why this wont work?

    $message = preg_replace('~\[([^\]=\s]+)[^\]]*\]\s*\[/\1\]\s?~i', '', $message);

Sesquipedalian commented 8 years ago

Hm. Looks like that message produces catastrophic backtracking in the recursive regex, @Oldiesmann. I'll work on a fix for this ASAP.

I didn't know this code would be going out in 2.0.12. Will we be able to push out 2.0.13 as soon as I make a fix for this?

Your suggestion works for a single layer of empty BBCode, @jdarwood007 (in fact, it's what I was using in 0a1eafd) but not for multiple layers. For example, using a message like This is [i][b][/b][/i] some text, the [b][/b] would be matched and removed, but the enclosing [i][/i] would remain.

Sesquipedalian commented 8 years ago

I just submitted #3600 to fix this. I'm just using @jdarwood007's simpler, non-recursive regex in a while loop. This approach involves multiple invocations of the regex engine, but it gets the job done. There might be a more elegant and efficient way, but since we need to get a fix out for 2.0.x ASAP, this'll do well enough for now.

jdarwood007 commented 8 years ago

Does it need to be in a while loop to capture things like [b][s][/s][/b]?

A fix for 2.0 is needed, but lets try to fix it right.

MissAllSunday commented 8 years ago

Perhaps splitting the full message into smaller chunks and check those chunks instead might be better approach?

I honestly wouldn't imagine people use BBC for such monstrously formatted messages but unfortunately, they do, so the recursive approach will never be enough since there are always going to be super nest after nest, insert a table, another nest, another table, center, br, nest BBC formats.

Sesquipedalian commented 8 years ago

Does it need to be in a while loop to capture things like [b][s][/s][/b]?

There needs to be some way to recurse, yes. If we don't do it in the regex, then we need to do it in PHP. The while loop approach will take a relatively long time if—but only if—there are a whole bunch of nested BBCodes that are, in the end, empty. So, clearing out [b][i][u][s][b][i][u][s][b][i][u][s][/s][/u][/i][/b][/s][/u][/i][/b][/s][/u][/i][/b] would take 12 iterations of the loop. In contrast, dealing with @Oldiesmann's real world example would be quick, running the test just once and then moving on because there are no empty BBCodes in it.

Sesquipedalian commented 8 years ago

Perhaps splitting the full message into smaller chunks and check those chunks instead might be better approach?

Maybe, but how would we know where to split it?

MissAllSunday commented 8 years ago

Yes, thats something I was wondering about, every approach seems to be
limited somehow. Might be worth a shot to study how other tokenizer
libraries work, I was looking ad decoda https://github.com/milesj/decoda
the other day and will continue to do so to get some other ideas on how to
handle this.

On Thu, 06 Oct 2016 16:23:02 -0500, Jon Stovell notifications@github.com
wrote:

Perhaps splitting the full message into smaller chunks and check those
chunks instead might be better approach?

Maybe, but how would we know where to split it?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

Jessica González, missallsunday.com

jdarwood007 commented 8 years ago

@Sesquipedalian Ok, yea then that is why I believe you have ?R originally. I'm at the edge of my knowledge of regex here, only barely understand (? matches.

Sesquipedalian commented 8 years ago

Yeah. I suppose the problem is that, fundamentally, regular expressions are just not a very good tool for parsing balanced constructs like HTML or BBCode. Advanced regex techniques like recursion make it theoretically possible to match nested balanced constructs, but even in the best case scenarios it is expensive.

An idea that I've toyed with before is to convert all the <, >, and & characters in a message to entities, and then convert the [ and ] characters to < and >. Then we could use PHP's support for XML, XSL, and XPath to parse and transform it all for us. These technologies are designed precisely for parsing and transforming the sort of balanced constructs that BBCode is modelled on, and they are blisteringly fast at it. An XSLT to remove all empty nodes, for example, is trivial both to create and to execute.

The trouble, of course, is that implementing this would be (to put it mildly) a rather significant rewrite of SMF's entire BBCode processing system. It might be a brilliant change to implement in SMF 3.0—it would do wonders for parse_bbc()—but not when we're trying to get 2.1 out the door.

Sesquipedalian commented 8 years ago

I suppose another option would be simply deciding not to strip out empty BBC at all. After all, it doesn't really hurt anything to have it there. It is unnecessary cruft and it is better to remove it if possible, for the sake of faster execution in parse_bbc() when the time comes to display the message, but removing it isn't mission critical.

MissAllSunday commented 8 years ago

There was a report about using empty tags to bypass the censor feature and other annoying cases like that, that's the primary reason to do this

Sesquipedalian commented 8 years ago

Ah. That makes sense. Well, in a case like that, I'd say it's just fine if takes a few fractions of a second longer for a post to made.

Sesquipedalian commented 8 years ago

Can I suggest that we merge #3600 at least for now? If and when we come up with a better method, we can replace it then.

MissAllSunday commented 8 years ago

Yep

MissAllSunday commented 8 years ago

OK, I've been doing some tests with the fix you provided @Sesquipedalian and the examples from the bug report and it seems they all pass. It seems this solution is the best one considering all the factors involved, I'm preparing a new patch along with some other reports.

Sesquipedalian commented 8 years ago

Cool. I haven't been able to think of anything better either. When I initially came up with #3600, I was just thinking that it would be safe way to do the job, rather than an efficient way. But on further consideration, it has two very important virtues: it exits quickly when there's nothing to be done, and the processing time grows only linearly with each nested level of empty BBCode. Maybe some more time playing around with atomic groups could turn up a single regex with both of those virtues, but even then the cost savings will only consist in not needing to invoke the regex engine more than once.

SimpleMachines / SMF

"Remove empty bbcode" regex sometimes removes entire message #3599