Closed Oldiesmann closed 8 years ago
Any reason why this won't work?
$message = preg_replace('~\[([^\]=\s]+)[^\]]*\]\s*\[/\1\]\s?~i', '', $message);
Hm. Looks like that message produces catastrophic backtracking in the recursive regex, @Oldiesmann. I'll work on a fix for this ASAP.
I didn't know this code would be going out in 2.0.12. Will we be able to push out 2.0.13 as soon as I make a fix for this?
Your suggestion works for a single layer of empty BBCode, @jdarwood007 (in fact, it's what I was using in 0a1eafd) but not for multiple layers. For example, using a message like This is [i][b][/b][/i] some text, the [b][/b] would be matched and removed, but the enclosing [i][/i] would remain.
I just submitted #3600 to fix this. I'm just using @jdarwood007's simpler, non-recursive regex in a while loop. This approach involves multiple invocations of the regex engine, but it gets the job done. There might be a more elegant and efficient way, but since we need to get a fix out for 2.0.x ASAP, this'll do well enough for now.
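For anyone following along, here is a minimal sketch of the loop approach described above. It is not the actual code from #3600; the function name strip_empty_bbc is made up for illustration, and the only thing taken from the thread is the regex itself.

```php
<?php
// Hypothetical helper, not the code merged in #3600.
function strip_empty_bbc(string $message): string
{
    // The simple, non-recursive pattern quoted earlier in this thread.
    $pattern = '~\[([^\]=\s]+)[^\]]*\]\s*\[/\1\]\s?~i';

    do {
        $previous = $message;
        // Each pass removes only the innermost empty pairs.
        $message = preg_replace($pattern, '', $previous);
        // preg_replace() returns null on a PCRE error; stop in that case.
    } while ($message !== null && $message !== $previous);

    // Fall back to the last good value if the regex engine failed.
    return $message ?? $previous;
}

echo strip_empty_bbc('This is [i][b][/b][/i] some text'), "\n";
```

The first pass removes [b][/b], the second removes the now-empty [i][/i], and the third finds nothing left to change, so the loop ends.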
Does it need to be in a while loop to capture things like [b][s][/s][/b]?
A fix for 2.0 is needed, but let's try to fix it right.
Perhaps splitting the full message into smaller chunks and checking those chunks instead might be a better approach?
I honestly wouldn't have imagined people using BBC for such monstrously formatted messages, but unfortunately they do. The recursive approach will never be enough, because there will always be messages with nest upon nest: insert a table, another nest, another table, center, br, and yet more nested BBC formats.
Does it need to be in a while loop to capture things like [b][s][/s][/b]?
There needs to be some way to recurse, yes. If we don't do it in the regex, then we need to do it in PHP. The while loop approach will take a relatively long time if, but only if, there are a whole bunch of nested BBCodes that are, in the end, empty. So, clearing out [b][i][u][s][b][i][u][s][b][i][u][s][/s][/u][/i][/b][/s][/u][/i][/b][/s][/u][/i][/b] would take 12 iterations of the loop. In contrast, dealing with @Oldiesmann's real world example would be quick, running the test just once and then moving on because there are no empty BBCodes in it.
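To make the iteration count concrete, here is a rough sketch that counts how many passes actually remove something. The function name strip_and_count is hypothetical, and it uses the optional count argument of preg_replace() to detect when a pass did nothing.

```php
<?php
// Hypothetical helper for illustration; returns [cleaned message, removing passes].
function strip_and_count(string $message): array
{
    $pattern = '~\[([^\]=\s]+)[^\]]*\]\s*\[/\1\]\s?~i';
    $passes = 0;

    while (true) {
        // $removed receives the number of replacements made in this pass.
        $result = preg_replace($pattern, '', $message, -1, $removed);
        if ($result === null || $removed === 0) {
            break; // PCRE error, or nothing left to strip
        }
        $message = $result;
        $passes++;
    }

    return [$message, $passes];
}

// The 12-level example from above, built with str_repeat().
$nested = str_repeat('[b][i][u][s]', 3) . str_repeat('[/s][/u][/i][/b]', 3);
[$cleaned, $passes] = strip_and_count($nested);
echo $passes, "\n";
```

Each pass strips only the innermost pair, so the 12 levels need 12 removing passes plus one final check that finds nothing. A message with no empty BBCode costs a single regex invocation.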
Perhaps splitting the full message into smaller chunks and checking those chunks instead might be a better approach?
Maybe, but how would we know where to split it?
Yes, that's something I was wondering about; every approach seems to be limited somehow. It might be worth a shot to study how other tokenizer libraries work. I was looking at decoda (https://github.com/milesj/decoda) the other day and will continue to do so to get some other ideas on how to handle this.
On Thu, 06 Oct 2016 16:23:02 -0500, Jon Stovell notifications@github.com wrote:
Perhaps splitting the full message into smaller chunks and check those chunks instead might be better approach?
Maybe, but how would we know where to split it?
Jessica González, missallsunday.com
@Sesquipedalian OK, yeah, then that is why I believe you had ?R originally. I'm at the edge of my knowledge of regex here; I only barely understand (? matches.
Yeah. I suppose the problem is that, fundamentally, regular expressions are just not a very good tool for parsing balanced constructs like HTML or BBCode. Advanced regex techniques like recursion make it theoretically possible to match nested balanced constructs, but even in the best case scenarios it is expensive.
An idea that I've toyed with before is to convert all the <, >, and & characters in a message to entities, and then convert the [ and ] characters to < and >. Then we could use PHP's support for XML, XSL, and XPath to parse and transform it all for us. These technologies are designed precisely for parsing and transforming the sort of balanced constructs that BBCode is modelled on, and they are blisteringly fast at it. An XSLT to remove all empty nodes, for example, is trivial both to create and to execute.
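A very rough sketch of what that idea might look like, using DOMDocument and DOMXPath rather than XSLT. Everything here is an assumption made for illustration: the function name strip_empty_bbc_via_dom is made up, only simple paired tags without attributes (like [b]...[/b]) are converted, and any message that does not come out as well-formed XML is returned untouched. Real BBCode with attributes or unpaired tags would need far more work.

```php
<?php
// Hypothetical sketch; requires PHP's DOM extension.
function strip_empty_bbc_via_dom(string $message): string
{
    // 1. Escape <, > and & so they cannot be mistaken for markup.
    $escaped = htmlspecialchars($message, ENT_NOQUOTES, 'UTF-8');

    // 2. Turn [tag] and [/tag] into <tag> and </tag> (lowercased, no attributes).
    $xml = preg_replace_callback(
        '~\[(/?)([a-z]+)\]~i',
        function (array $m): string {
            return '<' . $m[1] . strtolower($m[2]) . '>';
        },
        $escaped
    );

    // 3. Parse it; give up quietly if the tags are not balanced.
    $oldSetting = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadXML('<?xml version="1.0" encoding="UTF-8"?><root>' . $xml . '</root>');
    libxml_clear_errors();
    libxml_use_internal_errors($oldSetting);
    if (!$loaded) {
        return $message;
    }

    // 4. Remove leaf elements holding only whitespace until none remain.
    $xpath = new DOMXPath($doc);
    do {
        $empty = iterator_to_array($xpath->query('/root//*[not(*) and normalize-space(.) = ""]'));
        foreach ($empty as $node) {
            $node->parentNode->removeChild($node);
        }
    } while ($empty !== []);

    // 5. Serialise the children of <root> and convert back to BBCode.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    $out = preg_replace('~<(/?[a-z]+)>~i', '[$1]', $out);

    return htmlspecialchars_decode($out, ENT_NOQUOTES);
}
```

Unlike the regex in this thread, this sketch does not swallow the whitespace after a removed tag, so the two approaches are not drop-in replacements for each other.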
The trouble, of course, is that implementing this would be (to put it mildly) a rather significant rewrite of SMF's entire BBCode processing system. It might be a brilliant change to implement in SMF 3.0 (it would do wonders for parse_bbc()), but not when we're trying to get 2.1 out the door.
I suppose another option would be simply deciding not to strip out empty BBC at all. After all, it doesn't really hurt anything to have it there. It is unnecessary cruft and it is better to remove it if possible, for the sake of faster execution in parse_bbc() when the time comes to display the message, but removing it isn't mission critical.
There was a report about using empty tags to bypass the censor feature, along with other annoying cases like that. That's the primary reason to do this.
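To illustrate the censor bypass, here is a small sketch. The naive_censor function is a made-up stand-in and not SMF's actual censor code, and strip_empty_pairs is likewise hypothetical; the point is only that an empty tag pair splits a word so a plain string match misses it.

```php
<?php
// Hypothetical stand-in for a word censor; NOT SMF's real implementation.
function naive_censor(string $message, array $banned): string
{
    return str_ireplace($banned, '***', $message);
}

// Hypothetical helper: repeat the regex from this thread until nothing changes.
function strip_empty_pairs(string $message): string
{
    $pattern = '~\[([^\]=\s]+)[^\]]*\]\s*\[/\1\]\s?~i';
    do {
        $previous = $message;
        $message = preg_replace($pattern, '', $previous) ?? $previous;
    } while ($message !== $previous);

    return $message;
}

$post = 'What a dar[b][/b]n mess';

// The empty pair hides the word from the censor, yet renders as nothing.
echo naive_censor($post, ['darn']), "\n";

// Stripping empty BBCode first lets the censor see the word.
echo naive_censor(strip_empty_pairs($post), ['darn']), "\n";
```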
Ah. That makes sense. Well, in a case like that, I'd say it's just fine if it takes a few fractions of a second longer for a post to be made.
Can I suggest that we merge #3600 at least for now? If and when we come up with a better method, we can replace it then.
Yep
OK, I've been doing some tests with the fix you provided, @Sesquipedalian, and the examples from the bug report, and they all seem to pass. This looks like the best solution considering all the factors involved. I'm preparing a new patch along with some other reports.
Cool. I haven't been able to think of anything better either. When I initially came up with #3600, I was just thinking that it would be a safe way to do the job, rather than an efficient way. But on further consideration, it has two very important virtues: it exits quickly when there's nothing to be done, and the processing time grows only linearly with each nested level of empty BBCode. Maybe some more time playing around with atomic groups could turn up a single regex with both of those virtues, but even then the only saving would be not needing to invoke the regex engine more than once.
Ran into this issue on the seniors site I host after upgrading it to 2.0.12, but it's broken on 2.1 as well.
This code (on line 42 of Sources/Subs-Post.php in 2.1 - not sure about 2.0 offhand) causes the entire message to get removed.
This is the message that triggered the issue (from a forum I host). The first [center]...[/center] block works fine. If I try adding the second center block, the entire message disappears. I haven't tried pulling it apart beyond that. This is also broken in 2.0.12.
Here are a bunch of commits I found relating to the code... @08ff4bf @640cf11 @893de14 @0a1eafd @d5cda1c @8628bb6 @410a925
CC @Sesquipedalian