facebook / hhvm

A virtual machine for executing programs written in Hack.
https://hhvm.com

Zend incompatibility: xml_set_element_handler() behaves differently #1391

Open tholu opened 10 years ago

tholu commented 10 years ago

To reproduce, use the following script (e.g. xml_parse.php):

<?php
function start_elem($parser,$name,$attribs) {
   echo "<$name>";
}
function end_elem($parser,$name)
{
   echo "</$name>";
}

$parser=xml_parser_create();
xml_parser_set_option($parser,XML_OPTION_CASE_FOLDING,0);
xml_set_element_handler($parser,"start_elem","end_elem");
$buf = '<F>';
echo xml_parse($parser,$buf,strlen($buf)==0);

then compare

php xml_parse.php

Output: 1

with

hhvm --mode debug xml_parse.php

Run with r.

Output: <F>1

It seems that start_elem() is somehow not called in the Zend implementation (maybe a bug there or is this intended?).

php --version
PHP 5.4.4-14+deb7u7 (cli) (built: Dec 12 2013 08:42:07)
Copyright (c) 1997-2012 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2012 Zend Technologies
    with XCache v2.0.0, Copyright (c) 2005-2012, by mOo

scannell commented 10 years ago

Thanks for reporting this. I'm actually not sure what the right behavior is -- it's probably up for debate -- but someone should look at this at some point.

tholu commented 10 years ago

Note: This issue breaks http://php-java-bridge.sourceforge.net/pjb/

tholu commented 10 years ago

PHP calls start_elem after it has read the whole element (e.g. <F></F> works with both HHVM and PHP), while HHVM calls start_elem immediately after reading <F>.
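
A minimal sketch of the "fires immediately" behaviour, using Python's expat binding as a stand-in for an expat-backed parser. This is an assumption for illustration only, not HHVM's code, and the `events` list and the split into two chunks are hypothetical. It feeds the opening tag as a non-final chunk, as the repro script does, and records which handlers have run at that point:

```python
import xml.parsers.expat

events = []  # handler calls, in the order they fire


def start_element(name, attrs):
    events.append(("start", name))


def end_element(name):
    events.append(("end", name))


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element

# Feed only the opening tag and mark the chunk as non-final,
# mirroring xml_parse($parser, '<F>', false) from the report.
parser.Parse("<F>", False)
events_after_first_chunk = list(events)

# Feed the closing tag as the final chunk.
parser.Parse("</F>", True)

print(events_after_first_chunk)
print(events)
```

With expat the start handler has already run after the first chunk, which is the behaviour reported for HHVM above.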

eastzone commented 10 years ago

After some digging, it seems that HHVM's behavior is consistent with the underlying libexpat. See this example

#include <expat.h>
#include <stdio.h>
#include <string.h>

void start_element(void *data, const char *element, const char **attribute) {
  printf("<%s>\n", element);
}

void end_element(void *data, const char *element) {
  printf("</%s>\n", element);
}

int main(void) {
  XML_Parser parser = XML_ParserCreate(NULL);
  XML_SetElementHandler(parser, start_element, end_element);
  char buff[] = "<F>";
  XML_Parse(parser, buff, strlen(buff), XML_TRUE);
  XML_ParserFree(parser);
  return 0;
}

Python example

import xml.parsers.expat

# Print each tag as its handler fires (Python 3 syntax).
def start_element(name, attrs):
    print('<%s>' % name)

def end_element(name):
    print('</%s>' % name)

p = xml.parsers.expat.ParserCreate()

p.StartElementHandler = start_element
p.EndElementHandler = end_element

p.Parse("<F>")

Both will output <F>.

eastzone commented 10 years ago

This is likely to be a bug in php5. The following is done in PHP5.

<?php
function start_elem($parser,$name,$attribs) {
   echo "<$name>";
}
function end_elem($parser,$name)
{
   echo "</$name>";
}

$parser=xml_parser_create();
xml_parser_set_option($parser,XML_OPTION_CASE_FOLDING,0);
xml_set_element_handler($parser,"start_elem","end_elem");
$buf = '<F>';
echo xml_parse($parser,$buf,strlen($buf)==0);

This outputs only 1; start_elem is never called. But

$buf = '<Foo>';

will output <Foo>1, just like in HHVM.

Can you give some details of how this breaks pjb? It seems that one should not rely on this PHP5 behavior at all...

My php version

zeng-mbp:php-5.5.10 zeng$ php --version
PHP 5.4.24 (cli) (built: Jan 19 2014 21:32:15) 
Copyright (c) 1997-2013 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2013 Zend Technologies

tholu commented 10 years ago

Thanks for digging deeper. I have the following test script with PJB (Java.inc):

<?php
$query = "test";
require_once('Java.inc');
$escapedQuery = java_values(java('org.apache.lucene.queryParser.QueryParser')->escape($query));
var_dump($escapedQuery);

This works with PHP, while calling with HHVM gives

HipHop Fatal error: protocol error: <O v="1" m="org.apache.lucene.queryParser.QueryParser" p="O" n="F"/>,no element found at col 3. Check the back end log for OutOfMemoryErrors. in <path>/Java.inc on line 879

It has nothing to do with OutOfMemory errors of course. I tracked this down to the different behaviour of xml_set_element_handler in PHP and HHVM. I don't know if there are more problems if this is fixed, though.
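
For context on the "no element found" text in that message, here is a small sketch of how an expat-based parser produces it. It uses Python's expat binding as a stand-in (an assumption; I have not traced what Java.inc actually passes to xml_parse, and `error_text` is a hypothetical name). Expat raises this error when a chunk is marked final while an element is still open:

```python
import xml.parsers.expat

parser = xml.parsers.expat.ParserCreate()
error_text = None

try:
    # Mark the chunk as final even though <F> is never closed.
    parser.Parse("<F>", True)
except xml.parsers.expat.ExpatError as err:
    # Translate expat's numeric error code into its message string.
    error_text = xml.parsers.expat.ErrorString(err.code)

print(error_text)
```

This only shows where the wording comes from in expat. Whether Java.inc hits this exact path under HHVM is not confirmed here.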

shell-l-d commented 10 years ago

I've found the same bug: the start and stop element functions registered with xml_set_element_handler( $parser, "startElement", "stopElement" ); never get called (I tried adding echo "TEST"; to confirm this).

Sorry, I can't figure out how to show code, so I'm adding images instead (I tried < code > and [ code ]). [images: students_xml, viewstudentdata_php, viewstudentdata_php_output]

SiebelsTim commented 10 years ago

Guide for formatting: https://help.github.com/articles/github-flavored-markdown

Put your code in ``` tags

shell-l-d commented 10 years ago

Ooops ignore that, it appears that the $element_name variables are all capitalised :)

[image: viewstudentdata_output2]

tholu commented 10 years ago

Perhaps this comment on my corresponding thread from StackOverflow helps:

POSIX text files are expected to have a line ending. In your buffer that line ending is missing, which is why the element that is opened but never closed before reaching EOF (EOB) is cut from the input sequence, as data is missing. You could also just append a space or another character that would shift the internal state of the parser by at least one character, making it aware that your string should be an element. Your input, BTW, is not XML. You probably would like to make it self-closing like <F/>, which is supported by that parser. – hakre

http://stackoverflow.com/questions/21389028/php-xml-parse-and-xml-set-element-handler
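
As a rough illustration of the self-closing suggestion, the sketch below parses a complete, well-formed document with Python's expat binding. The binding is a stand-in chosen for illustration and `events` is a hypothetical name; it does not exercise PHP's libxml2 path. Because the input is a whole document, both handlers fire and no error is raised:

```python
import xml.parsers.expat

events = []  # handler calls, in the order they fire

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
parser.EndElementHandler = lambda name: events.append(("end", name))

# A self-closing element is a complete document, so it can be
# passed as the final chunk without leaving anything open.
parser.Parse("<F/>", True)

print(events)
```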

tholu commented 10 years ago

I think this boils down to HHVM using libexpat vs PHP using libxml2.

JoelMarcey commented 9 years ago

@tholu I believe you are correct in your final assessment. I am going to keep this item open as a wishlist item. If you would like to reimplement our xml libraries using libxml2, we would certainly consider a pull request :)