Open jj5 opened 7 years ago
If you have an example of a data structure we can push through as a test, we'll get this included.
Some details on the reference types (r and R) here:
http://www.phpinternalsbook.com/classes_objects/serialization.html
Example object:
$obj = new stdClass;
$obj->p1 = 'abc';
$obj->p2 = $obj;
$obj->p3 =& $obj->p1;
echo serialize($obj);
// O:8:"stdClass":3:{s:2:"p1";s:3:"abc";s:2:"p2";r:1;s:2:"p3";R:2;}
p3 is a reference to the second value in the structure. p2 is a reference to the first value (which is the whole thing). I think r is used for object references and R for explicit =& references.
I think this parser should not just declare "recursion" and walk away. So long as it can keep track of each value it encounters, then it should be able to link the references properly, so the final parsed structure will have its own proper recursion in it.
Indexing a reference to each value is easy enough. The difficulty comes in the order in which they are parsed. If a reference points forward to an element that has not yet been parsed, then we need to keep it until later to link it up. So creating back-references can be done immediately, but forward references would be kept to reference when possible (or left until the end). A two-pass parsing could also work, but is unnecessary IMO.
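To make the "index every value, link back-references immediately" idea concrete, here is a minimal sketch in Python. It is not the code from this library; the names (unserialize, PhpObject, slots) are made up for illustration. It assumes ASCII-only input and handles only the i, s, a, O, r and R tags. It also assumes, from my reading of how PHP numbers values, that an r: entry takes up a slot of its own while an R: entry does not, and that array keys are never counted.

```python
class PhpObject(dict):
    """Property map of a serialized object; class_name is kept alongside."""
    class_name = ""


def unserialize(data):
    """Parse a subset of PHP's serialize() output: i, s, a, O, r and R."""
    slots = []  # slots[n - 1] is the value that r:n or R:n refers to
    pos = 0

    def take(ch):
        # Return the text up to the next `ch` and step past it.
        nonlocal pos
        end = data.index(ch, pos)
        text, pos = data[pos:end], end + 1
        return text

    def quoted():
        # Read <length>:"<text>" and leave pos just after the closing quote.
        nonlocal pos
        length = int(take(":"))
        text = data[pos + 1:pos + 1 + length]
        pos += length + 2
        return text

    def members(container, count):
        nonlocal pos
        pos += 1  # skip '{'
        for _ in range(count):
            key = value(is_key=True)
            container[key] = value()
        pos += 1  # skip '}'
        return container

    def value(is_key=False):
        nonlocal pos
        tag = data[pos]
        pos += 2  # skip the tag and its ':'
        if tag in "rR":
            index = int(take(";"))
            if not 1 <= index <= len(slots):
                raise ValueError(f"slot {index} has not been parsed yet")
            target = slots[index - 1]
            if tag == "r":
                slots.append(target)  # assumption: r: uses a slot, R: does not
            return target
        if tag == "a":
            count = int(take(":"))
            result = {}
            slots.append(result)  # register before the children are parsed
            return members(result, count)
        if tag == "O":
            result = PhpObject()
            result.class_name = quoted()
            pos += 1  # skip ':' after the class name
            count = int(take(":"))
            slots.append(result)
            return members(result, count)
        if tag == "i":
            result = int(take(";"))
        elif tag == "s":
            result = quoted()
            pos += 1  # skip ';'
        else:
            raise ValueError(f"unsupported type tag {tag!r}")
        if not is_key:
            slots.append(result)  # keys are not counted as values
        return result

    return value()


obj = unserialize('O:8:"stdClass":3:{s:2:"p1";s:3:"abc";s:2:"p2";r:1;s:2:"p3";R:2;}')
print(obj["p2"] is obj)  # True: the recursion is rebuilt, not just flagged
print(obj["p3"])         # abc
```

Registering a container in the slot table before its members are parsed is what lets a child point back at its own parent. Python strings are immutable, so an R: pointing at a scalar only shares the value here; a real implementation would need some kind of box to model a live =& reference.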
I'm working on this in the background. The approach I'm taking is:
[...] (R or r) will result in the reference number being stored as an intermediate (temporary) value and the path to the reference being stored in a list like the values. The list does not need to be ordered in that case - just a stack.
[...] (=& $scalar) or a reference-link pointer to an object (= $object).
Notes:
Hopefully that's clear. I'm just posting this to avoid duplicated effort, and to show it's not that simple (probably why none of the C/Python/C# libraries I've found for tackling this have even attempted recursion on the source data).
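The deferred-linking step could look roughly like the Python sketch below. Everything in it is hypothetical: Ref is a made-up placeholder class, link_references is a made-up helper, and the structure is built by hand to stand in for whatever a first parsing pass would produce. It only shows the fix-up pass, where each stored path (container plus key) is popped off the stack and the temporary reference number is swapped for the real value.

```python
class Ref:
    """Placeholder left in the tree for an r:n or R:n entry."""

    def __init__(self, slot):
        self.slot = slot


def link_references(slots, pending):
    """Replace each placeholder with the value its slot number names."""
    while pending:
        container, key = pending.pop()  # order does not matter: a plain stack
        placeholder = container[key]
        container[key] = slots[placeholder.slot - 1]


# Hand-built stand-in for a first pass over the stdClass example:
# p2 was serialized as r:1 and p3 as R:2.
obj = {"p1": "abc", "p2": Ref(1), "p3": Ref(2)}
slots = [obj, "abc"]                  # slot 1 is the object, slot 2 is p1
pending = [(obj, "p2"), (obj, "p3")]  # paths to the unresolved references

link_references(slots, pending)
print(obj["p2"] is obj)  # True
print(obj["p3"])         # abc
```

This sketch does not deal with a slot that itself still holds a placeholder; a fuller version would have to resolve those chains first.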
Been playing with the recursion over the weekend, and it seems that the way PHP serializes it is rather bizarre. Take this as an example:
$arr = [
'a' => 'one',
];
$arr['b'] =& $arr['a'];
var_dump($arr);
echo serialize($arr);
/*
array(2) {
["a"]=>
&string(3) "one"
["b"]=>
&string(3) "one"
}
a:2:{s:1:"a";s:3:"one";s:1:"b";R:2;}
*/
Here element b references element a. The var_dump shows the value of element b since that is what it contains (a and b share the same source data). The serialize does something different: it provides the value for a the first time the shared data is encountered. The second time it is encountered, it is shown as a hard reference to data item number 2 (the whole array is item number 1, and a is item number 2). This is exactly what I would expect, and that's easy enough to parse.
If a is itself an array, then it works just the same:
$arr = [
'a' => ['x' => 'ten', 'y' => 'eleven'],
];
$arr['b'] =& $arr['a'];
/*
array(2) {
["a"]=>
&array(2) {
["x"]=>
string(3) "ten"
["y"]=>
string(6) "eleven"
}
["b"]=>
&array(2) {
["x"]=>
string(3) "ten"
["y"]=>
string(6) "eleven"
}
}
a:2:{s:1:"a";a:2:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";}s:1:"b";R:2;}
*/
Note that the "ten" and "eleven" elements are shown twice in the var_dump() for convenience, but only appear once in the serialization. Again, simple to parse.
Now this is where it starts to get crazy with the recursion. If b references the root array rather than the a element, this is what happens:
$arr = [
'a' => ['x' => 'ten', 'y' => 'eleven'],
];
$arr['b'] =& $arr;
/*
array(2) {
["a"]=>
array(2) {
["x"]=>
string(3) "ten"
["y"]=>
string(6) "eleven"
}
["b"]=>
&array(2) {
["a"]=>
array(2) {
["x"]=>
string(3) "ten"
["y"]=>
string(6) "eleven"
}
["b"]=>
*RECURSION*
}
}
a:2:{s:1:"a";a:2:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";}s:1:"b";a:2:{s:1:"a";a:2:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";}s:1:"b";R:5;}}
*/
Again, the var_dump() shows the value that is referenced, and recognises where recursion occurs and labels it appropriately. That's good and consistent with the previous examples.
But now look at the serialized string. Suddenly the source (root) array is being replicated - it is NOT a reference any more. The b in that duplication is a reference though, but to the copy of the root array (data item number 5) and not to the root array itself (data item number 1).
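To double-check which value a given R:n lands on, the data items can be numbered mechanically. The Python sketch below is an illustration only: slot_paths is a made-up helper, it handles just the i, s, a and R tags on ASCII input, and it assumes that array keys and R: entries are not counted as data items.

```python
def slot_paths(data):
    """Map each data item number to the key path of the value stored there."""
    paths = {}
    pos = 0

    def take(ch):
        # Return the text up to the next `ch` and step past it.
        nonlocal pos
        end = data.index(ch, pos)
        text, pos = data[pos:end], end + 1
        return text

    def walk(path, is_key=False):
        nonlocal pos
        tag = data[pos]
        pos += 2  # skip the tag and its ':'
        if tag != "R" and not is_key:
            paths[len(paths) + 1] = path  # this value takes the next number
        if tag in "iR":
            return take(";")
        if tag == "s":
            length = int(take(":"))
            text = data[pos + 1:pos + 1 + length]
            pos += length + 3  # opening quote, text, closing quote, ';'
            return text
        if tag == "a":
            count = int(take(":"))
            pos += 1  # skip '{'
            for _ in range(count):
                key = walk(path, is_key=True)
                walk(f"{path}[{key}]")
            pos += 1  # skip '}'
            return None
        raise ValueError(f"unsupported type tag {tag!r}")

    walk("$arr")
    return paths


serialized = ('a:2:{s:1:"a";a:2:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";}'
              's:1:"b";a:2:{s:1:"a";a:2:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";}'
              's:1:"b";R:5;}}')
paths = slot_paths(serialized)
print(paths[1])  # $arr
print(paths[5])  # $arr[b]
```

Item 1 is the root array and item 5 is the array stored under b, so the R:5 inside that array points back at the copy rather than at the root.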
If I add an extra element to the a array then I see it appear twice, so internally the data is a reference:
$arr['a']['z'] = 'twelve';
/*
array(2) {
["a"]=>
array(3) {
["x"]=>
string(3) "ten"
["y"]=>
string(6) "eleven"
["z"]=>
string(6) "twelve"
}
["b"]=>
&array(2) {
["a"]=>
array(3) {
["x"]=>
string(3) "ten"
["y"]=>
string(6) "eleven"
["z"]=>
string(6) "twelve"
}
["b"]=>
*RECURSION*
}
}
a:2:{s:1:"a";a:3:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";s:1:"z";s:6:"twelve";}s:1:"b";a:2:{s:1:"a";a:3:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";s:1:"z";s:6:"twelve";}s:1:"b";R:6;}}
*/
It just looks like the serialization is wrong.
So is this serialization wrong? Is it just that I do not know how to parse it? In theory, I should be able to unserialize a serialized array and get back what I started with. So just before I add twelve, let's take it through that cycle:
$arr = [
'a' => ['x' => 'ten', 'y' => 'eleven'],
];
$arr['b'] =& $arr;
$arr = unserialize(serialize($arr));
$arr['a']['z'] = 'twelve';
/*
array(2) {
["a"]=>
array(3) {
["x"]=>
string(3) "ten"
["y"]=>
string(6) "eleven"
["z"]=>
string(6) "twelve"
}
["b"]=>
&array(2) {
["a"]=>
array(2) {
["x"]=>
string(3) "ten"
["y"]=>
string(6) "eleven"
}
["b"]=>
*RECURSION*
}
}
a:2:{s:1:"a";a:3:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";s:1:"z";s:6:"twelve";}s:1:"b";a:2:{s:1:"a";a:2:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";}s:1:"b";R:6;}}
*/
Oh, whoops, where has the second twelve gone? It looks to me like a PHP bug. The serialize is not handling the reference correctly, and so the original array CANNOT be reconstructed from the serialized array. So we have not got a hope in hell's chance of correctly parsing it, since PHP itself can't parse it.
Dhoh. Grrr.
It seems to be a problem only when the reference points to the root array. This behaves entirely as expected:
$arr = [
'a' => ['x' => 'ten', 'y' => 'eleven'],
];
$arr['a']['b'] =& $arr['a'];
/*
array(1) {
["a"]=>
&array(3) {
["x"]=>
string(3) "ten"
["y"]=>
string(6) "eleven"
["b"]=>
*RECURSION*
}
}
a:1:{s:1:"a";a:3:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";s:1:"b";R:2;}}
*/
The PHP manual says this:
You can even serialize() arrays that contain references to itself. Circular references inside the array/object you are serializing will also be stored. Any other reference will be lost.
I suspect that when a reference points to the root of the array, PHP thinks it is an external variable and does not realise the array is referencing itself. So it treats it as an external variable and destroys the reference, turning it into a duplication instead.
We will just have to parse the string as it is presented. The result will be what PHP would parse it as.
I've not tried this with objects (something for another day) but have found an older reference to this bug from 2004, though I'm not sure if it ever got officially reported.
Your code doesn't support recursion. I just stubbed it out like this: