academe / SerializeParser

A PHP parser for serialized data, to be able to "peek" into serialize strings.
MIT License

Recursion #5

Open jj5 opened 7 years ago

jj5 commented 7 years ago

Your code doesn't support recursion. I just stubbed it out like this:

case 'r' :
case 'R' :

  $reader->readUntil( ';' );
  $val = '**RECURSION**';

  break;

judgej commented 7 years ago

If you have an example of a data structure we can push through as a test, we'll get this included.

judgej commented 7 years ago

Some details on the reference types (r and R) here:

http://www.phpinternalsbook.com/classes_objects/serialization.html

Example object:

$obj = new stdClass;
$obj->p1 = 'abc';
$obj->p2 = $obj;
$obj->p3 =& $obj->p1;

echo serialize($obj);

// O:8:"stdClass":3:{s:2:"p1";s:3:"abc";s:2:"p2";r:1;s:2:"p3";R:2;}

p3 is a reference to the second value in the structure (p1). p2 is a reference to the first value (which is the whole thing). I think r is used for object references and R for explicit =& references.

I think this parser should not just declare "recursion" and walk away. So long as it can keep track of each value it encounters, then it should be able to link the references properly, so the final parsed structure will have its own proper recursion in it.
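As a quick check of what "proper recursion" means here, the following sketch round-trips the example object through PHP's own serialize() and unserialize() and tests whether both kinds of link survive. The variable name $copy is only for illustration.

```php
<?php
// Rebuild the example object from the comment above.
$obj = new stdClass;
$obj->p1 = 'abc';
$obj->p2 = $obj;        // object handle pointing back at the root (serialized as r:1)
$obj->p3 =& $obj->p1;   // explicit PHP reference (serialized as R:2)

// $copy is a hypothetical name for the round-tripped structure.
$copy = unserialize(serialize($obj));

// If the r link was restored, p2 is the very same object as the root.
var_dump($copy->p2 === $copy);

// If the R link was restored, writing through p3 also changes p1.
$copy->p3 = 'changed';
var_dump($copy->p1);
```

A parser that links references properly should produce a structure that passes the same two checks.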

judgej commented 7 years ago

Indexing a reference to each value is easy enough. The difficulty comes in the order in which they are parsed. If a reference points forward to an element that has not yet been parsed, then we need to hold on to it and link it up later. So back-references can be created immediately, but forward references would have to be kept and linked once the target has been parsed (or left until the end). A two-pass parse could also work, but is unnecessary IMO.

judgej commented 7 years ago

I'm working on this in the background. The approach I'm taking is:

  1. As each data item is parsed, add it to a numbered list. The path to each item (a list of object property or array key names that point at the value) is kept for each data item. It is key that the list is built in the correct order, the same order as they are encountered in the serialized string.
  2. Encountering a reference (R or r) will result in the reference number being stored as an intermediate (temporary) value and the path to the reference being stored in a list like the values. The list does not need to be ordered in that case - just a stack.
  3. Once the full data structure has been parsed and built, the references can be put in. Given the two lists built up, we know where the references are and what they point to numerically. The reference number can be looked up in the first ordered list to get the path, which is then navigated to and a reference can replace the numeric value. The reference will be either a full reference (=& $scalar) or a reference-link pointer to an object (= $object).
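The three steps above can be sketched roughly as follows. This is not the library's code: the class name ReferenceLinkingSketch and all of its methods are made up for illustration, it only understands i, s, a and R entries inside arrays, and it assumes that an R entry does not take a slot number of its own while every other value does.

```php
<?php
// Illustrative sketch only; objects, r entries and error handling are left out.
final class ReferenceLinkingSketch
{
    private $data;
    private $pos = 0;
    private $slots = [];    // step 1: slot number => path (list of array keys)
    private $pending = [];  // step 2: list of [path of the reference, target slot]

    public function __construct(string $data)
    {
        $this->data = $data;
    }

    public function parse()
    {
        $root = $this->parseValue([]);
        // Step 3: the structure is complete, so replace placeholders with references.
        foreach ($this->pending as [$path, $target]) {
            $key = array_pop($path);
            $parent =& $this->locate($root, $path);
            $parent[$key] =& $this->locate($root, $this->slots[$target]);
            unset($parent);
        }
        return $root;
    }

    private function parseValue(array $path)
    {
        $type = $this->data[$this->pos];
        $this->pos += 2; // skip the type letter and the colon
        if ($type === 'R') {
            // Assumption: R entries are not numbered themselves.
            $this->pending[] = [$path, (int) $this->readUntil(';')];
            return null; // temporary placeholder
        }
        $this->slots[count($this->slots) + 1] = $path;
        if ($type === 'a') {
            $count = (int) $this->readUntil(':');
            $this->pos++; // skip "{"
            $array = [];
            for ($i = 0; $i < $count; $i++) {
                $keyType = $this->data[$this->pos];
                $this->pos += 2;
                $key = $this->parseScalar($keyType); // keys do not take a slot
                $array[$key] = $this->parseValue(array_merge($path, [$key]));
            }
            $this->pos++; // skip "}"
            return $array;
        }
        return $this->parseScalar($type);
    }

    private function parseScalar(string $type)
    {
        if ($type === 'i') {
            return (int) $this->readUntil(';');
        }
        if ($type === 's') {
            $length = (int) $this->readUntil(':');
            $string = substr($this->data, $this->pos + 1, $length); // +1 skips the opening quote
            $this->pos += $length + 3; // both quotes and the trailing ";"
            return $string;
        }
        throw new RuntimeException("Unsupported type: $type");
    }

    private function readUntil(string $char): string
    {
        $end = strpos($this->data, $char, $this->pos);
        $text = substr($this->data, $this->pos, $end - $this->pos);
        $this->pos = $end + 1;
        return $text;
    }

    // Walk a path of array keys and hand back a reference to the value found there.
    private function &locate(&$root, array $path)
    {
        $node =& $root;
        foreach ($path as $key) {
            $node =& $node[$key];
        }
        return $node;
    }
}

$parsed = (new ReferenceLinkingSketch('a:2:{s:1:"a";s:3:"one";s:1:"b";R:2;}'))->parse();
$parsed['b'] = 'two';
var_dump($parsed['a']); // shares its value with 'b' if the link was made
```

Deferring all linking to the end keeps the parsing pass simple, at the cost of storing a path for every value.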

Notes:

Hopefully that's clear. I'm just posting this to avoid duplicated effort, and to show it's not that simple (probably why none of the C/Python/C# libraries I've found for tackling this have even attempted recursion on the source data).

judgej commented 7 years ago

Been playing with the recursion over the weekend, and it seems that the way PHP serializes it is rather bizarre. Take this as an example:

$arr = [
    'a' => 'one',
];

$arr['b'] =& $arr['a'];

var_dump($arr);
echo serialize($arr);

/*
array(2) {
  ["a"]=>
  &string(3) "one"
  ["b"]=>
  &string(3) "one"
}
a:2:{s:1:"a";s:3:"one";s:1:"b";R:2;}
*/

Here element b references element a. The var_dump shows the value of element b since that is what it contains (a and b share the same source data). The serialize does something different - it provides the value for a - the first time the shared data is encountered. The second time it is encountered, it is shown as a hard reference to data item number 2 (the whole array is item number 1, and a is item number 2). This is exactly what I would expect, and that's easy enough to parse.
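A small way to confirm that reading of R:2 is to feed the serialized string from the example straight back into PHP's unserialize() and write to one of the two elements. The variable name $copy is just for illustration.

```php
<?php
// The serialized string from the example above.
$copy = unserialize('a:2:{s:1:"a";s:3:"one";s:1:"b";R:2;}');

// If R:2 really links 'b' to 'a', a write through 'b' is visible through 'a'.
$copy['b'] = 'two';
var_dump($copy['a']);
```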

judgej commented 7 years ago

If a is itself an array, then it works just the same:

$arr = [
    'a' => ['x' => 'ten', 'y' => 'eleven'],
];

$arr['b'] =& $arr['a'];

/*
array(2) {
  ["a"]=>
  &array(2) {
    ["x"]=>
    string(3) "ten"
    ["y"]=>
    string(6) "eleven"
  }
  ["b"]=>
  &array(2) {
    ["x"]=>
    string(3) "ten"
    ["y"]=>
    string(6) "eleven"
  }
}
a:2:{s:1:"a";a:2:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";}s:1:"b";R:2;}
*/

Note that the "ten" and "eleven" elements are shown twice in the var_dump() for convenience, but only appear once in the serialization. Again, simple to parse.

judgej commented 7 years ago

Now this is where it starts to get crazy with the recursion. If b references the root array rather than the a element, this is what happens:

$arr = [
    'a' => ['x' => 'ten', 'y' => 'eleven'],
];

$arr['b'] =& $arr;

/*
array(2) {
  ["a"]=>
  array(2) {
    ["x"]=>
    string(3) "ten"
    ["y"]=>
    string(6) "eleven"
  }
  ["b"]=>
  &array(2) {
    ["a"]=>
    array(2) {
      ["x"]=>
      string(3) "ten"
      ["y"]=>
      string(6) "eleven"
    }
    ["b"]=>
    *RECURSION*
  }
}
a:2:{s:1:"a";a:2:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";}s:1:"b";a:2:{s:1:"a";a:2:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";}s:1:"b";R:5;}}
*/

Again, the var_dump() shows the value that is referenced, and recognises where recursion occurs and labels it appropriately. That's good and consistent with the previous examples.

But now look at the serialized string. Suddenly the source (root) array is being replicated - it is NOT a reference any more. The b inside that duplicate is a reference, but to the copy of the root array (data item number 5) and not to the root array itself (data item number 1).

If I add an extra element to the a array then I see it appear twice, so internally the data is a reference:

$arr['a']['z'] = 'twelve';

/*
array(2) {
  ["a"]=>
  array(3) {
    ["x"]=>
    string(3) "ten"
    ["y"]=>
    string(6) "eleven"
    ["z"]=>
    string(6) "twelve"
  }
  ["b"]=>
  &array(2) {
    ["a"]=>
    array(3) {
      ["x"]=>
      string(3) "ten"
      ["y"]=>
      string(6) "eleven"
      ["z"]=>
      string(6) "twelve"
    }
    ["b"]=>
    *RECURSION*
  }
}
a:2:{s:1:"a";a:3:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";s:1:"z";s:6:"twelve";}s:1:"b";a:2:{s:1:"a";a:3:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";s:1:"z";s:6:"twelve";}s:1:"b";R:6;}}
*/

It just looks like the serialization is wrong.

judgej commented 7 years ago

So is this serialization wrong? Is it just that I do not know how to parse it? In theory, I should be able to unserialize a serialized array and get back what I started with. So just before I add twelve, let's take it through that cycle:

$arr = [
    'a' => ['x' => 'ten', 'y' => 'eleven'],
];

$arr['b'] =& $arr;

$arr = unserialize(serialize($arr));

$arr['a']['z'] = 'twelve';

/*
array(2) {
  ["a"]=>
  array(3) {
    ["x"]=>
    string(3) "ten"
    ["y"]=>
    string(6) "eleven"
    ["z"]=>
    string(6) "twelve"
  }
  ["b"]=>
  &array(2) {
    ["a"]=>
    array(2) {
      ["x"]=>
      string(3) "ten"
      ["y"]=>
      string(6) "eleven"
    }
    ["b"]=>
    *RECURSION*
  }
}
a:2:{s:1:"a";a:3:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";s:1:"z";s:6:"twelve";}s:1:"b";a:2:{s:1:"a";a:2:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";}s:1:"b";R:6;}}
*/

Oh, whoops, where has the second twelve gone? It looks to me like a PHP bug. The serialize is not handling the reference correctly, and so the original array CANNOT be reconstructed from the serialized string. So we have not got a hope in hell of correctly parsing it, since PHP itself can't parse it.

Dhoh. Grrr.

judgej commented 7 years ago

It seems to be a problem only when the reference points to the root array. This behaves entirely as expected:

$arr = [
    'a' => ['x' => 'ten', 'y' => 'eleven'],
];

$arr['a']['b'] =& $arr['a'];

/*
array(1) {
  ["a"]=>
  &array(3) {
    ["x"]=>
    string(3) "ten"
    ["y"]=>
    string(6) "eleven"
    ["b"]=>
    *RECURSION*
  }
}
a:1:{s:1:"a";a:3:{s:1:"x";s:3:"ten";s:1:"y";s:6:"eleven";s:1:"b";R:2;}}
*/

judgej commented 7 years ago

The PHP manual says this:

You can even serialize() arrays that contain references to itself. Circular references inside the array/object you are serializing will also be stored. Any other reference will be lost.

I suspect that when a reference points at the root of the array, PHP thinks it is pointing at an external variable and does not realise the array is referencing itself. So it treats the target as an external variable and destroys the reference, turning it into a duplication instead.

We will just have to parse the string as it is presented. The result will be what PHP would parse it as.
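For anyone wanting to check how their own PHP version behaves, a script along these lines reproduces the root-reference case from this thread and reports whether the link between the root and b survives a round trip. The variable names are illustrative, and the result may differ between PHP versions.

```php
<?php
// Build the array from the thread: 'b' is a reference to the root array itself.
$arr = ['a' => ['x' => 'ten', 'y' => 'eleven']];
$arr['b'] =& $arr;

// Round trip into a separate variable so the original stays untouched.
$copy = unserialize(serialize($arr));

// Add a marker through the top-level 'a', then look for it through 'b'.
$arr['a']['z'] = 'twelve';
$copy['a']['z'] = 'twelve';

$linkedBefore = isset($arr['b']['a']['z']);   // original structure
$linkedAfter  = isset($copy['b']['a']['z']);  // after serialize/unserialize

var_dump($linkedBefore, $linkedAfter);
```

If the two values differ, the serialized string has lost the reference to the root, and a parser can only reproduce what PHP itself would rebuild from it.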

judgej commented 7 years ago

I've not tried this with objects (something for another day) but have found an older reference to this bug from 2004, though I'm not sure if it ever got officially reported.