braoult / chessParser

pgn to json/php array parser
GNU Lesser General Public License v3.0

cleanPgn() is messy #1

Closed braoult closed 1 year ago

braoult commented 1 year ago

The function tries to edit the PGN data with regexes. It appears that:

  1. Adds a \n between ] and [ when they are separated by 0 to 10 (!) blank characters. Should be simplified with `\s*`. This will break PGN if the movetext section contains a line like `{ [action "arg"]` followed by a line like `[action xxx]`.
    $c = preg_replace('/"\]\s{0,10}\[/s', "\"]\n[", $c);
  2. Sets two newlines between the tag and movetext sections (when separated by 0 to 10 (!) blank characters). Should be simplified with `\s*`. This will break PGN if we have a movetext section like: `{ [%command "string"] 1.e4 }`.
    $c = preg_replace('/"\]\s{0,10}([\.0-9]|{)/s', "\"]\n\n$1", $c);
  3. Removes comments starting with [%emt, but not other similar constructions, like [%clk, etc. This rule should be removed.
    $c = preg_replace("/{\s{0,6}\[%emt[^\}]*?\}/", "", $c);
  4. Adds a space between ( and {. If the parser is correct, this should not be needed.
    $c = str_replace("({", "( {", $c);
  5. Supposed to match { XXX[ YYY}. However, it only works if strlen(YYY) is zero or one, and I don't see any reason for this to happen. It also fails if the comment is split across lines. The first [ is replaced by -SB-. Reverted in rule 11.
    $c = preg_replace("/{([^\[]*?)\[([^}]?)}/s", '{$1-SB-$2}', $c);
  6. Removes tabs and carriage returns.
    $c = preg_replace("/\r/s", "", $c);
    $c = preg_replace("/\t/s", "", $c);
  7. Looks like a duplicate of rule 1.
    $c = preg_replace("/\]\s+\[/s", "]\n[", $c);
  8. Removes one space before [. Why only one?
    $c = str_replace(" [", "[", $c);
  9. Sets two \n between a non-tag line and a tag line. This will break PGN if a [...] within a comment starts a line after a line ending with anything other than ], for example a dot or a digit: a line ending with { 1.e4 followed by a line like [%action ...]}.
    $c = preg_replace("/([^\]])(\n+)\[/si", "$1\n\n[", $c);
  10. Max 2 consecutive newlines.
    $c = preg_replace("/\n{3,}/s", "\n\n", $c);
  11. Revert rule 5.
    $c = str_replace("-SB-", "[", $c);
  12. Writes castling correctly with O (the letter, not zeros).
    $c = str_replace("0-0-0", "O-O-O", $c);
    $c = str_replace("0-0", "O-O", $c);
  13. Removes anything before the first [. This will break PGN if the first game has no tags.
    $c = preg_replace('/^([^\[])*?\[/', '[', $c);
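
To make two of the failure cases above concrete, here is a minimal sketch that runs rules 2 and 13 exactly as quoted on small inputs. The sample strings and variable names are hypothetical and only for illustration; they are not taken from the repository.

```php
<?php
// Rule 2, as quoted in the list above: forces a blank line after a closing
// '"]' that is followed by a digit, a dot or an opening brace.
$rule2 = '/"\]\s{0,10}([\.0-9]|{)/s';

// Hypothetical movetext: a comment holding a bracketed command, then a move.
$movetext = '{ [%command "string"] 1.e4 }';
$out = preg_replace($rule2, "\"]\n\n$1", $movetext);

// Rule 13, as quoted in the list above: drops everything before the first '['.
$rule13 = '/^([^\[])*?\[/';

// Hypothetical input: the first game has no tag section, the second one has.
$games = "1. e4 e5 *\n\n[Event \"Second\"]\n\n1. d4 *";
$trimmed = preg_replace($rule13, '[', $games);

// json_encode() is used only to make the newlines visible when printing.
echo json_encode($out), "\n";
echo json_encode($trimmed), "\n";
```

With rule 2, the comment is split in two by a blank line inserted after `"]`. With rule 13, the whole first game disappears, because the first `[` in the input belongs to the second game's tags.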
braoult commented 1 year ago

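Fixed with #5ca6c49. To detect brackets within comments, I use a PCRE negative lookahead, such as: \[(?!(?:{[^}]*}|[^}])*$). The lookahead fails only when the rest of the input can be read as complete {...} groups and characters other than }, so the [ is matched exactly when an unmatched } follows it, that is, when it sits inside a comment. See: this answer on Stack Overflow.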
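
A minimal sketch of how that pattern behaves, assuming comments are not nested (which the pattern itself assumes). The sample PGN string and the variable names are hypothetical, and the -SB- placeholder is only borrowed from old rule 5 for illustration. This is not necessarily how the commit uses the pattern.

```php
<?php
// Pattern quoted in the comment above: matches a '[' only when an unmatched
// '}' follows it, i.e. when the bracket sits inside a { ... } comment.
$inComment = '/\[(?!(?:{[^}]*}|[^}])*$)/';

// Hypothetical sample: one tag pair, then movetext with a clock comment.
$pgn = "[Event \"Test\"]\n\n1. e4 { [%clk 0:05:00] } e5 *";

// Mask only the brackets found inside comments; tag brackets stay untouched.
$masked = preg_replace($inComment, '-SB-', $pgn);

// Count how many brackets were identified as being inside a comment.
$count = preg_match_all($inComment, $pgn);

echo $masked, "\n";
```

Here the `[` of the `[Event ...]` tag is left alone, while the `[` of `[%clk ...]` inside the comment is the only one matched. Each lookahead scans to the end of the input, so on very large files it may be worth measuring the cost before relying on it.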