Engelberg / instaparse

Eclipse Public License 1.0
2.74k stars 148 forks source link

extremely slow parse for seemingly simple problem #126

Closed lambda-ai closed 8 years ago

lambda-ai commented 8 years ago

Hi,

sorry to be posting this as a bug instead of a message on the instaparse group: I'm waiting for acceptance of my membership there.

I'm using your fantastic library to parse chess games in PGN format, and most of the time the parsing is fast enough for my purposes (around 50 ms per game), but for some games it's incredibly slow. The game below runs for minutes without terminating. I don't have a clue how to debug this; taking away part of the input made it faster, but I don't understand at all what's special about the pieces that make it slow... Would be great to get some ideas.

My simple grammar is attached (the inline formatting caused all kinds of trouble) pgngrammer.txt

This is the game causing the pain:

[Event "XIII Torneo Magistral de Ajedrez"] [Site "Ciudad de Leon"] [Date "2000.??.??"] [Round "?"] [White "Anand, Vishy"] [Black "Polgar, Judit"] [Result "1/2-1/2"] [ECO "B42"] [EventDate "2000.06.02"] [Annotator "Fco. Sanchez Guirado Sergio Estremera"] [PlyCount "120"] 1. e4 c5 2. Nf3 e6 3. d4 cxd4 4. Nxd4 a6 {La Siciliana Paulsen es la principal defensa de Judith contra 1.e4} 5. Bd3 Qb6 {Esta no es la jugada mas habitual en esta posicion en la que hay abundante teoria de las numerosas alternativas, aunque la rapida salida de la dama a b6 responde a una interpretacion moderna que se encuentra en diversas continuaciones de la defensa siciliana y que pretende dificultar el natural desarrollo de las blancas. La posicion despues de 5...Qb6 es nueva en la practica de ambos protagonistas.} 6. c3 Nc6 {Nos encontramos en un terreno casi virgen para la teoria de aperturas.} 7. Nxc6 Qxc6 8. O-O d6 {54:13 48:48} 9. c4 Nf6 10. Qe2 {Novedad teorica de Anand} ( {se habia jugado.} 10. Nc3 Be7 11. Bg5 b6 12. Bxf6 gxf6 13. Re1 Bb7 14. Rc1 Rg8 15. b4 Kf8 16. Ne2 f5 17. Nd4 Qc7 18. Nf3 fxe4 19. Bxe4 Bxe4 20. Rxe4 Rc8 21. Nd4 Rg6 22. Re1 Bf6 23. g3 Kg8 24. a4 h5 25. b5 Bxd4 26. Qxd4 Rg4 27. Qe3 Qc5 28. Qxc5 Rxc5 29. bxa6 Ra5 30. Red1 Rxa6 31. f4 Rxa4 32. Rxd6 Rb4 33. Kg2 h4 34. Rc3 hxg3 35. hxg3 Kg7 36. Rd7 Rg6 37. Kf3 Rh6 38. Rb7 Kf6 39. Rd7 {- Morovic, I-Panno,O Buenos Aires 1997} ) 10. ... Be7 11. Nc3 {El orden ha sido poco habitual pero el tipo de posicion es bien conocido; las blancas oponen la formacion Maroczy al esquema tipo erizo de la negras.} 11. ... Nd7 12. f4 O-O 13. Be3 b6 14. Rf3 {Mostrando claramente sus intenciones, Anand se aparta de lo que podria ser un desarrollo natural para aproximarse a la posicion de un modo concreto.} 14. ... Bb7 15. Rh3 {Las negras tienen que andarse con cuidado ante golpes como e5 o Nd5} 15. ... Rfe8 {Tipica jugada defensiva de este esquema ahora el caballo podra en caso de necesidad trasladarse a f8 defendiendo h7. 43:24 39:21} 16. Rf1 {Fritz6 todav ia no resulta muy util a los jugadores a la hora de valorar la posicion, pero toda partida desemboca tarde o temprano en una fase de juego forzado y ahi es donde mostrara toda su fuerza, de momento el conocimiento posicional de los humanos es el hilo conductor de la partida.} 16. ... Rac8 {Es curioso senalar que un humano consideraria el sacrificio 17.Nd5. Sin embargo Anand debe forzar al programa para considerar esta posibilidad.Este es un ejemplo claro que demuestra como debe ser el trabajo con los modulos de analisis.} 17. Nd5 $1 {Golpe tematico empieza el juego directo. 33:32 37:20 Judith considera por unos segundos la captura pero se centra enseguida en las dos posibles retiradas de alfil siendo 17...Bd8 su preferida. Judith se ayuda ahora con Fritz6 y Juniror6} 17. ... Bd8 ( {La captura pierde segun Fritz6 que da la siguiente variante.} 17. ... exd5 18. exd5 Qc7 19. Bxh7+ Kf8 20. Bf5 g6 21. Bxd7 Qxd7 22. f5 Bg5 23. Bxg5 Rxe2 24. Bf6 {Ganando} ) 18. Qh5 Nf8 ( {Fritz6 se muestra de nuevo implacable en caso de que se les ocurra capturar la pieza, la siguiente variante es un convincente ejemplo de sus despiadados remates.} 18. ... exd5 19. Qxh7+ Kf8 20. exd5 Qa4 21. Bd4 Bf6 22. Qh8+ Ke7 23. Re3+ Ne5 24. fxe5 $1 Rxh8 25. exf6+ Kd8 26. Bxb6+ Rc7 27. fxg7 Rg8 28. Rxf7 {Con demolicion total} ) 19. Bd4 f6 ( {Si} 19. ... exd5 20. exd5 Qc7 21. Bxh7+ Nxh7 22. Qxh7+ Kf8 23. Qh8+ Ke7 24. Re3+ Kd7 25. Qxe8# {mate} ) 20. e5 {Alexei Shirov opina que la posicion esta ganada por las blancas despues de este golpe.} 20. ... f5 {31:10 16:20 la ventaja de tiempo de Anand es ya bastante importante.} ( {Fritz6 sigue condenando la captura de la pieza y en pocos segundos ofrece el camino hacia el mate.} 20. ... exd5 21. Bxh7+ Nxh7 22. Qxh7+ Kf7 23. Qh5+ Kf8 24. e6 Rxe6 25. Qh8+ Kf7 26. Rh7 Re5 27. Qxg7+ Ke6 28. f5+ Rxf5 29. Qg8# ) 21. exd6 Qd7 {la mas resistente segun Fritz6 aunque es mayoritaria la opinion tanto de modulos como de comentaristas de que las negras solo pueden ya retrasar la derrota} ( {El caballo sigue siendo tabu} 21. ... exd5 22. Bxf5 g6 23. Qh6 { Fuerza un rapido mate} ) 22. Rg3 g6 {28:03 13:40} 23. Qh6 Rc6 ( {El caballo sigue firme en d5 si} 23. ... exd5 24. Bxf5 Ne6 25. Rxg6+ hxg6 26. Bxe6+ Rxe6 27. Qh8+ Kf7 28. Qg7+ Ke8 29. Qg8# ) 24. Re1 {27:33 10:37} 24. ... Rxd6 25. Bxf5 $1 {El golpe que Fritz6 recomienda} 25. ... Qf7 26. Bxg6 $1 Nxg6 27. f5 $1 {la matanza es inminente} 27. ... e5 {Unica para defender g6} 28. Bxe5 Bxd5 {Por fin desaparece el caballo aunque es demasiado tarde.} 29. cxd5 {26:10 7:10} 29. ... Rxe5 30. Rxe5 Rf6 31. Rg5 Be7 {Con materi al de mas y mas tiempo Anand tiene ya el punto practicamente en el bolsillo. 25:23 5:09} 32. g3 { Dando un escape al rey por las casillas blancas Anand amenaza capturar en g6 forzando practicamente el abandono.} 32. ... Bf8 ( {otro intento defensivo es} 32. ... Bc5+ 33. Kg2 Qd7 ) 33. Qh4 Bg7 34. fxg6 Rf1+ 35. Kg2 Rf2+ 36. Kh3 hxg6 ( {la desesperada tambien pierde.} 36. ... Rxh2+ 37. Kxh2 Qf2+ 38. Kh3 Qf1+ 39. Kg4 Qd1+ 40. Kf5 Qc2+ 41. Qe4 {y no existe en jaque perpetuo. } ) 37. Re1 Bf6 {recupera la calidad pero la posicion sigue sin esperanzas a causa de la expuesta posicion de su rey.} ( 37. ... Rxh2+ 38. Kxh2 Qf2+ 39. Kh3 Qxe1 40. Qf4 Qh1+ 41. Kg4 { ganando} ) 38. Qh6 Bxg5 39. Qxg5 {11:32 2:27} 39. ... Qf5+ $1 {unica posibilidad practica ante Qd8} 40. Qxf5 gxf5 {Shirov recomienda de un vistazo lo que va a ocurrir en la partida.....} 41. Re6 Rxb2 42. d6 Rd2 43. d7 Rxd7 44. Rxb6 Rd2 45. Rxa6 Kf7 46. a4 Ra2 {Judith ha logrado conducir la partida a un final de torres donde siempre existen posibilidades salvadoras y mas cuando el apuro de tiempo no permite al bando fuerte precisar.} 47. a5 Ke7 {2:33 1:28 el tiempo se esta acabando.} 48. Kh4 {Anand se ha apurado de tiempo y debe encontrar el camino ganador en un siempre dificil final de torres} 48. ... Rxh2+ {a partir de aqui ambos juegan al toque ya no queda tiempo para mas.} 49. Kg5 Rg2 50. Kf4 Rf2+ 51. Ke5 Rf3 52. Ra7+ Kd8 53. Rg7 Kc8 $1 {una vez queda claro que se van a cambiar los peones del flanco de rey Judith se dirige a parar el peon a.} 54. Kd6 Rd3+ 55. Kc6 Rc3+ 56. Kb6 Rb3+ 57. Ka7 Kd8 58. a6 Kc8 59. Rf7 Rxg3 60. Rxf5 Rb3 {Y tablas. Un meritorio y valiosisimo medio punto para Judith Polgar que ha levantado una posicion por la que nadie daba nada.} 1/2-1/2

Thanks, -Mathias

Engelberg commented 8 years ago

Take a look at tips 5-7 here: https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md#performance-tips

Over 90% of the performance questions I get asked can be addressed by using the insta/parses function to test whether the grammar produces multiple interpretations on a given text, and if so, identifying why and removing that ambiguity. So give that a try first and let me know what you find.

lambda-ai commented 8 years ago

Thanks, on a good chunk of that game (a chunk taking a few seconds), a single parse is returned with insta/parses, also with "unhide all". Even "partial true" only returns a singe parse.

I had been looking at the grammar and added for example the negative lookahead in the regex for Col, but that doesn't seem to help. I imagine the only places that create serious work for the parser are the Moves and Move rules, but I don't have a clue why exactly this game is so bad. It has a lot of comments, but nothing specifically harder than other games that parse fine.

I'm probably just overlooking something stupid :)

lambda-ai commented 8 years ago

ok, if I preprocess the games by removing all the comments, the parsing is very fast. I don't really need the comments, so it's fine, but I wonder what's going on: isn't the regex match supposed to be greedy, so just generating a single alternative. Not sure why that's slowing things down so much...

aengelberg commented 8 years ago

First of all, I'm guessing you're using :auto-whitespace :standard as there is no other explicit whitespace parsing in the parser you provided.

For what it's worth, the parser does complete (on my machine) under a second to find just one parse. To find all parses, however, takes a very long time.

=> (time (do (insta/parse chess-grammar input) nil))
"Elapsed time: 293.826642 msecs"
nil
=> (time (do (count (insta/parses chess-grammar input)) nil))
....takes forever

The problematic line is:

Moves         = (Comment* [MoveNumber] Move Comment* | Line)*

Specifically, your usage of Comment* twice. Here's why: If a comment were to appear between two moves, there are two ways to parse it, either after the first move, or before the second move. These small choices multiply to many possible combinations for large inputs. It's easy to find one parse, but when you fully iterate the insta/parses lazy seq, the engine has to fully resolve all those choices to guarantee you're getting all the solutions (which ends up being no more after the first one).

There are two potential solutions I see:

  1. fix that line to Moves = ([MoveNumber] Move | Line | Comment)* so that each comment is parsed in only one possible place.
  2. (more elegant solution) Write a custom whitespace parser to pass to auto-whitespace, which parses whitespace as well as comments. Then you wouldn't have to worry about them in your parser. The downside is that comments wouldn't show up in your parse tree, but I'm not sure if you care about that or not.

I'm not sure whether this parser counts as "ambiguous", because it has only one possible parse tree, just multiple ways of deriving the subsequences that ultimately get joined together to create said parse tree.

For example, compare the following two parsers on the same input.

(count (insta/parses (insta/parser "S = 'a'* ('b' 'a'*)*") "bbbabbbabbba" :trace true))
(count (insta/parses (insta/parser "S = ('a'* 'b' 'a'*)*") "bbbabbbabbba" :trace true))

These parsers parse equivalent inputs, and these expressions both return 1 as there is indeed only one possible parse tree. The second one, however, has an "internal ambiguity" similar to that in Mathias' grammar, with two ways to handle each a. In the trace output of the second one, if I look at what happens after the Successful parse. line, I see three more instances of this line:

Result for ("a"* "b" "a"*)* at index 0 (bbbabbbabb...) => ("b" "b" "b" "a" "b" "b" "b" "a" "b" "b" "b" "a")

which means that it's finding more ways to parse the same sequence of tokens, that are presumably getting suppressed at the front-end because they are duplicates. So it's just unnecessary time being spent while yielding the same result.

--Alex

lambda-ai commented 8 years ago

Thanks a lot, Alex! I should have figured that one out myself...

It's easy enough to re-write the grammar so the pre-move comments are only considered in specific places, so for sure my problem is solved.

I'm still wondering about your initial comment about the parse finishing on your machine. I double (&triple-) checked on my machine, which is pretty new. I'm running 1.4.1 on java 8, and it still doesn't finish in a time quick enough to bother waiting. I thought it might have been the lack of hotspot because I was running leiningen without any jvm-opts, but adding the "-server" in the jvm-opts also doesn't make a difference, it seems. Is it maybe an improvement in the development version of instaparse you may be running?

Anyway, problem solved for me.

Thanks again to both of you for the fantastic library and the amazing support!

-Mathias