elastic / elasticsearch-php

Official PHP client for Elasticsearch.
https://www.elastic.co/guide/en/elasticsearch/client/php-api/current/index.html
MIT License

Non-Determinism in Seeded PRNG w/ Default Configuration #1124

Closed cjw6k closed 3 years ago

cjw6k commented 3 years ago

Summary of problem or feature request

I can't say enough good things about the PHP client lib, you are all heroes.

PHP has used the Mersenne Twister algo for shuffle() and most of the simple random built-ins, e.g. rand(), array_rand(), etc., since v7.1. Since v2.1.3 [dc91938] the elasticsearch-php client has enabled the connection pool's randomizeHosts option by default. This triggers a call to shuffle() in ConnectionPool\AbstractConnectionPool.php, which seeds the PRNG if it hasn't been seeded already.
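To illustrate the coupling described above, here is a minimal sketch (assuming PHP 7.1 or later) showing that shuffle() is driven by the same global generator that mt_srand() seeds. The helper name `shuffledWithSeed` and the host strings are hypothetical placeholders, not part of the client.

```php
<?php
// Sketch only: shows that shuffle() follows the seed given to mt_srand().
// shuffledWithSeed() and the host names are illustrative assumptions.

function shuffledWithSeed(int $seed): array
{
    mt_srand($seed); // explicitly seed the global Mersenne Twister
    $hosts = ['es1:9200', 'es2:9200', 'es3:9200', 'es4:9200'];
    shuffle($hosts); // draws from the generator seeded above
    return $hosts;
}

$first  = shuffledWithSeed(1234);
$second = shuffledWithSeed(1234);

// Same seed, same order: shuffle() and mt_srand() share one state.
var_dump($first === $second); // bool(true)
```

Without the explicit mt_srand() call, the first shuffle() in a process seeds the generator automatically, which is the situation this issue describes.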

The common usage pattern in Laravel is to register a bunch of Service Providers during initialization. We register one for elasticsearch via the PHP client lib in a Service. The elasticsearch connection is essential and is made immediately with no specified configuration options relevant to this issue.

After it's all initialized, a common code path is to enter a handler for an artisan command on the CLI. My task is to provide such a command to R&D/QA where some data is generated randomly from a recipe shared among the team. The recipe may be supplemented with a PRNG seed to furnish reproducible samples for better discussion and further analysis (desired outcome & issue).

Since the call to shuffle() happens during init in this common usage pattern in Laravel, the PRNG is seeded before control is handed over from init to the console command handler in artisan. It's only my command that uniquely has a reason to force deterministic output from the PRNG, so it's not a simple matter to special-case the elasticsearch-php config for this one command. I don't have control until after the init has completed and the specific command handler is entered.

I may choose to seed the PRNG again at this point, which gives me deterministic output only until the algorithm is reloaded (PHP internals), less than a few hundred uses of random data later. The internal algo reload is tied back to the first time the PRNG was seeded and the seed value used, so it always comes back to non-deterministic output. Bottom line: the elasticsearch-php client lib in its default configuration is the cause of non-determinism in my app.

I'll turn that randomizeHosts option off somehow, but it was really buried in there. It took there-is-no-spoon hours to track down this source of non-determinism. It's not ideal that I need to turn off an option in the client to recover the default behaviour of PHP. Can it be off by default and turned on when we need it, so that we gain awareness of it?

Looking at the docs now, randomizeHosts only shows up in search results for the Ruby client, where it defaults to false / off. The PHP client docs should provide some search targets about non-determinism, the specific string 'randomizeHosts' and similar, to own this default config and its unintended consequences.

Is there an alternative way to achieve the pseudo-randomization of hosts in this client without seeding the MT algo, e.g. something time-based?

❤ from 🌎

System details

ezimuel commented 3 years ago

@cjw6k if I use random_int() to randomize the ConnectionPool instead of shuffle(), I will not use the MT algorithm. Let me know if this will solve your issue. Very interesting issue, BTW! :smile:
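A rough sketch of what a random_int()-based shuffle could look like, using a Fisher-Yates loop. This is not the client's actual implementation. The function name `shuffleWithRandomInt` and the host strings are hypothetical. Since random_int() draws from the system CSPRNG, the seeded Mersenne Twister sequence should be left untouched, which the last lines check.

```php
<?php
// Sketch only: Fisher-Yates shuffle driven by random_int() instead of shuffle().
// shuffleWithRandomInt() is a hypothetical helper, not part of elasticsearch-php.

function shuffleWithRandomInt(array $items): array
{
    $items = array_values($items); // reindex to 0..n-1
    for ($i = count($items) - 1; $i > 0; $i--) {
        $j = random_int(0, $i); // CSPRNG, independent of mt_srand()/mt_rand()
        [$items[$i], $items[$j]] = [$items[$j], $items[$i]];
    }
    return $items;
}

// Reference: the first three values of the seeded MT sequence.
mt_srand(42);
$expected = [mt_rand(), mt_rand(), mt_rand()];

// Same seed, but shuffle the hosts before drawing from the MT generator.
mt_srand(42);
$hosts  = shuffleWithRandomInt(['es1:9200', 'es2:9200', 'es3:9200']);
$actual = [mt_rand(), mt_rand(), mt_rand()];

// The host shuffle consumed nothing from the seeded sequence.
var_dump($expected === $actual); // bool(true)
```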

cjw6k commented 3 years ago

🤦

"I may choose to seed the PRNG again at this point, which gives me deterministic output ..." -- the entire time and this is a non-issue. What a day. 😆
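For anyone landing here later, a small sketch that checks this conclusion: an earlier unseeded shuffle() (standing in for the client's init) followed by an explicit mt_srand() still yields a reproducible sequence. The helper name `sample`, the seed, and the draw count are arbitrary assumptions. 2000 draws is chosen only to go well past the generator's 624-word internal state, so at least one internal reload happens.

```php
<?php
// Sketch only: re-seeding after an earlier automatic seed restores determinism.
// sample(), the seed 2021, and the count 2000 are illustrative assumptions.

// Stand-in for library init: an unseeded shuffle() auto-seeds the generator.
$hosts = ['es1:9200', 'es2:9200', 'es3:9200'];
shuffle($hosts);

function sample(int $seed, int $count): array
{
    mt_srand($seed); // re-seed, discarding whatever state init left behind
    $values = [];
    for ($i = 0; $i < $count; $i++) {
        $values[] = mt_rand(0, 999);
    }
    return $values;
}

$runA = sample(2021, 2000);
$runB = sample(2021, 2000);

// Identical across runs, including past the internal state reload.
var_dump($runA === $runB); // bool(true)
```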