Open aimichal opened 4 years ago
Some thoughts:
There are two pods per model deployed for prod. This is nice for availability (if the workloads must be re-arranged among cluster nodes), for surges in traffic, and for smooth deployments. Having only one pod per model might hide the non-determinism, but it would make the service vulnerable to outages during cluster workload re-arrangements, and non-determinism would then surface right after deployments, making it look like a regression. So I don't advise reducing the replica count from 2 to 1.
The pods backing these servers have each been up for 9 days. Is that unusual? Maybe state accumulated in a bad way.
Maybe the random seed isn't initialized. If it isn't already being called, could `prepare_environment()` help? https://github.com/allenai/allennlp/blob/87c23e4a7e25d81e91ba8234eef38b5395993c51/allennlp/common/util.py#L218-L247
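As a toy illustration of why seeding would make the two pods agree, here is a minimal sketch using only stdlib `random` as a stand-in for the model's randomness source (`prepare_environment()` in the linked code additionally seeds `numpy` and `torch` the same way; `seeded_predict` and the seed value are hypothetical):

```python
import random

def seeded_predict(seed):
    # Hypothetical stand-in for a model forward pass that consumes
    # randomness. Re-seeding before each call makes the draws, and
    # hence the "prediction", fully reproducible.
    random.seed(seed)
    return [random.random() for _ in range(3)]

# Two pods that both seed on startup produce identical outputs:
pod_a = seeded_predict(13370)
pod_b = seeded_predict(13370)
assert pod_a == pod_b

# Pods that never seed (or seed differently) diverge:
assert seeded_predict(1) != seeded_predict(2)
```

If both pods called `prepare_environment()` with the same seeds before serving, any randomness of this kind would stop varying between replicas.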
While non-determinism may be unavoidable in the general case, for the demo we should find a way to make it consistent, so it doesn't look broken.
For unavoidably non-deterministic models, there should be a disclaimer on the demo page inviting people to click "Run" several times to get a sense of the variability.
There are two pods running the NAQANet model, and each consistently returns a different result. For example, pod allennlp-demo-api-naqanet-prod-597f89dc7f-7bwhw returns answer.value = `game, players direct a nasce...`, while pod allennlp-demo-api-naqanet-prod-597f89dc7f-hq96x returns answer.value = `real-life orbital maneuvers`. As a result, querying the load-balanced endpoint https://demo.allennlp.org/api/naqanet/predict returns either of these two results, depending on which pod handles the request.
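One way to confirm the pod-level divergence without kubectl access is to collect repeated responses from the load-balanced endpoint and count the distinct answers. A minimal sketch (the `answer.value` path is taken from the fields quoted above; the sample responses mirror the two pod answers reported, and `answer_distribution` is a hypothetical helper, not part of the demo API):

```python
from collections import Counter

def answer_distribution(responses):
    """Count distinct answer.value strings across repeated /predict calls.

    `responses` is a list of parsed JSON response bodies.
    """
    return Counter(r["answer"]["value"] for r in responses)

# Sample data mirroring the two pod responses reported in this issue:
samples = [
    {"answer": {"value": "real-life orbital maneuvers"}},
    {"answer": {"value": "game, players direct a nasce..."}},
    {"answer": {"value": "real-life orbital maneuvers"}},
]

dist = answer_distribution(samples)
assert len(dist) == 2  # two pods, two stable-but-different answers
```

If the model were deterministic, repeated calls with the same input would collapse this distribution to a single answer regardless of which pod serves the request.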