Genkit is an open-source framework for building AI-powered apps with familiar, code-centric patterns. It makes it easy to develop, integrate, and test AI features with built-in observability and evaluation, and it works with a variety of models and platforms.
This text reveals several areas where the documentation for Genkit, particularly around evaluation, could be improved:
* **Clarify how evaluators are standardized.** The text acknowledges that while evaluation metrics like Faithfulness and Answer Relevance are becoming standardized, their implementations can vary. The documentation should provide more concrete information on this, perhaps by:
  * Giving specific examples of how implementations can differ.
  * Offering guidance on choosing the best implementation for different use cases.
  * Explaining how Genkit handles these variations to ensure consistency.
* **Provide more guidance on quantifying output variables.** The text mentions that users can define custom evaluation metrics, but it should offer more support on how to do this effectively. Consider adding:
  * Examples of quantifying different types of outputs.
  * Best practices for designing custom metrics.
  * A step-by-step guide to implementing custom evaluators.
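To make the "quantifying outputs" point concrete, here is a minimal sketch of turning a free-text answer into a score. Nothing below uses Genkit's actual evaluator API; the keyword-overlap heuristic and all names (`tokenize`, `keywordOverlapScore`) are hypothetical, shown only to illustrate one way a qualitative output can be reduced to a number in [0, 1]:

```typescript
// Hypothetical sketch: scoring answer relevance as keyword overlap.
// This is NOT Genkit's API -- just an illustration of quantifying text output.

function tokenize(text: string): Set<string> {
  // Lowercase, split on non-word characters, drop very short tokens.
  return new Set(
    text.toLowerCase().split(/\W+/).filter((w) => w.length > 2),
  );
}

// Fraction of question keywords that also appear in the answer.
function keywordOverlapScore(question: string, answer: string): number {
  const q = tokenize(question);
  const a = tokenize(answer);
  if (q.size === 0) return 0;
  let hits = 0;
  for (const w of q) if (a.has(w)) hits++;
  return hits / q.size;
}

// Example: 3 of the 4 question keywords recur in the answer -> 0.75.
console.log(
  keywordOverlapScore(
    "What is the capital of France?",
    "The capital of France is Paris.",
  ),
);
```

A real custom metric would likely be LLM-graded rather than lexical, but the shape is the same: a pure function from (input, output, optional context) to a bounded numeric score, which is exactly what the documentation could walk users through.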
* **Expand on the scope of pre-defined evaluators.** Users need a clearer understanding of what metrics like "Maliciousness" actually measure. The documentation should:
  * Provide detailed explanations of each pre-defined metric.
  * Clarify which RAGAS metrics are included in Genkit.
  * Offer examples of how these metrics are used in practice.
* **Improve the description of "Maliciousness".** The current explanation is vague. The documentation should clearly define what constitutes "maliciousness" in the context of LLMs and how the evaluator identifies it.
* **Clarify the analogy to testing.** While the text likens evaluators to E2E testing, it could be more explicit about how they fit into the development process. This could involve:
  * Explaining when and how to use evaluators during development.
  * Providing examples of how evaluators can help identify regressions.
  * Discussing how evaluators can be integrated into a CI/CD pipeline.
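As a sketch of the regression/CI point above: a pipeline step can compare evaluation scores against a fixed baseline and flag any test case that drops below it. The `EvalResult` shape, the names, and the 0.8 threshold here are assumptions for illustration, not Genkit's actual output format:

```typescript
// Hypothetical CI regression gate over per-test-case evaluation scores.

interface EvalResult {
  testCaseId: string;
  score: number; // assumed to be normalized to [0, 1]
}

// Returns the IDs of test cases whose score fell below the threshold.
function findRegressions(results: EvalResult[], threshold: number): string[] {
  return results
    .filter((r) => r.score < threshold)
    .map((r) => r.testCaseId);
}

const results: EvalResult[] = [
  { testCaseId: "faq-1", score: 0.92 },
  { testCaseId: "faq-2", score: 0.61 },
];

const regressions = findRegressions(results, 0.8);
if (regressions.length > 0) {
  // In a real CI job this is where the build would be failed
  // (e.g. by exiting non-zero).
  console.error(`Regressions detected: ${regressions.join(", ")}`);
}
```

Documenting this pattern — run the evaluation suite, diff scores against a baseline, fail the build on a drop — would make the E2E-testing analogy actionable.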
By addressing these points, the documentation can better support users in understanding and effectively using Genkit's evaluation features.
Autogenerated from Gemini:
Context: https://discord.com/channels/1255578482214305893/1281391213550895124/1282325935038926868