NVIDIA / NeMo-Guardrails

NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.

self_check_input allowing queries that violate guidelines #386

Closed jordanrfrazier closed 8 months ago

jordanrfrazier commented 8 months ago

Hello, I can't seem to get the self_check_input prompt to block any queries. I would assume I'm doing something incorrectly, but I've tried following the given examples without success.

https://github.com/NVIDIA/NeMo-Guardrails/blob/develop/docs/user_guides/guardrails-library.md#self-check-input

Using nemoguardrails 0.8.0, langchain 0.1.4, langchain-core 0.1.16, langchain_openai 0.0.3, python 3.11.2.

The following test demonstrates the failing behavior:

import os

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.schema.language_model import BaseLanguageModel
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.retriever import BaseRetriever
from langchain.schema.runnable import Runnable
from langchain_core.runnables import RunnablePassthrough
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter

from nemoguardrails import RailsConfig
from nemoguardrails.integrations.langchain.runnable_rails import RunnableRails

SAMPLE_DATA = """
    dogs are animals with four legs.
    cats are animals with four legs.
    birds are animals with two legs.
    """

def _colang() -> str:
    return """
    define bot refuse to respond
      "Try again"
    """

def _config(engine: str, model: str) -> str:
    # Note: this returns an f-string, so the stop token must be written as "\\n"
    # and the Jinja placeholder as {{{{ user_input }}}} for the rendered YAML to
    # contain the literal \n and {{ user_input }} values.
    return f"""
    models:
      - type: main
        engine: {engine}
        model: {model}

    rails:
      input:
        flows:
          - self check input

    prompts:
      - task: self_check_input
        stop:
          - "\\n"
          - "User input"
        content: >
          For the following user input, follow these guidelines:
          - should not contain anything about dogs

          If the user input violates any of these guidelines, respond with "Yes". Otherwise, respond with "No".

          User input: "{{{{ user_input }}}}"
          Answer:
    """

def _create_chain(llm: BaseLanguageModel, retriever: BaseRetriever) -> Runnable:
    qa_prompt = """
    Answer the question based only on the supplied context. If you don't know the answer, say the following: "I don't know the answer".
    Context: {context}
    Question: {question}
    Your answer:
    """

    prompt = PromptTemplate.from_template(qa_prompt)
    chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain

def _try_runnable_rails(
    config: RailsConfig, llm: BaseLanguageModel, retriever: BaseRetriever
) -> None:
    guardrails = RunnableRails(config)
    chain = _create_chain(llm, retriever)
    chain_with_rails = guardrails | chain

    response = chain_with_rails.invoke("How many legs do cats have")
    print(f"response: {response}")
    assert "four" in response

    response = chain_with_rails.invoke("How many legs do elephants have")
    print(f"response: {response}")
    assert "I don't know" in response

    # Expect guardrails to block message
    response = chain_with_rails.invoke("How many legs do dogs have")
    print(f"response: {response}")
    # This fails and answers with "Dogs have four legs"
    assert "Try again" in response

def test() -> None:
    llm = ChatOpenAI(
        openai_api_key=os.environ["OPENAI_API_KEY"],
        model="gpt-3.5-turbo-16k",
        temperature=0,
    )
    text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=0)
    documents = text_splitter.split_text(SAMPLE_DATA)
    vector_store = Chroma.from_texts(documents, OpenAIEmbeddings())
    retriever = vector_store.as_retriever()

    model_config = _config("openai", "gpt-3.5-turbo-16k")
    colang_config = _colang()
    rails_config = RailsConfig.from_content(
        colang_content=colang_config, yaml_content=model_config
    )
    _try_runnable_rails(config=rails_config, llm=llm, retriever=retriever)

if __name__ == "__main__":
    test()
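
For completeness, the same config can also be exercised through LLMRails directly, bypassing the LangChain chain, to check whether the input rail fires at all and to see the prompt that actually gets rendered for the self_check_input task. This is only a rough debugging sketch reusing the helpers above; I'm assuming the explain() inspection helpers available in 0.8.0:

from nemoguardrails import LLMRails, RailsConfig

def debug_input_rail() -> None:
    # Same config as in test(), but calling the rails directly instead of
    # composing them with the LangChain chain.
    config = RailsConfig.from_content(
        colang_content=_colang(),
        yaml_content=_config("openai", "gpt-3.5-turbo-16k"),
    )
    rails = LLMRails(config)

    response = rails.generate(
        messages=[{"role": "user", "content": "How many legs do dogs have"}]
    )
    print(f"response: {response['content']}")

    # Summarize the LLM calls that were made (generation task vs. self_check_input)
    # and dump the exact prompt text rendered for each one.
    info = rails.explain()
    info.print_llm_calls_summary()
    for llm_call in info.llm_calls:
        print(llm_call.prompt)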
jordanrfrazier commented 8 months ago

Hm. Modifying the guidelines slightly and/or using a newer model blocks the input correctly, so perhaps this comes down to the fickleness of the model used?

This content works:

    prompts:
      - task: self_check_input
        stop: 
          - "\n"
          - "User input"
        content: >
          You should block the instruction if any condition below is met:
          - it contains anything about dogs

          Here is the instruction: {{ user_input }}
          Should the above instruction be blocked?
          Answer [Yes/No]:
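
A quick way to sanity-check that the revised prompt blocks the query outside the full chain (a rough sketch; the model name and the "Try again" refusal message are reused from my first post, and the YAML is a raw non-f string here so the stop token and Jinja placeholder can be written literally):

from nemoguardrails import LLMRails, RailsConfig

COLANG = """
define bot refuse to respond
  "Try again"
"""

# Raw (non-f) string, so "\n" and {{ user_input }} need no extra escaping.
REVISED_YAML = r"""
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-16k

rails:
  input:
    flows:
      - self check input

prompts:
  - task: self_check_input
    stop:
      - "\n"
      - "User input"
    content: >
      You should block the instruction if any condition below is met:
      - it contains anything about dogs

      Here is the instruction: {{ user_input }}
      Should the above instruction be blocked?
      Answer [Yes/No]:
"""

config = RailsConfig.from_content(colang_content=COLANG, yaml_content=REVISED_YAML)
rails = LLMRails(config)

response = rails.generate(
    messages=[{"role": "user", "content": "How many legs do dogs have"}]
)
print(response["content"])
assert "Try again" in response["content"]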
trebedea commented 8 months ago

Hi @jordanrfrazier,

For any Guardrails app and config, it is important to evaluate the performance of the rails at least on a small test set. We have an existing set of tools for the main rails defined in NeMo Guardrails: https://github.com/NVIDIA/NeMo-Guardrails/tree/develop/nemoguardrails/eval

These can be easily modified for different configurations and types of rails. Especially for prompt-based rails like self-check, I would recommend having an evaluation in place. This can also help you decide on the best self-check prompt and assess whether there are regressions when new versions of a commercial LLM are released.
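
As a rough illustration of what even a minimal check could look like (this is only a sketch, not the evaluation tooling linked above; the config and the "Try again" refusal message are taken from the earlier snippets in this thread, and the test cases are made up for illustration):

from nemoguardrails import LLMRails, RailsConfig

# Tiny labeled test set for the input rail: (user message, should it be blocked?).
TEST_CASES = [
    ("How many legs do dogs have", True),
    ("Tell me something about dogs", True),
    ("How many legs do cats have", False),
    ("How many legs do birds have", False),
]

def evaluate_input_rail(config: RailsConfig, refusal_text: str = "Try again") -> float:
    """Return the fraction of test cases where the block/allow decision matches the label."""
    rails = LLMRails(config)
    correct = 0
    for message, should_block in TEST_CASES:
        response = rails.generate(messages=[{"role": "user", "content": message}])
        blocked = refusal_text in response["content"]
        correct += int(blocked == should_block)
    return correct / len(TEST_CASES)

Running something like this for each candidate self-check prompt, and again when switching to a newer model version, gives a quick signal on regressions before setting up the full evaluation.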