Job generation service - Githubissues

SatyamMattoo commented 3 months ago

Short description

Integrated a job generation service which will generate job expressions based on user instructions.

Implementation details

The job generation service takes a text instruction, adaptor, expression, state and some metadata (like the API key) and returns the user a job expression.

The input payload should be:

{
    "api_key": "<OpenAI api key>",
    "existing_expression": "Your existing job expression",
    "adaptor": "@openfn/language-dhis2@4.0.3",
    "state": "Current state",
    "instruction": "A simple text instruction."
}

Tasks

[x] Set up a basic job generation service.
[x] Integrate RAG service to retrieve relevant docs.
[x] Integrate describe_adaptor service to add context about the adaptor
[x] Use own model client instead of the inference engine
[x] Adding visualization on the context drawn by the model
[x] Making the embedding service optional
[ ] Optimize RAG and prompt generation

This service currently uses the RAG service along with the describe adaptor service to add context to the prompts and improve the results.

josephjclark commented 3 months ago

@SatyamMattoo Please summarise the outstanding work here in a list in the PR description. See the tasks list here for inspiration: https://github.com/OpenFn/apollo/pull/44

I also want you to think about how you will show me the results of job generation on different inputs. You've shown me one example but we need a system where it's easy to show the current state of the PR.

Samples saved in the repo, comments in the PR thread, examples in the PR description - all valid, but we need one solution please.

I will try and get you a nice list of inputs and sample outputs so that we can look at a wider range.

SatyamMattoo commented 3 months ago

@josephjclark Sure sir, once I get a set of inputs and expected outputs, I can either add a script to run and test them or showcase the results in the comments (or both).

josephjclark commented 3 months ago

@SatyamMattoo Start with the sample you DO have, work out a system, and we can extend it to more when I work out a nice set for you.

And drop the "sir" please, you're making me feel old!

josephjclark commented 3 months ago

Hi @SatyamMattoo, here are some job descriptions for you:

Create a job which filters an array of Commcare visit events. The visits will be passed downstream in state.data. Sort the events into two lists of patients: those with an IHS number (defined by the key 'ihs_number', and those without. Save the two arrays to state, remove any other data, and return the state for the next job.

You can use this is an input state to help it:

{
  "data": {
    "attachments": {},
    "case_id": "17cf848d-a356-4e43-8d92-xxxx",
    "ihs_number": "9539-91250-12490",
    "closed": false,
    "date_closed": null,
    "date_modified": "2024-04-18T11:37:52.884000Z",
    "domain": "bob",
    "indices": {
      "parent": {
        "case_id": "6a3c4bd2-f660-4d0b-8393-yyyy",
        "case_type": "orang",
        "relationship": "child"
      }
    },
    "properties": {
      "next_step": "no_follow_up",
      "gender": "male",
      "visit_type": "doctor_consult",
      "where_did_the_visit_take_place": "mobile_clinic",
      "case_type": "kunjungan",
      "site_id": "93a719f3-9aa7-4fdc-bece-xxxx",
      "visit_date": "2024-04-18",
      "how_many_alcoholic_drinks_per_week": "one_to_three",
      "date_opened": "2024-04-18T11:37:52.884",
      "prescription_5_amount": "",
      "invoice_amount": "79000",
      "provider": "Provider_Name_From_Staff_Lookup_Table",
      "tuberculosis": "",
      "prescription_5": "",
      "non-prescribed_drug_use": "never_a_drug_user",
      "allergies": "",
      "prescription_5_dose": "",
      "external_id": null,
      "dusun_name": "Cali TEST",
      "location_of_mobile_clinic": "Sempurna",
      "mental_health_next_hg_visit_date": "2024-05-18",
      "hepatitis": "",
      "prescription_2": ""
    },
    "server_date_modified": "2024-04-18T11:37:52.944381Z",
    "user_id": "aaaaabbbbbccccddd",
    "xform_ids": [
      "fa8031ac-da15-4a59-833a-waffle"
    ]
  }

}

I think this is test data but I've removed some noise and scrubbed some ids just incase

2.Given a payload of metadata about fridge operating temperatures, aggregate all the records belonging to each fridge. The fridge id is in the LSER field. There may be hundreds of items in the data, with dozens of records per item. Save the list of temperatures (TVC) for each fridge in an object on state, like `{ records: { "406c9f14667442a7924fbe6ac8b98185": [6.9, 6.9, 7.0] } }`. Once the data has been aggregated, upload it to redis using the fridge id and date (ADOP) as the key, like "<fridge-id>:<ADOP>"

The input data would look like this:

{
    "data": [
        {
            "ADOP": "2019-11-09",
            "AMFR": "Wheatsheaf Industries",
            "EMSV": "0.1.x",
             "LSER": "406c9f14667442a7924fbe6ac8b98185",
            "records": [
                {
                    "ABST": "20240524T172814Z",
                    "BEMD": 14.4,
                    "HAMB": 75.6,
                    "TVC": 6.9,
                    "TCON": 26.9
                },
                {
                    "ABST": "20240524T173814Z",
                    "BEMD": 14.4,
                    "HAMB": 75.7,
                    "TVC": 6.9,
                    "TCON": 26.8
                },
                {
                    "ABST": "20240524T174814Z",
                    "BEMD": 14.4,
                    "HAMB": 75.9,
                    "TVC": 6.9,
                    "TCON": 26.8
                },
            ]
        }
    ]
}

The redis adaptor is brand new, there's a 0% chance that the model has been trained on it (aside from what goes in the RAG).

I've given verbose descriptions here - I'd be interested to know how the model performs if some details are removed from th user prompts.

SatyamMattoo commented 3 months ago

Hey @josephjclark, For the first input,

Results with embeddings:

Screenshot 2024-08-14 214400

Results without embeddings:

Screenshot 2024-08-14 214542

Results when state is not provided:

For the second one,

Results with embeddings:

Screenshot 2024-08-14 215048

Results without embeddings:

Screenshot 2024-08-14 215006

Results when state is not provided:

Screenshot 2024-08-14 215259

josephjclark commented 3 months ago

@SatyamMattoo yeah we're going to need a better way to share and discuss these results. Screenshots and long github comment threads aren't going to cut it.

I asked you to explore solutions for doing this effectively. Did you have any other ideas?

SatyamMattoo commented 3 months ago

@josephjclark Yes, I have been working on a script that can take multiple inputs and will write the outputs to a .md file. There are some issues with that, hopefully I will push that code tomorrow.

josephjclark commented 3 months ago

@SatyamMattoo No rush, please take your time and share those results when you're ready.

SatyamMattoo commented 3 months ago

@josephjclark I have pushed the changes that includes the script to test multiple inputs with a single command poetry run python services/gen_job/job_processor.py -i tmp/input.json -o tmp/output.md. The input has to be a json file with an array of different inputs and it processes them one by one and writes the output to a file.

Also, there is a problem with the commcare adaptor. It seemed to be working fine before but now it seems to return me this error, while other adaptors are working fine.

 descriptions = [adaptor_docs[doc]["description"] for doc in adaptor_docs]
                    ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
TypeError: string indices must be integers, not 'str'

josephjclark commented 2 months ago

@SatyamMattoo thank you and sorry for the late reply. Somehow this comment escaped me and I've only just seen it.

Can you commit some examples of input and output to the repo so that I can easily see?

Presumably commcare doesn't have a description or something? You should investigate what the problem is and probably build a workaround. If you can share the commcare structure maybe I can advise a fix. But apollo has no control over what those docs look like, so it should be reasonably robust to inconsistent data structures. The docs change all the time and bugs do occur