Image Captioning AOAI Custom Skill

amitkalay commented 1 month ago

This PR introduces a chat completion prompt to do some image-processing and return the model's top response to be used for further processing. This code is called when we hit our base_url/api/summarize endpoint and pass in the appropriate header for image captioning. A concrete example is shown in the included .http file. When I called this new endpoint to tell me about the parts required to assemble a Tesla car (passed in as a base64 encoded image), I got the following response:

{
  "values": [
    {
      "warnings": null,
      "errors": [],
      "recordId": "5",
      "data": {
        "generative-caption": "This is a Tesla Model S, an electric vehicle (EV). Here's a list of key components required to make this car:\n\n```json\n{\n  \"car_body\": {\n    \"exterior\": {\n      \"front_bumper\": \"Aerodynamic front bumper with integrated sensors\",\n      \"headlights\": \"LED headlights\",\n      \"grille\": \"Tesla emblem grille\",\n      \"hood\": \"Aerodynamic hood\",\n      \"side_mirrors\": \"Side mirrors with integrated turn signals\",\n      \"doors\": \"Four doors with handles\",\n      \"windows\": \"Tinted windows\",\n      \"roof\": \"Panoramic glass roof\",\n      \"rear_bumper\": \"Rear bumper with reflectors\",\n      \"tail_lights\": \"LED tail lights\",\n      \"wheels\": \"Alloy wheels with tires\"\n    },\n    \"interior\": {\n      \"seats\": \"Leather or fabric seats\",\n      \"dashboard\": \"Digital dashboard with touchscreen display\",\n      \"steering_wheel\": \"Multi-function steering wheel\",\n      \"console\": \"Center console with storage and controls\",\n      \"airbags\": \"Front and side airbags\",\n      \"infotainment_system\": \"Touchscreen infotainment system\",\n      \"climate_control\": \"Automatic climate control\"\n    }\n  },\n  \"mechanical_components\": {\n    \"electric_motor\": \"High-performance electric motor\",\n    \"battery_pack\": \"Lithium-ion battery pack\",\n    \"transmission\": \"Single-speed transmission\",\n    \"suspension\": \"Independent suspension system\",\n    \"brakes\": \"Regenerative braking system\",\n    \"drive_system\": \"All-wheel drive system\"\n  },\n  \"electrical_components\": {\n    \"charging_port\": \"Charging port for EV charging\",\n    \"wiring_harness\": \"Electrical wiring harness\",\n    \"control_unit\": \"Electronic control unit (ECU)\",\n    \"sensors\": \"Various sensors for safety and automation\",\n    \"cameras\": \"Cameras for autopilot and safety features\",\n    \"display_screens\": \"Multiple display screens\"\n  },\n  \"safety_features\": {\n    \"autopilot\": \"Autopilot system with autonomous driving capabilities\",\n    \"collision_detection\": \"Collision detection system\",\n    \"lane_assist\": \"Lane assist system\",\n    \"parking_sensors\": \"Parking sensors\"\n  }\n}\n```"
      }
    }
  ]
}

amitkalay commented 1 month ago

currently the request is going in successfully, but the model is unable to recognize celebrities in the photo

bleroy commented 1 month ago

Those skills all seem very similar in structure, with few things changing apart from the prompt template. Is there a way we can factor the common parts?

Azure-Samples / azure-search-power-skills

Image Captioning AOAI Custom Skill #200