VACOTechSprint / ambient-transcription

Docs and management tasks for sprint

Upload to ASR pipeline #9

Open dahifi opened 5 months ago

dahifi commented 5 months ago

The front end will deposit uploads in a specified GCS bucket. This task is to build a pipeline that takes these uploads and passes them to the ASR service.

To create a Google Cloud Function that watches a bucket for new files and sends them to the ASR server at the provided request URL, you'll write a function that triggers on google.storage.object.finalize, the event emitted when a new object is uploaded to a Google Cloud Storage bucket. The function then sends a POST request to the specified server with the file content.

Additionally, the function saves the JSON response to another bucket. Here's an example implementation in Python:

import os
import json
import requests
from google.cloud import storage

def send_file_to_server_and_save_response(event, context):
    """Triggered by a new file uploaded to a specified Google Cloud Storage bucket.
    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    file_name = event['name']
    bucket_name = event['bucket']

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(file_name)

    # Download the object to a temporary location (use the base name in case
    # the object name contains a folder path)
    temp_file_path = f"/tmp/{os.path.basename(file_name)}"
    blob.download_to_filename(temp_file_path)

    # Forward the audio to the ASR server for transcription with diarization
    url = "http://35.245.63.100:9000/asr?task=transcribe&encode=true&output=json&diarize=true"
    with open(temp_file_path, 'rb') as audio:
        files = {'audio_file': (file_name, audio, 'video/webm')}
        response = requests.post(url, files=files)

    if response.status_code == 200:
        print(f"File {file_name} was successfully sent to the server.")
        response_data = response.content

        # Define the bucket to save the JSON response
        response_bucket_name = '<YOUR_RESPONSE_BUCKET>'
        response_bucket = storage_client.bucket(response_bucket_name)

        # Define the path and name for the response file
        response_file_name = file_name + '.json'
        response_blob = response_bucket.blob(response_file_name)

        # Upload the JSON response to the specified bucket
        response_blob.upload_from_string(response_data, content_type='application/json')
        print(f"Response JSON for {file_name} saved to bucket {response_bucket_name} as {response_file_name}.")

    else:
        print(f"Failed to send file {file_name} to the server.")

    # Clean up the temporary file
    os.remove(temp_file_path)
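
For a quick local sanity check before deploying, you can call the handler directly with a hand-built event payload. This is only a sketch: the object and bucket names below are placeholders, and it assumes your local environment has application-default credentials with access to the buckets.

if __name__ == "__main__":
    # Simulate the google.storage.object.finalize payload the function
    # receives in production (placeholder names).
    fake_event = {
        "name": "sample.webm",
        "bucket": "<YOUR_TRIGGER_BUCKET>",
    }
    send_file_to_server_and_save_response(fake_event, context=None)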

To deploy this function, follow these steps:

  1. Ensure you have the Google Cloud SDK installed and initialized.

  2. Create a requirements.txt file with the following contents to specify the dependencies:

    google-cloud-storage
    requests
  3. Deploy the function to Google Cloud Functions with the following command, replacing <YOUR_TRIGGER_BUCKET> with the name of your upload bucket (and <YOUR_RESPONSE_BUCKET> in the code above with the bucket that should receive the JSON responses):

    gcloud functions deploy send_file_to_server_and_save_response \
       --runtime python39 \
       --trigger-bucket <YOUR_TRIGGER_BUCKET> \
       --entry-point send_file_to_server_and_save_response \
       --memory 128MB \
       --timeout 540s \
       --region <YOUR_FUNCTION_REGION>

    Make sure to replace <YOUR_FUNCTION_REGION> with the region where you want to deploy your Cloud Function.

  4. Test the function by uploading a file to your trigger bucket (for example, with the snippet below) and checking the function logs for successful execution.
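
    One way to exercise the deployed function is to upload a sample recording with the storage client library and then check the function's logs in the Cloud console. This is a sketch: sample.webm is a hypothetical local file and the bucket name is a placeholder.

    from google.cloud import storage

    # Uploading an object to the trigger bucket fires the
    # google.storage.object.finalize event that invokes the function.
    client = storage.Client()
    bucket = client.bucket("<YOUR_TRIGGER_BUCKET>")
    bucket.blob("sample.webm").upload_from_filename("sample.webm")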

Remember to ensure that the Cloud Function has the necessary IAM permissions to access the Google Cloud Storage buckets, as well as outbound internet access if it runs in a VPC-scoped environment. In particular, the Cloud Function's service account needs the Storage Object Creator role (or a custom role with equivalent permissions) on the response bucket so it can write the response files.
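
As a sketch of granting that role programmatically (the service account email below is a placeholder for the function's runtime service account; doing this through the console or gcloud works just as well), you can add an IAM binding on the response bucket with the storage client:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("<YOUR_RESPONSE_BUCKET>")

# Grant the Storage Object Creator role to the function's service account
# so it can write response files to this bucket.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectCreator",
    "members": {"serviceAccount:<FUNCTION_SERVICE_ACCOUNT_EMAIL>"},  # placeholder
})
bucket.set_iam_policy(policy)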

dahifi commented 5 months ago

The current code bypasses the GCS bucket and interacts with the ASR server directly; we decided to try handling the whole workflow on the client side instead.