GoogleCloudPlatform / vertex-ai-samples

Notebooks, code samples, sample apps, and other resources that demonstrate how to use, develop and manage machine learning and generative AI workflows using Google Cloud Vertex AI.
https://cloud.google.com/vertex-ai
Apache License 2.0

Vertex AI pipeline - IndexError: Invalid key: 0 is out of bounds for size 0 #2813

Open kk2491 opened 3 months ago

kk2491 commented 3 months ago

Expected Behavior

The fine-tuning of the foundation model should complete without any issues.

Actual Behavior

The fine-tuning step gets terminated. The details are provided below:

Training framework - Google Colab
Model used - Llama2-7B
Fine-tuning method - PEFT
Number of samples in training set - 100
Number of samples in eval set - 20
Format of the training data - JSONL; an example sample is given below:

{"text": "### Human: What is arithmatic mean? ### Assistant: The arithmetic mean, or simply the mean, is the average of a set of numbers obtained by adding them up and dividing by the total count of numbers."}
{"text": "### Human: What is geometric mean? ### Assistant: The geometric mean is a measure of central tendency calculated by multiplying all values in a dataset and then taking the nth root of the product, where n is the total number of values."}

Vertex pipeline parameters:

pipeline_parameters = {
    "base_model": base_model,
    "dataset_name": dataset_name,
    "prediction_accelerator_type": prediction_accelerator_type,
    "training_accelerator_type": training_accelerator_type,
    "training_precision_mode": training_precision_mode,
    "training_lora_rank": 16,
    "training_lora_alpha": 32,
    "training_lora_dropout": 0.05,
    "training_steps": 20,
    "training_warmup_steps": 10,
    "training_learning_rate": 2e-4,
    "evaluation_steps": 10,
    "evaluation_limit": 1,
}

When I execute the training process, I get the below error:  

raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")  
IndexError: Invalid key: 0 is out of bounds for size 0
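
For context, this message matches what the Hugging Face datasets library raises when you index into a dataset that has zero rows, which would mean the training/eval split ended up empty. A minimal sketch that reproduces the message:

from datasets import Dataset

# Indexing an empty dataset reproduces the same error message.
empty = Dataset.from_dict({"text": []})
empty[0]  # IndexError: Invalid key: 0 is out of bounds for size 0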

Can you please help me understand the questions below?

  1. Is the format of the training data correct? I used the format given as the default example in the Colab notebook; you can find the dataset here.
  2. Is the number of samples too small?
  3. Is there anything I am missing here?

Steps to Reproduce the Problem

Specifications

gericdong commented 3 months ago

@kk2491 can you please let me know which notebook you ran?

kk2491 commented 3 months ago

Hi @gericdong, I am using the notebook below:

model_garden_pytorch_llama2_peft_finetuning.ipynb

Thank you,
KK

gericdong commented 3 months ago

@genquan9: can you please assist with this? Thank you.

genquan9 commented 3 months ago

If you train from a Hugging Face dataset, you can input something like timdettmers/openassistant-guanaco directly.

But if you use a dataset JSONL stored in GCS, you should use a format like: {"input_text":"TRANSCRIPT: \nREASON FOR EVALUATION:,\n\n LABEL:","output_text":"Chiropractic"}
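
For illustration, a rough sketch of converting one of the "text"-style samples above into that format (the field names follow the GCS example in this comment; the split logic is only an assumption about the prompt layout):

import json

# Rough conversion sketch: "### Human: ... ### Assistant: ..." -> input/output fields.
# The field names follow the GCS example above; the parsing is illustrative only.
sample = {"text": "### Human: What is geometric mean? ### Assistant: The geometric mean is ..."}
human_part, assistant_part = sample["text"].split("### Assistant:", 1)
converted = {
    "input_text": human_part.replace("### Human:", "").strip(),
    "output_text": assistant_part.strip(),
}
print(json.dumps(converted))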

genquan9 commented 3 months ago

The team is verifying the notebook with pipelines again.

kk2491 commented 3 months ago

@genquan9 Thanks for the response.

I am not using the dataset from the GCP bucket.
I have created my own dataset in Hugging Face following the format of timdettmers/openassistant-guanaco; you can find the dataset here.

Thank you,
KK

kk2491 commented 3 months ago

@genquan9 @gericdong Sorry to bother you. Did you get a chance to look into the above issue?

Thank you, KK

jismailyan-google commented 3 months ago

Hi @kk2491, I was able to reproduce the issue. Please try again but set the evaluation_limit to 100.
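
That is, in the pipeline_parameters above, something along these lines (only the changed key is shown):

pipeline_parameters = {
    # ... keep the other parameters as before ...
    "evaluation_limit": 100,  # previously 1
}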

kk2491 commented 3 months ago

@jismailyan-google Thanks for the suggestion. Just out of curiosity, did you also try with my dataset (from here)?

Thank you,
KK

kk2491 commented 3 months ago

@jismailyan-google Looks like the notebook for the Vertex AI pipeline has been removed. However, I did try the fine-tuning with evaluation_limit set to 100, and the error remains the same.


kk2491 commented 3 months ago

@genquan9 @gericdong Did you get a chance to look into the above issue?

Thank you, KK

jismailyan-google commented 3 months ago

Hi @kk2491,

I was able to get the tuning completed with your dataset. You can try this out; just replace PIPELINE_ROOT_BUCKET with your GCS bucket and SERVICE_ACCOUNT with your own.

Also, please note the updated COMPILED_PIPELINE_PATH.

COMPILED_PIPELINE_PATH = "https://us-kfp.pkg.dev/ml-pipeline/google-cloud-registry/oss-peft-llm-tuner/sha256:2e723d2eccb84d28652dd73324e0bf5dc7179f2ddb4230853cb95b0428438eb0"

from google.cloud import aiplatform  # assumed to already be imported in the notebook

pipeline_parameters = {
    "base_model_name": "Llama-2-7b",
    "dataset_name": "kk2491/test",
}

# Define and launch the Pipeline Job.
job = aiplatform.PipelineJob(
    display_name='llama2-tuner-04042024',
    template_path=COMPILED_PIPELINE_PATH,
    pipeline_root=PIPELINE_ROOT_BUCKET,
    parameter_values=pipeline_parameters,
)

job.submit(service_account=SERVICE_ACCOUNT)

Let me know if this works.

kk2491 commented 3 months ago

@jismailyan-google I tried again, this time with the Vertex AI GUI (looks like the notebook for fine-tuning with Vertex AI has been removed). As per your comments, I don't have to change any parameters except BUCKET and SERVICE_ACCOUNT, so I tried with all default values; however, the results remain the same.

Now I am 100% sure that I am making some silly mistake here.. !!!

Joshwani-broadcom commented 2 months ago

I am running into the same error when trying to specify a custom dataset:

# Hugging Face dataset name or gs:// URI to a custom JSONL dataset.
dataset_name = "gs://llama-fine-tuning/training_data.jsonl"  # @param {type:"string"}

# Name of the dataset column containing training text input.
instruct_column_in_dataset = "text"  # @param {type:"string"}

# Optional. Template name or gs:// URI to a custom template.
template = ""  # @param {type:"string"}

I haven't looked, but I suspect that the image running the instruct-lora task is trying to load the gs:// URI as a Hugging Face dataset? Something like this: https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/community-content/vertex_model_garden/model_oss/peft/instruct_lora.py#L27.

I saw the following comment by @genquan9:

If you train from a Hugging Face dataset, you can input something like timdettmers/openassistant-guanaco directly.

But if you use a dataset JSONL stored in GCS, you should use a format like: {"input_text":"TRANSCRIPT: \nREASON FOR EVALUATION:,\n\n LABEL:","output_text":"Chiropractic"}

I haven't tried this yet, but it seems that the instruct lora task needs to account for gs:// URI somehow. Does it?
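
For what it's worth, here is a rough sketch of one way a gs:// JSONL could be staged locally and then loaded as a Hugging Face dataset (purely illustrative; this is not how instruct_lora.py actually does it, and the bucket/blob names just mirror my example above):

from google.cloud import storage
from datasets import load_dataset

# Download the JSONL from GCS to a local file, then load it with datasets.
# Bucket and blob names mirror the dataset_name example above.
client = storage.Client()
bucket = client.bucket("llama-fine-tuning")
bucket.blob("training_data.jsonl").download_to_filename("training_data.jsonl")

ds = load_dataset("json", data_files="training_data.jsonl", split="train")
print(len(ds))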

kk2491 commented 2 months ago

@Joshwani-broadcom Here is how I was able to fix the error. (Worth a try if you haven't already.)

  1. Each JSONL sample should contain at least 2 Human/Assistant conversation turns.
  2. Each JSONL sample should contain at least 512 words.

Looks like all of your samples are getting dropped due to one of the above reasons.
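
A quick way to estimate how many of your samples would survive those two filters (a rough sketch based on my own trial and error, not on the actual pipeline code; the thresholds of 2 turns and 512 words are my observations, not official documentation):

import json

# Count samples that satisfy both rules observed above.
kept = 0
with open("training_data.jsonl") as f:
    for line in f:
        text = json.loads(line)["text"]
        turns = text.count("### Human:")
        words = len(text.split())
        if turns >= 2 and words >= 512:
            kept += 1
print("samples kept:", kept)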

You can also find more details here. By following this, I was able to fix the error and fine-tune the Llama 2 model successfully.

Kindly let me know if you face any other issues.

Thank you,
KK

Joshwani-broadcom commented 2 months ago

Thank you @kk2491. Is it true that you are using a Hugging Face dataset? Did you ever find success using a gs:// URI in the notebook like this:

dataset_name = "gs://llama-fine-tuning/training_data.jsonl" 

?

kk2491 commented 2 months ago

Yes, initially I tried with a Hugging Face dataset and got it working. Later I migrated the same dataset to a Google Cloud Storage bucket, and it worked as expected.

Thank you,
Kk