justinliang1020 commented 11 months ago

Initial code for adding an affine transformation on top of a base embedding model (in this case BAAI/bge-small-en-v1.5).

Based on the finetune model found here: https://github.com/567-labs/fastllm/blob/main/applications/finetune-embedding/model.py

This is currently draft code; only the forward pass and init functions have been modified so far to add in the base embedding model.

main.py is just sample code for running the model.

To run it, run python main.py in the directory.

EDIT:

Refactored the entire PR. It's now mainly based on the finetune-embedding directory. The logic behind this is that the base embedding model doesn't need to be inside the model.py file, since the base embedding weights are frozen and we're only training the linear adapter model on top of it.

local training with optuna now works with a sample dataset
edited dataset.py to take in text pairs instead of embeddings directly; it also takes in an embedding model as a parameter
removed target_similarity from the dataset as a data field since it's not present in most "pairs" datasets
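
For reference, the frozen-base plus linear-adapter idea can be sketched roughly as below. This is a minimal sketch, not the code in this PR: the class name LinearAdapter and the identity initialization are assumptions for illustration, and the random tensor stands in for embeddings that would really come from the frozen base model. The dimension 384 is the output size of BAAI/bge-small-en-v1.5.

```python
import torch
from torch import nn


class LinearAdapter(nn.Module):
    """Affine map (y = W x + b) applied on top of frozen base embeddings.

    Hypothetical name; the base embedding model lives outside this module
    because its weights are never updated.
    """

    def __init__(self, dim: int = 384):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        # Assumption: start at the identity so the untrained adapter
        # returns the base embedding unchanged.
        with torch.no_grad():
            self.linear.weight.copy_(torch.eye(dim))
            self.linear.bias.zero_()

    def forward(self, base_embedding: torch.Tensor) -> torch.Tensor:
        return self.linear(base_embedding)


adapter = LinearAdapter()
# Stand-in for a batch of 4 embeddings produced by the frozen base model.
base = torch.randn(4, 384)
out = adapter(base)
# Only the adapter's weight matrix and bias are trainable.
n_trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
```

Since only the adapter's parameters are handed to the optimizer, the base embeddings can be computed without gradients (or cached) during training.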

TODO:

get inference working with a checkpoint
get it on modal
refactor out the logging statements into another function

justinliang1020 commented 11 months ago

Pushed some new code for just getting training/inference running on modal. No need to review it yet since it's just a proof of concept.

justinliang1020 commented 10 months ago

Pushed the commits that I thought I had pushed earlier (but accidentally hadn't). What I did:

refactored the model code so the forward pass outputs a single embedding; the training/val/test steps are modified to use bi-encoder style training with the new forward pass
added a real dataset using the huggingface datasets library (works on modal and local)
use pathlib for paths
modified dataset to only calculate base embeddings on retrieval
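
To illustrate the "only calculate base embeddings on retrieval" change, here is a rough sketch of a pairs dataset that stores raw text and embeds it when an item is fetched. It is not the dataset.py from this PR: PairDataset and toy_embed are hypothetical names, and toy_embed is a stand-in for the real embedding model so the example runs without downloading anything.

```python
from typing import Callable, List, Sequence, Tuple

Embedding = List[float]


class PairDataset:
    """Holds raw text pairs and embeds them only when an item is retrieved."""

    def __init__(
        self,
        pairs: Sequence[Tuple[str, str]],
        embed: Callable[[str], Embedding],
    ):
        self.pairs = list(pairs)
        self.embed = embed  # embedding model passed in as a parameter

    def __len__(self) -> int:
        return len(self.pairs)

    def __getitem__(self, idx: int) -> Tuple[Embedding, Embedding]:
        # No embeddings are computed up front; this is the only place
        # the embedding function is called.
        first_text, second_text = self.pairs[idx]
        return self.embed(first_text), self.embed(second_text)


calls: List[str] = []


def toy_embed(text: str) -> Embedding:
    """Hypothetical stand-in for the base embedding model."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


dataset = PairDataset(
    [("what is an adapter", "a small trainable layer"), ("hello", "world")],
    toy_embed,
)
first = dataset[0]  # embeds only the first pair
```

The class only implements __len__ and __getitem__, so it avoids paying the embedding cost for rows that are never read. The trade-off is that an item read in several epochs is re-embedded each time unless a cache is added.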

What I plan on doing:

continue refactoring out old code (like old dataset stuff)
change model's forward pass to go from (base embedding -> finetune embedding), to (text -> finetune embedding). this will be done by adding the embedding model to the finetune model
make sure the new forward pass is compatible with huggingface inference
research spike into parallel GPU optuna training with modal

justinliang1020 commented 10 months ago

Closing this PR since we are pivoting from the PyTorch implementation to a sentence-transformers implementation.

567-labs / fastllm

[Draft] initial affine transformation model #19
