bullet-train-pro / bullet_train-action_models

Populate import action field mappings with closest string match via OpenAI #63

Closed gazayas closed 1 year ago

gazayas commented 1 year ago

Closes #61.

Depends on:

Details

OpenAI represents text as embeddings (vectors of numbers). Comparing two embeddings gives a similarity score between -1 and 1, which we can use as string match data.

OpenAI has different models available which we can run our data against, and you can run the following in the rails console to see which models are available:

# Using the ruby-openai gem
client = OpenAI::Client.new(access_token: ENV["OPENAI_ACCESS_TOKEN"])
client.models.list

I went with text-similarity-babbage-001, but there are other text similarity models available so we can change it if necessary.
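For reference, a minimal sketch of how the embeddings themselves could be requested. `embeddings_for` and `FakeClient` are hypothetical names used only for illustration and are not part of this PR. The sketch assumes a client that responds to `embeddings(parameters:)` the way `OpenAI::Client` from ruby-openai does, and a response shaped like `{"data" => [{"index" => ..., "embedding" => [...]}]}`.

```ruby
# Hypothetical helper, not the PR's code. `client` is anything that responds to
# `embeddings(parameters:)`; with ruby-openai that would be
# OpenAI::Client.new(access_token: ENV["OPENAI_ACCESS_TOKEN"]).
def embeddings_for(client, inputs, model: "text-similarity-babbage-001")
  response = client.embeddings(parameters: {model: model, input: inputs})
  # Sort by "index" so the vectors line up with the order of `inputs`.
  response.fetch("data").sort_by { |item| item["index"] }.map { |item| item["embedding"] }
end

# Stand-in client so the sketch runs without network access or an API key.
# It returns made-up two-dimensional vectors, not real embeddings.
class FakeClient
  def embeddings(parameters:)
    data = parameters[:input].each_with_index.map do |text, index|
      {"index" => index, "embedding" => [text.length.to_f, 1.0]}
    end
    {"data" => data}
  end
end

vectors = embeddings_for(FakeClient.new, ["surname", "name"])
puts vectors.inspect
```

Passing the client in as an argument keeps the helper easy to exercise in tests without calling the API.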

Vectors

In the embeddings link above, you can see the following code can be used to compare the data:

import openai, numpy as np

resp = openai.Embedding.create(
    input=["feline friends say", "meow"],
    engine="text-similarity-davinci-001")

embedding_a = resp['data'][0]['embedding']
embedding_b = resp['data'][1]['embedding']

similarity_score = np.dot(embedding_a, embedding_b)

Ruby's Vector#dot method does the same thing, so I used it in the closest_attribute_matches method that I wrote, which returns the CSV column name, the closest match from the model, and the similarity score:

# original_value is from the CSV file.
# closest_match is an attribute from the model.
[{:original_value=>"surname", :closest_match=>"name", :score=>0.8904644761117861}]
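As a rough illustration of the idea (not the actual closest_attribute_matches implementation), the sketch below scores one CSV column against each model attribute with `Vector#dot` and keeps the highest score. `closest_match_for` is a hypothetical helper, and the three-dimensional vectors are toy values standing in for real embeddings.

```ruby
require "matrix"

# Hypothetical helper: picks the attribute whose embedding has the highest
# dot product with the CSV column's embedding.
def closest_match_for(column_name, column_embedding, attribute_embeddings)
  column_vector = Vector.elements(column_embedding)
  scored = attribute_embeddings.map do |attribute, embedding|
    [attribute, column_vector.dot(Vector.elements(embedding))]
  end
  closest_match, score = scored.max_by { |_, s| s }
  {original_value: column_name, closest_match: closest_match, score: score}
end

# Toy vectors standing in for real embeddings (assumed values).
attribute_embeddings = {
  "name" => [0.6, 0.8, 0.0],
  "email" => [0.0, 0.6, 0.8]
}

result = closest_match_for("surname", [0.8, 0.6, 0.0], attribute_embeddings)
puts result.inspect
```

With these toy values, "name" scores higher than "email", so it is returned as the closest match.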

The docs suggest using cosine similarity as the distance function, but they also say "The choice of distance function typically doesn’t matter much," so I don't think we have to worry about using anything besides dot.
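To make the relationship between the two concrete, here is a small sketch showing that the dot product equals cosine similarity once both vectors are scaled to unit length. Whether a given model's embeddings are already unit length is an assumption worth verifying. `cosine_similarity` is a hypothetical helper and the vectors are toy values.

```ruby
require "matrix"

# Cosine similarity: the dot product divided by the product of the magnitudes.
def cosine_similarity(a, b)
  a.dot(b) / (a.norm * b.norm)
end

# Toy vectors (assumed values), deliberately not unit length.
a = Vector[3.0, 4.0]
b = Vector[4.0, 3.0]

dot_raw = a.dot(b)                       # depends on the vectors' magnitudes
cos = cosine_similarity(a, b)            # magnitude-independent
dot_unit = a.normalize.dot(b.normalize)  # dot product of the unit-length versions

puts "raw dot: #{dot_raw}, cosine: #{cos}, unit dot: #{dot_unit}"
```

If the embeddings are not unit length, the raw dot product also reflects vector magnitude, so scores from the two functions are no longer interchangeable.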

gazayas commented 1 year ago

I opened this as a WIP for a few reasons:

  1. For some reason "Don't Import" isn't working when there are more columns in the CSV than model attributes.
  2. I'm not sure how you'd like to handle tests for this one, so I'd be glad to hear your thoughts.
  3. I think I'll give myself some time just to look over it again and make sure things turned out how I wanted them to.

gazayas commented 1 year ago

Okay, this one should be good to go. With the last commit, I automatically set up the field mappings with the values passed via the select inputs. I don't think we were actually doing this before. (For example, say you have an attribute called "Name". On main, you can try choosing "Don't import" for a column, but the attribute will still be mapped to "Name".)

Besides that, system tests are passing, so I'll mark this one as ready!

jagthedrummer commented 1 year ago

@gazayas If we still need this PR can you resolve the conflict?

gazayas commented 1 year ago

@jagthedrummer Sorry for the delay on this one, the merge conflict has been resolved.

It stemmed from https://github.com/bullet-train-pro/bullet_train-action_models/commit/a8ff0bcc7befd89866e41d4a8f163e55f94d62e1 where we put analyze_file in a private method at the bottom of performs_import.rb. The original OpenAI code I wrote in this PR was in that method, so I moved it to the proper place in the file.

The test failures seem to be due to a Google Chrome issue.

jagthedrummer commented 1 year ago

I got the Chrome issue fixed up in main, then merged main into this branch, and now everything is looking good. 👍