Allow Vectors Other than SBERT

Some of our features require vectorizing the chat data; by default, we use SBERT vectors, and we have a utility that checks that the embeddings exist in the appropriate vector folder. If they do not exist, we generate SBERT vectors on behalf of the user: https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/utils/check_embeddings.py

However, we should make it possible for users to use other vectorization methods, whether Word2Vec, GloVe, GPT, or something else.

Document the possibility for users to pass in their own vectors.

One basic option is to simply document clearly that the user can vectorize their data on their own, and put the vectors into the relevant path. We will then use whatever the user provides. However, we then need to make expectations clear:

What are the different types of vector(s)? (Currently, we store SBERT vectors; we also run inference using RoBERTa, and we store positive/negative/neutral labels as part of the same step as calculating vectors --- even though those aren't actually "vectors" in the strict sense.)
What's the expected name of the file(s)?
What columns do we expect?
What should be in those columns? (What's the type of the vector? Go to where the vectors are read in and used, and ensure that the implicit formatting expectations are made explicit.)
How many rows should there be? (It should match the number of rows in the chat data file.)
What do we expect to be in each row? (We assume, for example, that each row represents one row of chat.)

Before using the vectors, add checks/assertions that we have what we expect.

If we are allowing users to pass in the vectors, we should then apply basic checks: for example, that the table of vectors is the same length as the chat data (as we assume there is one vector per chat / message). These checks should be added to the check embeddings function.

If the user-provided vectors don't meet the expectations, we may then want to (by default) generate new vectors for them.
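
The checks described above could be sketched roughly as follows. This is only an illustration, not the existing check_embeddings code: the helper name `validate_user_vectors`, its parameters, and the representation of the vectors as a list of lists of numbers are all assumptions made for the example.

```python
from numbers import Real


def validate_user_vectors(vectors, num_chat_rows):
    """Return a list of problems found in user-provided vectors.

    Hypothetical helper: `vectors` is assumed to be a list with one
    entry per chat row, each entry being a list of numbers.
    An empty return value means every check passed.
    """
    problems = []

    # One vector per chat / message: the lengths must match.
    if len(vectors) != num_chat_rows:
        problems.append(
            f"expected {num_chat_rows} vectors, found {len(vectors)}"
        )

    # Every vector must be a non-empty sequence of numbers.
    dims = set()
    for i, vec in enumerate(vectors):
        if not isinstance(vec, (list, tuple)) or len(vec) == 0:
            problems.append(f"row {i}: not a non-empty list of numbers")
            continue
        if not all(isinstance(x, Real) and not isinstance(x, bool) for x in vec):
            problems.append(f"row {i}: contains non-numeric values")
            continue
        dims.add(len(vec))

    # All vectors should share a single dimensionality.
    if len(dims) > 1:
        problems.append(f"inconsistent vector dimensions: {sorted(dims)}")

    return problems
```

If the returned list is non-empty, the caller could report the problems to the user and, by default, fall back to generating SBERT vectors.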

Allow Users to Force-Regenerate Vector Data

Currently, the check_embeddings file checks whether embeddings exist for a dataset, and generates them if they do not yet exist. However, if they already exist, there is no way to force regenerating them (unless you delete the files). This design assumed that datasets do not change; in reality, they do change! We add new rows to test datasets all the time, so we should make it possible to specify when to regenerate embeddings.

Thus, the ask is relatively simple: add a flag into the FeatureBuilder constructor (which is what gets exposed to users), where they can specify that vector data needs to be regenerated even if it already exists. Then, in the check embeddings function, carry over the option.
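
A minimal sketch of the decision the flag would control, assuming a hypothetical helper `needs_embedding_generation` and a hypothetical flag name `regenerate_vectors` (neither exists in the codebase today):

```python
import os


def needs_embedding_generation(vector_path, regenerate_vectors=False):
    """Decide whether embeddings should be (re)generated.

    Hypothetical helper; `regenerate_vectors` stands in for the proposed
    FeatureBuilder flag and is not an existing parameter.
    """
    # An explicit request to regenerate overrides any stored file.
    if regenerate_vectors:
        return True
    # Otherwise keep the current behavior: generate only when missing.
    return not os.path.isfile(vector_path)
```

With the default of `False`, existing behavior is unchanged; users whose datasets have changed would opt in explicitly rather than deleting files by hand.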

Relevant code in feature_builder.py: we currently assume that the name of the "base file" is the last item in the output path, and we check that a vector file with such a name has not yet been created when we initialize the FeatureBuilder.

However, this bit of code relies on some assumptions that we may need to relax; see: https://github.com/Watts-Lab/team-process-map/issues/211

        ## TODO: the FeatureBuilder assumes that we are passing in an output file path that contains either "chat" or "turn"
        ### in the name, as it saves the featurized content into either a "chat" folder or "turn" folder based on user
        ### specifications. See: https://github.com/Watts-Lab/team-process-map/issues/211
        self.output_file_path_chat_level = re.sub('chat', 'turn', output_file_path_chat_level) if self.turns else output_file_path_chat_level
        # We assume that the base file name is the last item in the output path; we will use this to name the stored vectors.
        base_file_name = self.output_file_path_chat_level.split("/")[-1]
        self.vect_path = vector_directory + "sentence/" + ("turns" if self.turns else "chats") + "/" + base_file_name
        self.bert_path = vector_directory + "sentiment/" + ("turns" if self.turns else "chats") + "/" + base_file_name
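
One possible way to make the path handling in the excerpt above less dependent on "/" separators and trailing slashes is sketched below. The function name `build_vector_paths` and its signature are hypothetical; only the "sentence" / "sentiment" and "turns" / "chats" folder layout is taken from the existing code.

```python
import os


def build_vector_paths(output_file_path, vector_directory, turns=False):
    """Return (sentence_path, sentiment_path) for the stored vectors.

    Hypothetical helper mirroring the folder layout in the excerpt above.
    """
    # os.path.basename avoids hard-coding "/" as the path separator.
    base_file_name = os.path.basename(output_file_path)
    level = "turns" if turns else "chats"
    # os.path.join works whether or not vector_directory ends in a separator.
    vect_path = os.path.join(vector_directory, "sentence", level, base_file_name)
    bert_path = os.path.join(vector_directory, "sentiment", level, base_file_name)
    return vect_path, bert_path
```

This does not address the "chat" / "turn" naming assumption tracked in issue 211; it only illustrates the base-file-name step.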

Allow User to Customize Vectors #205
