Watts-Lab / team_comm_tools

An open-source Python library that turns multiparty conversational data into social-science backed features.
https://teamcommtools.seas.upenn.edu/
MIT License

Allow User to Customize Vectors #205

Open xehu opened 6 months ago

xehu commented 6 months ago

Allow Vectors Other than SBERT

Some of our features require vectorizing the chat data; by default, we use SBERT vectors, and we have a utility that checks that the embeddings exist in the appropriate vector folder. If they do not exist, we generate SBERT vectors on behalf of the user: https://github.com/Watts-Lab/team-process-map/blob/main/feature_engine/utils/check_embeddings.py

However, we should make it possible for users to use other vectorization methods, whether that's Word2Vec, GloVe, GPT, or something else.

Document the possibility for users to pass in their own vectors.

One basic option is to document clearly that users can vectorize their data on their own and place the vectors at the relevant path. We will then use whatever the user provides. However, we then need to make our expectations for those vectors clear.

Before using the vectors, add checks/assertions that we have what we expect.

If we allow users to pass in their own vectors, we should apply basic checks: for example, that the table of vectors is the same length as the chat data (as we assume there is one vector per chat / message). These checks should be added to the check embeddings function.

If the user-provided vectors don't meet the expectations, we may then want to (by default) generate new vectors for them.
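
As a rough illustration of the checks and fallback described above, here is a minimal sketch. It assumes the vectors arrive as a sequence of numeric sequences, one per message. The names `validate_user_vectors`, `load_or_regenerate`, and the `generate_fn` callback are hypothetical and used only for illustration; they are not existing functions in this repository.

```python
import math
from typing import Callable, List, Sequence

Vectors = Sequence[Sequence[float]]


def validate_user_vectors(num_messages: int, vectors: Vectors) -> List[str]:
    """Return a list of problems; an empty list means the vectors look usable."""
    problems = []
    # We assume there is exactly one vector per chat / message.
    if len(vectors) != num_messages:
        problems.append(f"expected {num_messages} vectors, found {len(vectors)}")
    # Every vector should be non-empty and share the same dimensionality.
    dims = {len(v) for v in vectors}
    if 0 in dims:
        problems.append("at least one vector is empty")
    if len(dims) > 1:
        problems.append(f"inconsistent vector dimensions: {sorted(dims)}")
    # NaN or infinite entries are almost certainly a mistake in the input.
    if any(not math.isfinite(x) for v in vectors for x in v):
        problems.append("vectors contain NaN or infinite values")
    return problems


def load_or_regenerate(messages: Sequence[str],
                       user_vectors: Vectors,
                       generate_fn: Callable[[Sequence[str]], Vectors]) -> Vectors:
    """Use the user's vectors if they pass the checks; otherwise regenerate."""
    problems = validate_user_vectors(len(messages), user_vectors)
    if problems:
        # Tell the user why their vectors were rejected before falling back.
        print("User-provided vectors rejected: " + "; ".join(problems))
        return generate_fn(messages)
    return user_vectors
```

Returning a list of problems rather than raising on the first failure lets us report everything that is wrong in a single message before falling back to the default vectors.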

[EDIT - 7/31/24 - COMPLETE] Allow Users to Force-Regenerate Vector Data

Currently, the check_embeddings file checks whether embeddings exist for a dataset, and generates them if they do not yet exist. However, if they already exist, there is no way to force regeneration short of deleting the files. This design assumed that datasets do not change; in reality, they do! We add new rows to test datasets all the time, so we should make it possible to specify when embeddings are regenerated.

Thus, the ask is relatively simple: add a flag to the FeatureBuilder constructor (which is what gets exposed to users) that lets them specify that vector data should be regenerated even if it already exists. Then pass that option through to the check embeddings function.
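
To make the requested behavior concrete, here is a minimal sketch. It is not the code that was merged: `ensure_vectors`, the `regenerate_vectors` flag name, and the `generate_fn` callback are assumptions for illustration only.

```python
import os
import tempfile


def ensure_vectors(vect_path, generate_fn, regenerate_vectors=False):
    """Generate vectors if they are missing, or if regeneration is forced.

    Returns True when generate_fn was called, False when the existing file
    was reused.
    """
    if regenerate_vectors or not os.path.isfile(vect_path):
        # Create the parent folder if needed, then (re)write the vector file.
        os.makedirs(os.path.dirname(vect_path) or ".", exist_ok=True)
        generate_fn(vect_path)
        return True
    return False


def write_stub(path):
    """Stand-in for the real embedding generation: writes an empty file."""
    with open(path, "w"):
        pass


# Usage: the second call reuses the existing file; the third one is forced.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sentence", "chats", "example.csv")
    results = [
        ensure_vectors(path, write_stub),
        ensure_vectors(path, write_stub),
        ensure_vectors(path, write_stub, regenerate_vectors=True),
    ]
```

Defaulting the flag to `False` keeps the current behavior for existing users, so only callers whose datasets have changed need to opt in.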

xehu commented 6 months ago

Relevant code in feature_builder.py: we currently assume that the name of the "base file" is the last item in the output path, and we check that a vector file with such a name has not yet been created when we initialize the FeatureBuilder.

However, this bit of code relies on some assumptions that we may need to relax; see: https://github.com/Watts-Lab/team-process-map/issues/211

        # TODO: the FeatureBuilder assumes that we are passing in an output file path that contains either "chat" or "turn"
        # in the name, as it saves the featurized content into either a "chat" folder or "turn" folder based on user
        # specifications. See: https://github.com/Watts-Lab/team-process-map/issues/211
        self.output_file_path_chat_level = re.sub('chat', 'turn', output_file_path_chat_level) if self.turns else output_file_path_chat_level
        # We assume that the base file name is the last item in the output path; we will use this to name the stored vectors.
        base_file_name = self.output_file_path_chat_level.split("/")[-1]
        self.vect_path = vector_directory + "sentence/" + ("turns" if self.turns else "chats") + "/" + base_file_name
        self.bert_path = vector_directory + "sentiment/" + ("turns" if self.turns else "chats") + "/" + base_file_name
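
One possible way to make the path construction above less brittle, offered only as a sketch: use `os.path` instead of splitting on "/" and concatenating strings, so that `vector_directory` works with or without a trailing slash. The function name `build_vector_paths` is hypothetical, and this does not address the "chat" / "turn" naming assumption discussed in issue 211.

```python
import os


def build_vector_paths(output_file_path, vector_directory, turns):
    """Derive the sentence-vector and sentiment-vector paths for an output file."""
    # The base file name is still taken to be the last component of the output path.
    base_file_name = os.path.basename(output_file_path)
    level = "turns" if turns else "chats"
    # os.path.join inserts separators only where they are missing.
    vect_path = os.path.join(vector_directory, "sentence", level, base_file_name)
    bert_path = os.path.join(vector_directory, "sentiment", level, base_file_name)
    return vect_path, bert_path
```

For example, `build_vector_paths("./output/chat/data.csv", "./vector_data", turns=False)` would place the sentence vectors under `sentence/chats/data.csv` inside the vector directory.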