Garrafao / durel_system_annotators


batch processing #36

Closed shafqatvirk closed 7 months ago

shafqatvirk commented 7 months ago

We have to implement some form of batch processing for the computational annotator so that instances can be annotated at scale.

AinaIanemahy commented 7 months ago

The WordTransformer supports batch processing (12 sentences at a time). We currently do not make use of this because we pass one sentence at a time to the WordTransformer. We need to change this function in xl-lexeme:


```python
import numpy as np
from WordTransformer import InputExample


def compute_embeddings_lexeme(sentence_and_token_index: list[tuple], model) -> np.ndarray:
    """
    Compute embeddings for the given sentences and token indices.

    :param sentence_and_token_index: A list of tuples, each containing a sentence
        and the corresponding token index as a 'start:end' string of character offsets.
    :type sentence_and_token_index: list[tuple]
    :param model: The WordTransformer model used to encode the given sentences.

    :return: Embeddings for the given sentences and token indices.
    :rtype: np.ndarray
    """
    token_embeddings_output = []
    for sen, idx in sentence_and_token_index:
        # Parse the 'start:end' character offsets of the target token.
        start, end = (int(i) for i in idx.split(':'))
        # Wrap the sentence in quotes, as in the original pipeline.
        example = InputExample(texts='"' + sen + '"', positions=[start, end])
        # One encode call per sentence -- this is the part that prevents batching.
        outputs = model.encode(example)
        token_embeddings_output.append(outputs)
    return np.array(token_embeddings_output)
```
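
The loop above could instead build all examples up front and encode them in a single call. The following is a minimal sketch, assuming that `model.encode` accepts a list of `InputExample` objects and a `batch_size` keyword, mirroring the sentence-transformers API; the exact signature in the WordTransformer should be verified.

```python
import numpy as np
from WordTransformer import InputExample


def compute_embeddings_lexeme_batched(sentence_and_token_index, model, batch_size=12):
    """Batched variant (sketch): build all InputExamples first, then
    encode them in one call instead of one call per sentence."""
    examples = []
    for sen, idx in sentence_and_token_index:
        start, end = (int(i) for i in idx.split(':'))
        examples.append(InputExample(texts='"' + sen + '"', positions=[start, end]))
    # Assumption: encode() accepts a list of examples and a batch_size,
    # as in the sentence-transformers API it is modeled on.
    outputs = model.encode(examples, batch_size=batch_size)
    return np.array(outputs)
```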
AinaIanemahy commented 7 months ago

I have pushed changes that use the batch-processing functionality of the WordTransformer. A batch_size parameter can now be set in the settings file. e2745023baa0b5a350490053da99e2201a7014b8
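
For illustration, the new setting might be wired through along these lines; the file and variable names here are hypothetical and may differ from those in the commit above.

```python
# Hypothetical illustration; actual names in the commit may differ.
from settings import BATCH_SIZE  # e.g. BATCH_SIZE = 12 in the settings file

embeddings = compute_embeddings_lexeme_batched(
    sentence_and_token_index, model, batch_size=BATCH_SIZE
)
```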