NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

TUTORIAL BUG ONLY: Missing model.eval() in the SapBERT implementations #8161

Closed LiuRossa closed 7 months ago

LiuRossa commented 8 months ago

Describe the bug

TUTORIAL BUG ONLY: Missing model.eval() in the SapBERT implementations.

There is a mistake causing inconsistent results in the NLP tutorials, in at least two notebooks:

The same issue is also present in the stable branch, in every .py file under NeMo/examples/nlp/entity_linking that calls the model.forward(...) function.

After the training part, the model is not set to eval mode, so Dropout remains active during evaluation. This can be quite misleading for anyone who wishes to implement these methods.
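To illustrate the effect, here is a minimal sketch (not NeMo code; a toy encoder stands in for the tutorial model) showing that an active Dropout layer makes repeated forward passes on the same input disagree until model.eval() is called:

# Minimal illustration (not NeMo code): dropout makes repeated forward
# passes on identical input disagree unless the module is in eval mode.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a BERT-style encoder; purely illustrative.
encoder = nn.Sequential(nn.Linear(16, 16), nn.Dropout(p=0.1), nn.Linear(16, 8))
x = torch.randn(4, 16)

encoder.train()                      # default state right after training
a, b = encoder(x), encoder(x)
print(torch.allclose(a, b))          # False: dropout masks differ per call

encoder.eval()                       # disables dropout
with torch.no_grad():
    a, b = encoder(x), encoder(x)
print(torch.allclose(a, b))          # True: deterministic outputs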

Steps/Code to reproduce bug

1) Run the tutorial code for the aforementioned notebooks, for instance in Colab (any setting).
2) Run the evaluation part twice: you get different results.

Wrong results printed in the notebook

After running the Model Evaluation block once, you see (the columns are top-1 and top-5 accuracy):

Model                 1         5
BERT Base Baseline    0.700000  1.000000
BERT + SAP            0.800000  1.000000

Run the same block a second time, and you see:

Model                 1         5
BERT Base Baseline    0.700000  1.000000
BERT + SAP            0.700000  0.800000

The second line can show essentially any other values on each rerun.

In Entity Linking via Nearest Neighbor Search, the similarity scores and the most similar concepts returned by SapBERT change on every run, even though the model itself is unchanged.

Expected behavior

The evaluation of the model should give consistent results.
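A quick sanity check for this expectation (our own sketch, not part of the tutorial) is to extract the embeddings twice with the same model and compare them; model and dataloader are assumed to be the objects built in the notebook, and the dataloader is assumed not to shuffle:

# Determinism check (our own sketch, not tutorial code).
# Assumes `model` and `dataloader` are the objects built in the notebook,
# `get_embeddings` is the helper discussed below, and the dataloader does
# not shuffle between iterations.
import numpy as np

emb_run1, _ = get_embeddings(model, dataloader)
emb_run2, _ = get_embeddings(model, dataloader)

# With model.eval() applied inside get_embeddings, both runs should match.
assert np.allclose(np.asarray(emb_run1), np.asarray(emb_run2)), "embeddings are not deterministic"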

One suggested solution

Set the model to eval mode at the start of each evaluation and restore the initial mode at the end.

Replace the function get_embeddings(model, dataloader) with:

# Helper function to get data embeddings
# (torch, tqdm and `device` are already defined earlier in the notebook;
#  the imports are repeated here so the snippet is self-contained)
import torch
from tqdm import tqdm

def get_embeddings(model, dataloader):
    # Remember the current mode so it can be restored after evaluation
    mode = model.training
    embeddings, cids = [], []
    with torch.no_grad():
        model.eval()  # disable dropout so the embeddings are deterministic
        for batch in tqdm(dataloader):
            input_ids, token_type_ids, attention_mask, batch_cids = batch
            batch_embeddings = model.forward(input_ids=input_ids.to(device),
                                             token_type_ids=token_type_ids.to(device),
                                             attention_mask=attention_mask.to(device))

            # Accumulate index embeddings and their corresponding IDs
            embeddings.extend(batch_embeddings.cpu().detach().numpy())
            cids.extend(batch_cids)
    # Set the mode back to its original value
    model.train(mode=mode)

    return embeddings, cids

This resolves the problem, and the SapBERT output is now stable across runs.
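An equivalent way to package the same save/restore logic (our own sketch, not a NeMo API) is a small context manager that any evaluation helper can reuse:

# Optional refactor (our sketch, not a NeMo API): wrap the save/restore of
# the training flag in a context manager so evaluation code can reuse it.
from contextlib import contextmanager

import torch

@contextmanager
def evaluating(model):
    """Temporarily put `model` in eval mode, then restore its original mode."""
    was_training = model.training
    model.eval()
    try:
        with torch.no_grad():
            yield model
    finally:
        model.train(mode=was_training)

# Usage inside get_embeddings-style code:
# with evaluating(model):
#     batch_embeddings = model.forward(...)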

Additional comment

We think this bug has existed since the tutorial was created, and the author seems to have observed the inconsistent results, since the Model Evaluation section states:

This evaluation set contains very little data, and no serious conclusions should be drawn about model performance. Top 1 accuracy should be between 0.7 and 1.0 for both models and top 5 accuracy should be between 0.8 and 1.0.

With the fix applied, the result on our side is:

Model                 1         5
BERT Base Baseline    0.700000  1.000000
BERT + SAP            0.900000  1.000000

BERT + SAP consistently performs better than the BERT Base Baseline, even on this tiny example dataset.

Environment overview

Environment details

Additional context


github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.