UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Memory Spike during training #1433

Open ShaoMinLiu-Holmusk opened 2 years ago

ShaoMinLiu-Holmusk commented 2 years ago

Memory use stays below 2 GB for most of training with the following configuration, but an OOM error occurs at epoch 2, iteration 94.

RuntimeError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 7.44 GiB total capacity; 6.42 GiB already allocated; 78.31 MiB free; 6.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
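
As an aside, the max_split_size_mb option mentioned in the error message is controlled through an environment variable that must be set before torch initialises the CUDA allocator. A minimal sketch, with 128 MiB purely as an illustrative value:

import os

# Must be set before torch initialises the CUDA allocator.
# 128 MiB is only an illustrative value, not a recommendation from this thread.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after setting the variable on purpose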

I repeated the same configuration twice, and the error occurs at exactly the same step.
Reducing the batchSize of BinarySimilarityDataset only delays the point at which the error appears.

I have masked some values to avoid leaking sensitive information.

configuration file:

model       : all-MiniLM-L6-v2
initialSeed: 20013

training     : {
                    epochNum      : 50, 
                    batchPerEpoch  : 100,

                    warmupFrac    : 0.1, # fraction of total training steps used for warm-up
                    optimizer     : {
                                       class     : 'AdamW',
                                       params    : {
                                         lr: 1.0e-6}
                    },
                    tasks         : {
                                      binarySimTask   : on,
                                      MLCTask         : off,
                                      tripletTask     : on
                    },
                    AMP           : False # automatic mixed precision
}

# Binary similarity task
binarySimTask      : {
                    data     : *****************
                    # needs to be an even number
                    batchSize     : 70, # 50 seems to be the largest possible for GPU RAM
}

tripletTask           : {
                    data           : ***********
                    batchSize      : 50
}

evaluation            : {
                    evaluationSteps : 50,
                    evaluators      : {
                      tripletEvaluation     : on,
                      KNNEvaluation         : on
                    }
}

tripletEvaluation       : {
                    data         : *************
                    # size         : 10000,
                    size         : 1500,
                    random_state : 1234
}

KNNEvaluation            : {
                    data         : ../********
                    taskLabels   : [
                                        *********
                    ],
                    batch_size   : 32,
                    n_neighbors  : 3,
                    random_state : 1234
}

output:
  location                      : ../results/NLItraining
  posfix                        : '***********' # give a tag to this run, appended to runID, default to empty
  checkpointSaveTotalLimit      : 2
  checkpointSaveSteps           : 100 # can be the same as the batchPerEpoch will checkpoint each epoch

Training script:

    train_objectives = []

    # Binary-label Classification task
    if moduleConfig['training']['tasks']['binarySimTask']:
        dataset = BinarySimilarityDataset(filePath=moduleConfig['binarySimTask']['data'],
                            batchSize=moduleConfig['binarySimTask']['batchSize'],
                            batchPerEpoch=moduleConfig['training']['batchPerEpoch'],
                            randomSeed=moduleConfig['initialSeed'])
        train_dataloader_binarySimilarity = DataLoader(dataset, batch_size=None)  # dataset yields pre-batched items
        train_loss_binarySimilarity = losses.ContrastiveLoss(model,
                            distance_metric=losses.SiameseDistanceMetric.COSINE_DISTANCE)
        train_objectives.append([
              train_dataloader_binarySimilarity, train_loss_binarySimilarity
        ])

    if moduleConfig['training']['tasks']['tripletTask']:
        dataset = TripletFlyDataset(moduleConfig['tripletTask']['data'],
                                batchSize=moduleConfig['tripletTask']['batchSize'],
                                batchPerEpoch=moduleConfig['training']['batchPerEpoch']
                                )
        train_dataloader_TripletFlyDataset = DataLoader(dataset, batch_size=None)  # dataset yields pre-batched items
        train_loss_TripletFlyDataset = losses.TripletLoss(model)
        train_objectives.append([
              train_dataloader_TripletFlyDataset, train_loss_TripletFlyDataset
        ])

    ###################################################################
    #### Evaluations
    ###################################################################
    evaluatersList = []

    if moduleConfig['evaluation']['evaluators']['tripletEvaluation']:
        dataset = pd.read_csv(moduleConfig['tripletEvaluation']['data'])
        dataset = dataset.sample(
            n = moduleConfig['tripletEvaluation'].get('size', 1500),
            random_state = moduleConfig['tripletEvaluation'].get('random_state', 1234)
            )
        testExamples = [
            InputExample(texts=[eachRow.anchor,
                                eachRow.positive,
                                eachRow.negative
                                ]
                        )
            for index, eachRow in dataset.iterrows()
        ]
        triplet_evaluator = TripletEvaluator.from_input_examples(
            testExamples
        )
        evaluatersList.append(triplet_evaluator)

    if moduleConfig['evaluation']['evaluators']['KNNEvaluation']:
        knn_evaluator = KNNEvaluator(
            dataCsv=moduleConfig['KNNEvaluation']['data'],
            taskLabels=moduleConfig['KNNEvaluation']['taskLabels'],
            batch_size=moduleConfig['KNNEvaluation']['batch_size'],
            n_neighbors=moduleConfig['KNNEvaluation']['n_neighbors'],
            random_state=moduleConfig['KNNEvaluation']['random_state']
        )
        evaluatersList.append(knn_evaluator)

    if moduleConfig['evaluation']['evaluators']['binaryEvaluation']:
        dataset = pd.read_csv(moduleConfig['binaryEvaluation']['data'])
        testExamples = [
            InputExample(texts=[eachRow.sentence_x,
                                eachRow.sentence_y],
                        label= 1 if eachRow.NLIlabel==0 else 0)
            for index, eachRow in dataset.iterrows()
        ]
        binary_evaluator = BinaryClassificationEvaluator.from_input_examples(testExamples,
                                                                        show_progress_bar=False)
        evaluatersList.append(binary_evaluator)

    dev_evaluator = MultipleEvaluater(evaluaters = evaluatersList)

    ###################################################################
    #### Training
    ###################################################################
    # warmup steps = warmupFrac of the total number of training steps
    # (with this config: 100 batches/epoch * 50 epochs * 0.1 = 500 steps)
    warmup_steps = math.ceil(
            moduleConfig['training']['batchPerEpoch'] *
            moduleConfig['training']['epochNum'] *
            moduleConfig['training']['warmupFrac'])
    # Train the model
    myLogger.info('Training begins')
    # optimizer: fall back to a default if none is configured
    if 'optimizer' not in moduleConfig['training']:
        print('no optimizer configured, using default AdamW')
        moduleConfig['training']['optimizer'] = {'class': 'AdamW',
                                                 'params': {'lr': 2e-5}}

    model.fit(train_objectives=train_objectives,
            evaluator=dev_evaluator,
            evaluation_steps=moduleConfig['evaluation']['evaluationSteps'],
            epochs=moduleConfig['training']['epochNum'],
            warmup_steps=warmup_steps,
            output_path=model_save_path,
            use_amp=moduleConfig['training']['AMP'],          #Set to True, if your GPU supports FP16 operations,
            checkpoint_path = checkpoint_path,
            checkpoint_save_total_limit = moduleConfig['output']['checkpointSaveTotalLimit'],
            checkpoint_save_steps = moduleConfig['output']['checkpointSaveSteps'],
            optimizer_class = getattr(transformers, 
                                      moduleConfig['training']['optimizer']['class']),
            optimizer_params = moduleConfig['training']['optimizer']['params']
            )
nreimers commented 2 years ago

Maybe there is a long text sequence? Transformers have a quadratic memory requirement in the text length. Try reducing max_seq_length.
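
For reference, a minimal sketch of how that cap can be set on a SentenceTransformer model (128 is only an illustrative value):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)   # 256 for this model

# Inputs longer than this are truncated at tokenisation time;
# 128 is only an illustrative value, not a recommendation from this thread.
model.max_seq_length = 128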

ShaoMinLiu-Holmusk commented 2 years ago

Maybe there is a long text sequence? Transformers have a quadratic memory requirement in the text length. Try reducing max_seq_length.

Thank you for your quick response, really appreciate it. I am mostly training on short sentences, but I will check that as well.

ShaoMinLiu-Holmusk commented 2 years ago

I did some quick analysis of the distribution of the data ingested up to the point where memory overflows.

It turns out that the batch at epoch 2, iteration 94 does contain one of the longest token counts, and the memory-use visualisation coincides with these inputs.

[Two screenshots, 2022-02-21: the token-count analysis and the GPU memory-use visualisation]
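
A minimal sketch of how such a token-count check can be reproduced with the model's own tokenizer (the texts list is a placeholder for the batch in question):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["..."]  # placeholder: the texts fed in the suspicious batch

# token count per text, including special tokens
lengths = [len(model.tokenizer(t)["input_ids"]) for t in texts]
print(max(lengths), sum(lengths) / len(lengths))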

Just one question, if you don't mind. I was under the impression that, given a specified max_seq_length, short inputs are padded with [PAD] tokens up to that length and longer inputs are truncated, so all sentences end up the same length and the resulting tensors have the same size.

suppose max_seq_len == 10

inputA = 'hello, nice to meet you' 
            -> 'hello, nice to meet you [PAD] [PAD] [PAD] [PAD] [PAD]' (padded to 10 tokens)
            -> tensor length == 10

inputB = 'hi, I have heard many things about you, its nice to finally meet you.' 
            -> 'hi, I have heard many things about you, its nice' (truncated to 10 tokens)
            -> tensor length == 10

Can you briefly explain why the length of the input sentences matters here? Are the batches not supposed to use the same amount of memory, since the tensors have fixed dimensions once the tokens are converted to index representations?

I understand that memory use generally increases over a forward pass, but is it correct to say that the expected maximum memory use per batch (and hence per epoch) should be constant? The answer appears to be no, but I cannot figure out why.

nreimers commented 2 years ago

Text is padded to the shortest length possible, i.e. each batch is padded only to its longest sequence (capped at max_seq_length), which gives much faster training than padding everything to the maximum length.
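
That means a single long text makes its whole batch wider, which explains the spike at that particular step. A rough sketch illustrating the effect (purely illustrative sentences):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

short_batch = ["hello, nice to meet you"] * 4
mixed_batch = ["hello, nice to meet you"] * 3 + ["a much longer sentence " * 30]

# tokenize() pads each batch only up to its own longest sequence
# (capped at model.max_seq_length), so the two batches produce
# input tensors of different widths and use different amounts of memory.
print(model.tokenize(short_batch)["input_ids"].shape)
print(model.tokenize(mixed_batch)["input_ids"].shape)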