ContinualAI / avalanche

Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
http://avalanche.continualai.org

Generative Replay mps.matmul #1456

Closed. Bard2803 closed this issue 1 year ago.

Bard2803 commented 1 year ago

🐛 Describe the bug
Training on the MPS GPU goes well until it hits the last epoch of experience 0. Then the following error appears:

-- >> Start of training phase << --
0it [00:00, ?it/s]
/AppleInternal/Library/BuildRoots/c2cb9645-dafc-11ed-aa26-6ec1e3b3f7b3/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:39:0: error: 'mps.matmul' op contracting dimensions differ 3072 & 784
(mpsFileLoc): /AppleInternal/Library/BuildRoots/c2cb9645-dafc-11ed-aa26-6ec1e3b3f7b3/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:39:0: note: see current operation: %6 = "mps.matmul"(%arg0, %5) {transpose_lhs = false, transpose_rhs = false} : (tensor<20x3072xf32>, tensor<784x400xf32>) -> tensor<20x400xf32>
zsh: segmentation fault  python train.py

For CPU, the error is slightly different:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (20x3072 and 784x400)
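
For context, the two sizes in the trace are 3072 = 3 * 32 * 32 (a flattened CORe50 "mini" RGB image) and 784 = 28 * 28 (the input size an MNIST-shaped VAE encoder expects). A minimal plain-PyTorch sketch, assuming the default generator's first encoder layer is Linear(784, 400) as the 784x400 weight in the trace suggests, reproduces the same CPU error:

    import torch
    import torch.nn as nn

    # Sketch only: a linear layer expecting 784 inputs (28 * 28), fed a
    # flattened CORe50 mini-style minibatch of 3 * 32 * 32 = 3072 features.
    encoder_in = nn.Linear(28 * 28, 400)   # weight shape: (400, 784)
    batch = torch.randn(20, 3, 32, 32)     # CORe50 mini-style minibatch
    flat = batch.flatten(start_dim=1)      # -> (20, 3072)
    encoder_in(flat)                       # RuntimeError: mat1 and mat2 shapes cannot be multiplied (20x3072 and 784x400)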

🐜 To Reproduce
This is my code:

# Imports added for completeness (not in the original snippet); module paths
# assume the Avalanche 0.3.x layout:
import torch
from torch.optim import Adam
from torch.nn import CrossEntropyLoss

from avalanche.benchmarks.classic import CORe50
from avalanche.benchmarks.generators import benchmark_with_validation_stream
from avalanche.models import SimpleCNN
from avalanche.logging import InteractiveLogger
from avalanche.training.plugins import EvaluationPlugin
from avalanche.training.supervised import GenerativeReplay
from avalanche.evaluation.metrics import (
    loss_metrics, accuracy_metrics, class_accuracy_metrics,
    cpu_usage_metrics, gpu_usage_metrics, ram_usage_metrics,
    disk_usage_metrics, forward_transfer_metrics, forgetting_metrics,
)


def GR():

    if torch.cuda.is_available():
        device = 'cuda:0'
        torch.backends.cudnn.benchmark = True
        device_count = torch.cuda.device_count()
        print(f"Found {device_count} CUDA GPU devices.")
    elif torch.backends.mps.is_available():
        device = 'mps'
    else:
        device = 'cpu'

    print(f'Using {device} device')

    model = SimpleCNN(num_classes=50).to(device)
    print(f"Main model {model}")

    # Load the CORe50 dataset
    core50 = CORe50(scenario="nc", mini=True, object_lvl=True)

    # the task label of each train_experience.
    print('--- Task labels:')
    print(core50.task_labels)

    # Instantiate train, validation and test streams
    core50 = benchmark_with_validation_stream(core50, 0.2)
    train_stream = core50.train_stream
    val_stream = core50.valid_stream

    optimizer = Adam(model.parameters(), lr=0.001)
    criterion = CrossEntropyLoss()

    loggers = []
    loggers.append(InteractiveLogger())

    eval_plugin = EvaluationPlugin(
        loss_metrics(epoch=True, stream=True),
        accuracy_metrics(epoch=True, stream=True),
        class_accuracy_metrics(epoch=True, stream=True),
        cpu_usage_metrics(epoch=True),
        gpu_usage_metrics(gpu_id=0, epoch=True),
        ram_usage_metrics(epoch=True),
        disk_usage_metrics(epoch=True),
        forward_transfer_metrics(stream=True),
        forgetting_metrics(stream=True),
        loggers=loggers,
        strict_checks=False)

    # CREATE THE STRATEGY INSTANCE (GenerativeReplay)
    cl_strategy = GenerativeReplay(
        model,
        optimizer,
        criterion,
        train_mb_size=20,
        train_epochs=4,
        eval_mb_size=20,
        device=device,
        evaluator=eval_plugin,
        eval_every=1)

    # TRAINING LOOP
    print("Starting experiment...")

    for train_experience in train_stream:
        print("Start of train_experience ", train_experience.current_experience)
        print(f"classes in this experience train {train_experience.classes_in_this_experience}")
        cl_strategy.train(train_experience, eval_streams=[val_stream])
        print("Training completed")

🐝 Expected behavior
Training should continue through all experiences.

🦋 Additional context
I tried decreasing the batch size and monitored memory consumption; the problem does not appear to be related to memory overhead.

Bard2803 commented 1 year ago

OK, so I am not sure this should be called a bug. It works after supplying a generator_strategy explicitly. The docs say a VAE generator is applied by default, but that default did not work for the setup I presented. I am leaving this here in case someone has a similar issue.

Just add a generator explicitly (do not rely on the default VAE as in the docs):

    # Additional imports for the generator (added for completeness, not in the
    # original comment; paths assume Avalanche 0.3.x):
    # from avalanche.models.generator import MlpVAE
    # from avalanche.training.supervised import VAETraining
    # from avalanche.training.plugins import GenerativeReplayPlugin

    # model:
    generator = MlpVAE((3, 32, 32), nhid=2, device=device)
    # optimizer:
    lr = 0.001

    optimizer_generator = Adam(
        filter(lambda p: p.requires_grad, generator.parameters()),
        lr=lr,
        weight_decay=0.0001,
    )
    # strategy (with plugin):
    generator_strategy = VAETraining(
        model=generator,
        optimizer=optimizer_generator,
        train_mb_size=100,
        train_epochs=4,
        eval_mb_size=100,
        device=device,
        plugins=[
            GenerativeReplayPlugin(
                replay_size=None,
                increasing_replay_size=False,
            )
        ],
    )

    # CREATE THE STRATEGY INSTANCE (GenerativeReplay)
    cl_strategy = GenerativeReplay(
        model,
        optimizer,
        criterion,
        train_mb_size=20,
        train_epochs=4,
        eval_mb_size=20,
        device=device,
        evaluator=eval_plugin,
        eval_every=1, 
        generator_strategy=generator_strategy)
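
Note that the explicit MlpVAE above is built for the CORe50 mini input shape (3, 32, 32), which avoids the 3072-vs-784 mismatch in the original trace. A quick sanity check, assuming the generator's forward pass accepts a raw image minibatch, could be:

    # Hypothetical check (not from the issue): the explicit generator should
    # accept CORe50 mini-shaped minibatches without raising a shape error.
    dummy_batch = torch.randn(20, 3, 32, 32, device=device)
    with torch.no_grad():
        generator(dummy_batch)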