facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

training on multiple datasets at once breaks #471

Open · miriamrebekah opened this issue 2 years ago

miriamrebekah commented 2 years ago

I'm trying to train on two datasets at once, using .npy filelist files for my datasets. Is training on multiple datasets at once supported? I put both in my config file, but I just keep getting this tensor error:

--- Logging error ---
Traceback (most recent call last):
  File "/home/mtan/latent-data/vissl/vissl/utils/distributed_launcher.py", line 158, in launch_distributed
    hook_generator=hook_generator,
  File "/home/mtan/latent-data/vissl/vissl/utils/distributed_launcher.py", line 200, in _distributed_worker
    hook_generator=hook_generator,
  File "/home/mtan/latent-data/vissl/vissl/engines/engine_registry.py", line 93, in run_engine
    hook_generator=hook_generator,
  File "/home/mtan/latent-data/vissl/vissl/engines/train.py", line 46, in run_engine
    hook_generator=hook_generator,
  File "/home/mtan/latent-data/vissl/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/home/mtan/latent-data/vissl/vissl/trainer/trainer_main.py", line 201, in train
    raise e
  File "/home/mtan/latent-data/vissl/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/home/mtan/latent-data/vissl/vissl/trainer/train_steps/standard_train_step.py", line 158, in standard_train_step
    local_loss = task.loss(model_output, target)
  File "/home/mtan/anaconda3/envs/latent-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/mtan/latent-data/vissl/vissl/losses/simclr_info_nce_loss.py", line 58, in forward
    loss = self.info_criterion(normalized_output)
  File "/home/mtan/anaconda3/envs/latent-env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/mtan/latent-data/vissl/vissl/losses/simclr_info_nce_loss.py", line 144, in forward
    pos = torch.sum(similarity * self.pos_mask, 1)
RuntimeError: The size of tensor a (256) must match the size of tensor b (128) at non-singleton dimension 1
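The failing line multiplies the similarity matrix by a precomputed positive-pair mask, so the two widths have to agree. A minimal sketch that reproduces the same error (the shapes are illustrative, not VISSL's actual ones):

```python
import torch

# The mask is sized for the configured batch, while the similarity matrix
# is sized for the batch actually received, so the element-wise product
# fails at dimension 1.
similarity = torch.rand(128, 256)  # width from the batch actually received
pos_mask = torch.rand(128, 128)    # width from the configured batch size
pos = torch.sum(similarity * pos_mask, 1)
# RuntimeError: The size of tensor a (256) must match the size of
# tensor b (128) at non-singleton dimension 1
```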

My config looks like this:

# @package _global_
config:
  VERBOSE: False
  LOG_FREQUENCY: 10
  TEST_ONLY: False
  TEST_MODEL: False
  SEED_VALUE: 0
  MULTI_PROCESSING_METHOD: forkserver
  HOOKS:
    PERF_STATS:
      MONITOR_PERF_STATS: True
      ROLLING_BTIME_FREQ: 313
  DATA:
    NUM_DATALOADER_WORKERS: 5
    TRAIN:
      DATA_SOURCES: [disk_filelist, disk_filelist]
      DATASET_NAMES: [freiburgforest_ssrl, rellis_ssrl]
      BATCHSIZE_PER_REPLICA: 64
      LABEL_TYPE: sample_index    # just an implementation detail. Label isn't used
      TRANSFORMS:
        - name: ImgReplicatePil
          num_times: 2
        - name: RandomResizedCrop
          size: 224
        - name: RandomHorizontalFlip
          p: 0.5
        - name: ImgPilColorDistortion
          strength: 1.0
        - name: ImgPilGaussianBlur
          p: 0.5
          radius_min: 0.1
          radius_max: 2.0
        - name: ToTensor
        - name: Normalize
          mean: [0.28689554, 0.32513303, 0.28389177]
          std: [0.18696375, 0.19017339, 0.18720214]
      COLLATE_FUNCTION: simclr_collator
      MMAP_MODE: True
      COPY_TO_LOCAL_DISK: False
      COPY_DESTINATION_DIR: /tmp/freiburgforest_rellis_ssrl/
      DROP_LAST: True
  TRAINER:
    TRAIN_STEP_NAME: standard_train_step
  METERS:
    name: ""
  MODEL:
    TRUNK:
      NAME: unet
    HEAD:
      PARAMS: [
        ["mlp", {"dims": [25088, 4096], "use_relu": True, "skip_last_layer_relu_bn": False}],
        ["mlp", {"dims": [4096, 128]}],
      ]
    SYNC_BN_CONFIG:
      CONVERT_BN_TO_SYNC_BN: True
      SYNC_BN_TYPE: apex
      GROUP_SIZE: 8
    AMP_PARAMS:
      USE_AMP: False
      # USE_AMP: True
      AMP_ARGS: {"opt_level": "O1"}
  LOSS:
      name: simclr_info_nce_loss
      simclr_info_nce_loss:
        temperature: 0.1
        buffer_params:
          embedding_dim: 128
  OPTIMIZER:
      name: sgd
      use_larc: True
      larc_config:
        clip: False
        trust_coefficient: 0.001
        eps: 0.00000001
      weight_decay: 0.000001
      momentum: 0.9
      nesterov: False
      num_epochs: 100
      # num_epochs: 200
      # num_epochs: 400
      # num_epochs: 500
      # num_epochs: 600
      # num_epochs: 800
      # num_epochs: 1000
      # num_epochs: 1
      # num_epochs: 2
      # num_epochs: 5
      regularize_bn: True
      regularize_bias: True
      param_schedulers:
        lr:
          auto_lr_scaling:
            auto_scale: true
            base_value: 0.3
            base_lr_batch_size: 256
          name: composite
          schedulers:
            - name: linear
              start_value: 0.6
              end_value: 4.8
            - name: cosine
              start_value: 4.8
              end_value: 0.0000
          update_interval: step
          interval_scaling: [rescaled, fixed]
          lengths: [0.1, 0.9]                 # 100ep
          # lengths: [0.05, 0.95]             # 200ep
          # lengths: [0.025, 0.975]           # 400ep
          # lengths: [0.02, 0.98]             # 500ep
          # lengths: [0.0166667, 0.9833333]   # 600ep
          # lengths: [0.0125, 0.9875]         # 800ep
          # lengths: [0.01, 0.99]             # 1000ep
          # lengths: [0.0128, 0.9872]         # 1ep IG-1B
          # lengths: [0.00641, 0.99359]       # 2ep IG-1B
          # lengths: [0.002563, 0.997437]     # 5ep IG-1B = 50 ep IG-100M
  DISTRIBUTED:
    BACKEND: nccl
    NUM_NODES: 1
    NUM_PROC_PER_NODE: 1
    RUN_ID: auto
    INIT_METHOD: tcp
    #NCCL_DEBUG: True
  MACHINE:
    DEVICE: gpu
  CHECKPOINT:
    DIR: "./models/pretrain_freiburgforest_rellis"
    AUTO_RESUME: True
    CHECKPOINT_FREQUENCY: 5
    CHECKPOINT_ITER_FREQUENCY: -1  # set this variable to checkpoint every few iterations
Pedrexus commented 2 years ago

I am also seeing this error on every training run I try with simclr_info_nce_loss or multicrop_simclr_info_nce_loss. It happens during the first epoch, likely when it is close to finishing.

    File "/opt/conda/lib/python3.8/site-packages/vissl/losses/simclr_info_nce_loss.py", line 144, in forward
        pos = torch.sum(similarity * self.pos_mask, 1)
RuntimeError: The size of tensor a (992) must match the size of tensor b (1024) at non-singleton dimension 1
Pedrexus commented 2 years ago

After looking at how the loss is implemented, I believe I have found the reason. Tensor "a" is simply the last, incomplete batch, so you might have to remove the extra images or change the batch size.

I have 386032 images, 8 GPUs, and a batch size of 64 per GPU, with ImgReplicatePil producing 2 views per image.

Thus, in my case you can see that

386032 % (8 * 64) * 2 == 992
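The same check as a runnable sketch (the numbers are from my run; adjust them to yours):

```python
# Images left over in the final, incomplete global batch, times the two
# SimCLR views created by ImgReplicatePil.
num_images = 386032
num_gpus = 8
batch_per_gpu = 64
num_views = 2  # ImgReplicatePil num_times: 2

global_batch = num_gpus * batch_per_gpu  # 512
leftover = num_images % global_batch     # 496
print(leftover * num_views)              # 992 -> matches tensor "a" above
```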

@miriamrebekah It might be that when using 2 datasets, your total number of images is no longer a multiple of the total batch size.

P.S.: I have not tested my fix yet, though, so this is just a hunch for now.

prigoyal commented 2 years ago

@iseessel , will you be able to take a look at this ? :)

iseessel commented 2 years ago

Hi @miriamrebekah, it should be supported. While I investigate, as a temporary solution, would you be able to create one filelist and one dataset_catalog entry that covers both datasets? It should be as simple as concatenating the two filelists, saving the result as a .npy file, and creating a new entry in your dataset_catalog.
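For example, something like this (a sketch; the filenames are placeholders for your actual filelists):

```python
import numpy as np

# Concatenate the two per-dataset filelists into one, then point a new
# dataset_catalog entry at the combined .npy file.
first = np.load("freiburgforest_ssrl_filelist.npy")
second = np.load("rellis_ssrl_filelist.npy")
np.save("freiburgforest_rellis_ssrl_filelist.npy", np.concatenate([first, second]))
```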

@Pedrexus Are you using config.DATA.TRAIN.DROP_LAST=True, as in simclr_8node_resnet.yaml?

Pedrexus commented 2 years ago

> Hi @miriamrebekah, it should be supported. While I investigate, as a temporary solution, would you be able to create one filelist and one dataset_catalog entry that covers both datasets? It should be as simple as concatenating the two filelists, saving the result as a .npy file, and creating a new entry in your dataset_catalog.
>
> @Pedrexus Are you using config.DATA.TRAIN.DROP_LAST=True, as in simclr_8node_resnet.yaml?

Hello @iseessel.

Yes, I have DROP_LAST set to true, but it keeps giving me the same error. For now, I manually removed the extra images from the .npy file, and that solved it.
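In case it helps, that trimming can be scripted instead of done by hand (a sketch; the paths and batch settings are placeholders):

```python
import numpy as np

# Drop the tail of the filelist so the image count is a multiple of the
# global batch size (number of GPUs * per-GPU batch size).
filelist = np.load("my_filelist.npy")
global_batch = 4 * 64  # NUM_PROC_PER_NODE * BATCHSIZE_PER_REPLICA
usable = (len(filelist) // global_batch) * global_batch
np.save("my_filelist_trimmed.npy", filelist[:usable])
```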

Here is a sample of the config I've been testing with:

```python
INFO 2021-11-24 18:53:50,382 hydra_config.py: 141: {'CHECKPOINT': {'APPEND_DISTR_RUN_ID': False, 'AUTO_RESUME': True, 'BACKEND': 'disk', 'CHECKPOINT_FREQUENCY': 1, 'CHECKPOINT_ITER_FREQUENCY': -1, 'DIR': '/home/pvaloi/work/checkpoints/ssl_test', 'LATEST_CHECKPOINT_RESUME_FILE_NUM': 1, 'OVERWRITE_EXISTING': False, 'USE_SYMLINK_CHECKPOINT_FOR_RESUME': False}, 'CLUSTERFIT': {'CLUSTER_BACKEND': 'faiss', 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'SEED': 0}, 'FEATURES': {'DATASET_NAME': '', 'DATA_PARTITION': 'TRAIN', 'DIMENSIONALITY_REDUCTION': 0, 'EXTRACT': False, 'LAYER_NAME': '', 'PATH': '.', 'TEST_PARTITION': 'TEST'}, 'NUM_CLUSTERS': 16000, 'NUM_ITER': 50, 'OUTPUT_DIR': '.'}, 'DATA': {'DDP_BUCKET_CAP_MB': 25, 'ENABLE_ASYNC_GPU_COPY': True, 'NUM_DATALOADER_WORKERS': 16, 'PIN_MEMORY': True, 'TEST': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 64, 'COLLATE_FUNCTION': 'simclr_collator', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '/tmp/ssl', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['places.min'], 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': ['disk_filelist'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': False, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [{'name': 'ImgReplicatePil', 'num_times': 2}, {'name': 'RandomResizedCrop', 'size': 224}, {'name': 'RandomHorizontalFlip', 'p': 0.5}, {'name': 'ImgPilColorDistortion', 'strength': 1.0}, {'name': 'ImgPilGaussianBlur', 'p': 0.5, 'radius_max': 2.0, 'radius_min': 0.1}, {'name': 'ToTensor'}, {'mean': [0.485, 0.456, 0.406], 'name': 'Normalize', 'std': [0.229, 0.224, 0.225]}], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}, 'TRAIN': {'BASE_DATASET': 'generic_ssl', 'BATCHSIZE_PER_REPLICA': 64, 'COLLATE_FUNCTION': 'simclr_collator', 'COLLATE_FUNCTION_PARAMS': {}, 'COPY_DESTINATION_DIR': '/tmp/ssl', 'COPY_TO_LOCAL_DISK': False, 'DATASET_NAMES': ['places.min'], 'DATA_LIMIT': -1, 'DATA_LIMIT_SAMPLING': {'IS_BALANCED': False, 'SEED': 0, 'SKIP_NUM_SAMPLES': 0}, 'DATA_PATHS': [], 'DATA_SOURCES': ['disk_filelist'], 'DEFAULT_GRAY_IMG_SIZE': 224, 'DROP_LAST': False, 'ENABLE_QUEUE_DATASET': False, 'INPUT_KEY_NAMES': ['data'], 'LABEL_PATHS': [], 'LABEL_SOURCES': [], 'LABEL_TYPE': 'sample_index', 'MMAP_MODE': False, 'NEW_IMG_PATH_PREFIX': '', 'RANDOM_SYNTHETIC_IMAGES': False, 'REMOVE_IMG_PATH_PREFIX': '', 'TARGET_KEY_NAMES': ['label'], 'TRANSFORMS': [{'name': 'ImgReplicatePil', 'num_times': 2}, {'name': 'RandomResizedCrop', 'size': 224}, {'name': 'RandomHorizontalFlip', 'p': 0.5}, {'name': 'ImgPilColorDistortion', 'strength': 1.0}, {'name': 'ImgPilGaussianBlur', 'p': 0.5, 'radius_max': 2.0, 'radius_min': 0.1}, {'name': 'ToTensor'}, {'mean': [0.485, 0.456, 0.406], 'name': 'Normalize', 'std': [0.229, 0.224, 0.225]}], 'USE_DEBUGGING_SAMPLER': False, 'USE_STATEFUL_DISTRIBUTED_SAMPLER': False}}, 'DISTRIBUTED': {'BACKEND': 'nccl', 'BROADCAST_BUFFERS': True, 'INIT_METHOD': 'tcp', 'MANUAL_GRADIENT_REDUCTION': True, 'NCCL_DEBUG': False, 'NCCL_SOCKET_NTHREADS': '', 'NUM_NODES': 1, 'NUM_PROC_PER_NODE': 4, 'RUN_ID': 'auto'}, 'EXTRA': {'DATASET': 'places.min', 'MODEL': None}, 'EXTRACT_FEATURES': {'CHUNK_THRESHOLD': 0, 'OUTPUT_DIR': ''}, 
'HOOKS': {'CHECK_NAN': True, 'LOG_GPU_STATS': True, 'MEMORY_SUMMARY': {'DUMP_MEMORY_ON_EXCEPTION': False, 'LOG_ITERATION_NUM': 0, 'PRINT_MEMORY_SUMMARY': True}, 'MODEL_COMPLEXITY': {'COMPUTE_COMPLEXITY': False, 'INPUT_SHAPE': [3, 224, 224]}, 'PERF_STATS': {'MONITOR_PERF_STATS': True, 'PERF_STAT_FREQUENCY': 10, 'ROLLING_BTIME_FREQ': 5}, 'TENSORBOARD_SETUP': {'EXPERIMENT_LOG_DIR': 'tensorboard', 'FLUSH_EVERY_N_MIN': 30, 'LOG_DIR': '.', 'LOG_PARAMS': True, 'LOG_PARAMS_EVERY_N_ITERS': 100, 'LOG_PARAMS_GRADIENTS': True, 'USE_TENSORBOARD': False}}, 'IMG_RETRIEVAL': {'CROP_QUERY_ROI': False, 'DATASET_PATH': '', 'DEBUG_MODE': False, 'EVAL_BINARY_PATH': '', 'EVAL_DATASET_NAME': 'Paris', 'FEATS_PROCESSING_TYPE': '', 'GEM_POOL_POWER': 4.0, 'IMG_SCALINGS': [1], 'NORMALIZE_FEATURES': True, 'NUM_DATABASE_SAMPLES': -1, 'NUM_QUERY_SAMPLES': -1, 'NUM_TRAINING_SAMPLES': -1, 'N_PCA': 512, 'RESIZE_IMG': 1024, 'SAVE_FEATURES': False, 'SAVE_RETRIEVAL_RANKINGS_SCORES': True, 'SIMILARITY_MEASURE': 'cosine_similarity', 'SPATIAL_LEVELS': 3, 'TRAIN_DATASET_NAME': 'Oxford', 'TRAIN_PCA_WHITENING': True, 'USE_DISTRACTORS': False, 'WHITEN_IMG_LIST': ''}, 'LOG_FREQUENCY': 100, 'LOSS': {'CrossEntropyLoss': {'ignore_index': -1}, 'barlow_twins_loss': {'embedding_dim': 8192, 'lambda_': 0.0051, 'scale_loss': 0.024}, 'bce_logits_multiple_output_single_target': {'normalize_output': False, 'reduction': 'none', 'world_size': 1}, 'cross_entropy_multiple_output_single_target': {'ignore_index': -1, 'normalize_output': False, 'reduction': 'mean', 'temperature': 1.0, 'weight': None}, 'deepclusterv2_loss': {'BATCHSIZE_PER_REPLICA': 256, 'DROP_LAST': True, 'kmeans_iters': 10, 'memory_params': {'crops_for_mb': [0], 'embedding_dim': 128}, 'num_clusters': [3000, 3000, 3000], 'num_crops': 2, 'num_train_samples': -1, 'temperature': 0.1}, 'dino_loss': {'crops_for_teacher': [0, 1], 'ema_center': 0.9, 'momentum': 0.996, 'normalize_last_layer': True, 'output_dim': 65536, 'student_temp': 0.1, 'teacher_temp_max': 0.07, 'teacher_temp_min': 0.04, 'teacher_temp_warmup_iters': 37500}, 'moco_loss': {'embedding_dim': 128, 'momentum': 0.999, 'queue_size': 65536, 'temperature': 0.2}, 'multicrop_simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 4096, 'embedding_dim': 128, 'world_size': 64}, 'num_crops': 2, 'temperature': 0.1}, 'name': 'simclr_info_nce_loss', 'nce_loss_with_memory': {'loss_type': 'nce', 'loss_weights': [1.0], 'memory_params': {'embedding_dim': 128, 'memory_size': -1, 'momentum': 0.5, 'norm_init': True, 'update_mem_on_forward': True}, 'negative_sampling_params': {'num_negatives': 16000, 'type': 'random'}, 'norm_constant': -1, 'norm_embedding': True, 'num_train_samples': -1, 'temperature': 0.07, 'update_mem_with_emb_index': -100}, 'simclr_info_nce_loss': {'buffer_params': {'effective_batch_size': 512, 'embedding_dim': 128, 'world_size': 4}, 'temperature': 0.1}, 'swav_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'output_dir': '.', 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temp_hard_assignment_iters': 0, 'temperature': 0.1, 'use_double_precision': False}, 'swav_momentum_loss': {'crops_for_assign': [0, 1], 'embedding_dim': 128, 'epsilon': 0.05, 'momentum': 0.99, 'momentum_eval_mode_iter_start': 0, 'normalize_last_layer': True, 'num_crops': 2, 'num_iters': 3, 'num_prototypes': [3000], 'queue': {'local_queue_length': 0, 'queue_length': 0, 'start_iter': 0}, 'temperature': 0.1, 
'use_double_precision': False}}, 'MACHINE': {'DEVICE': 'gpu'}, 'METERS': {'accuracy_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'enable_training_meter': True, 'mean_ap_list_meter': {'max_cpu_capacity': -1, 'meter_names': [], 'num_classes': 9605, 'num_meters': 1}, 'model_output_mask': False, 'name': '', 'names': [], 'precision_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}, 'recall_at_k_list_meter': {'meter_names': [], 'num_meters': 1, 'topk_values': [1]}}, 'MODEL': {'ACTIVATION_CHECKPOINTING': {'NUM_ACTIVATION_CHECKPOINTING_SPLITS': 2, 'USE_ACTIVATION_CHECKPOINTING': True}, 'AMP_PARAMS': {'AMP_ARGS': {'opt_level': 'O2'}, 'AMP_TYPE': 'apex', 'USE_AMP': True}, 'BASE_MODEL_NAME': 'multi_input_output_model', 'CUDA_CACHE': {'CLEAR_CUDA_CACHE': False, 'CLEAR_FREQ': 100}, 'FEATURE_EVAL_SETTINGS': {'EVAL_MODE_ON': False, 'EVAL_TRUNK_AND_HEAD': False, 'EXTRACT_TRUNK_FEATURES_ONLY': False, 'FREEZE_TRUNK_AND_HEAD': False, 'FREEZE_TRUNK_ONLY': False, 'LINEAR_EVAL_FEAT_POOL_OPS_MAP': [], 'SHOULD_FLATTEN_FEATS': True}, 'FSDP_CONFIG': {'AUTO_WRAP_THRESHOLD': 0, 'bucket_cap_mb': 0, 'clear_autocast_cache': True, 'compute_dtype': torch.float32, 'flatten_parameters': True, 'fp32_reduce_scatter': False, 'mixed_precision': True, 'verbose': True}, 'GRAD_CLIP': {'MAX_NORM': 1, 'NORM_TYPE': 2, 'USE_GRAD_CLIP': False}, 'HEAD': {'BATCHNORM_EPS': 1e-05, 'BATCHNORM_MOMENTUM': 0.1, 'PARAMS': [['mlp', {'dims': [2048, 128], 'skip_last_layer_relu_bn': False, 'use_relu': True}]], 'PARAMS_MULTIPLIER': 1.0}, 'INPUT_TYPE': 'rgb', 'MULTI_INPUT_HEAD_MAPPING': [], 'NON_TRAINABLE_PARAMS': [], 'SHARDED_DDP_SETUP': {'USE_SDP': False, 'reduce_buffer_size': -1}, 'SINGLE_PASS_EVERY_CROP': False, 'SYNC_BN_CONFIG': {'CONVERT_BN_TO_SYNC_BN': True, 'GROUP_SIZE': -1, 'SYNC_BN_TYPE': 'apex'}, 'TEMP_FROZEN_PARAMS_ITER_MAP': [], 'TRUNK': {'CONVIT': {'CLASS_TOKEN_IN_LOCAL_LAYERS': False, 'LOCALITY_DIM': 10, 'LOCALITY_STRENGTH': 1.0, 'N_GPSA_LAYERS': 10, 'USE_LOCAL_INIT': True}, 'EFFICIENT_NETS': {}, 'NAME': 'resnet', 'REGNET': {}, 'RESNETS': {'DEPTH': 50, 'GROUPNORM_GROUPS': 32, 'GROUPS': 1, 'LAYER4_STRIDE': 2, 'NORM': 'BatchNorm', 'STANDARDIZE_CONVOLUTIONS': False, 'WIDTH_MULTIPLIER': 1, 'WIDTH_PER_GROUP': 64, 'ZERO_INIT_RESIDUAL': False}, 'VISION_TRANSFORMERS': {'ATTENTION_DROPOUT_RATE': 0, 'CLASSIFIER': 'token', 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0, 'HIDDEN_DIM': 768, 'IMAGE_SIZE': 224, 'MLP_DIM': 3072, 'NUM_HEADS': 12, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': False, 'QK_SCALE': False, 'name': None}, 'XCIT': {'ATTENTION_DROPOUT_RATE': 0, 'DROPOUT_RATE': 0, 'DROP_PATH_RATE': 0.05, 'ETA': 1, 'HIDDEN_DIM': 384, 'IMAGE_SIZE': 224, 'NUM_HEADS': 8, 'NUM_LAYERS': 12, 'PATCH_SIZE': 16, 'QKV_BIAS': True, 'QK_SCALE': False, 'TOKENS_NORM': True, 'name': None}}, 'WEIGHTS_INIT': {'APPEND_PREFIX': '', 'PARAMS_FILE': '', 'REMOVE_PREFIX': '', 'SKIP_LAYERS': ['num_batches_tracked'], 'STATE_DICT_KEY_NAME': 'classy_state_dict'}, '_MODEL_INIT_SEED': 0}, 'MONITORING': {'MONITOR_ACTIVATION_STATISTICS': 0}, 'MULTI_PROCESSING_METHOD': 'forkserver', 'NEAREST_NEIGHBOR': {'L2_NORM_FEATS': False, 'SIGMA': 0.1, 'TOPK': 200}, 'OPTIMIZER': {'betas': [0.9, 0.999], 'construct_single_param_group_only': False, 'head_optimizer_params': {'use_different_lr': False, 'use_different_wd': False, 'weight_decay': 1e-06}, 'larc_config': {'clip': True, 'eps': 1e-08, 'trust_coefficient': 0.001}, 'momentum': 0.9, 'name': 'sgd', 'nesterov': False, 'non_regularized_parameters': [], 'num_epochs': 10, 'param_schedulers': {'lr': 
{'auto_lr_scaling': {'auto_scale': True, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0003, 'interval_scaling': ['rescaled', 'rescaled'], 'is_adaptive': True, 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'schedulers': [{'end_value': 0.0003, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 0.3, 'wave_type': 'full'}], 'start_value': 0.3, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001], 'wave_type': 'full'}, 'lr_head': {'auto_lr_scaling': {'auto_scale': True, 'base_lr_batch_size': 256, 'base_value': 0.3, 'scaling_type': 'linear'}, 'end_value': 0.0003, 'interval_scaling': ['rescaled', 'rescaled'], 'is_adaptive': True, 'lengths': [0.1, 0.9], 'milestones': [30, 60], 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'schedulers': [{'end_value': 0.0003, 'is_adaptive': True, 'name': 'cosine_warm_restart', 'restart_interval_length': 0.334, 'start_value': 0.3, 'wave_type': 'full'}], 'start_value': 0.3, 'update_interval': 'step', 'value': 0.1, 'values': [0.1, 0.01, 0.001], 'wave_type': 'full'}}, 'regularize_bias': True, 'regularize_bn': True, 'use_larc': True, 'use_zero': False, 'weight_decay': 1e-06}, 'PROFILING': {'MEMORY_PROFILING': {'TRACK_BY_LAYER_MEMORY': False}, 'NUM_ITERATIONS': 10, 'OUTPUT_FOLDER': '.', 'PROFILED_RANKS': [0, 1], 'RUNTIME_PROFILING': {'LEGACY_PROFILER': False, 'PROFILE_CPU': True, 'PROFILE_GPU': True, 'USE_PROFILER': False}, 'START_ITERATION': 0, 'STOP_TRAINING_AFTER_PROFILING': False, 'WARMUP_ITERATIONS': 0}, 'REPRODUCIBILITY': {'CUDDN_DETERMINISTIC': False}, 'SEED_VALUE': 0, 'SLURM': {'ADDITIONAL_PARAMETERS': {}, 'COMMENT': 'vissl job', 'CONSTRAINT': '', 'LOG_FOLDER': '.', 'MEM_GB': 250, 'NAME': 'vissl', 'NUM_CPU_PER_PROC': 8, 'PARTITION': '', 'PORT_ID': 40050, 'TIME_HOURS': 72, 'TIME_MINUTES': 0, 'USE_SLURM': False}, 'SVM': {'cls_list': [], 'costs': {'base': -1.0, 'costs_list': [0.1, 0.01], 'power_range': [4, 20]}, 'cross_val_folds': 3, 'dual': True, 'force_retrain': False, 'loss': 'squared_hinge', 'low_shot': {'dataset_name': 'voc', 'k_values': [1, 2, 4, 8, 16, 32, 64, 96], 'sample_inds': [1, 2, 3, 4, 5]}, 'max_iter': 2000, 'normalize': True, 'penalty': 'l2'}, 'TEST_EVERY_NUM_EPOCH': 1, 'TEST_MODEL': True, 'TEST_ONLY': False, 'TRAINER': {'TASK_NAME': 'self_supervision_task', 'TRAIN_STEP_NAME': 'standard_train_step'}, 'VERBOSE': True}
```
iseessel commented 2 years ago

@Pedrexus Is this the same config you had those errors with? This one has DATA.TRAIN.DROP_LAST=False and DATA.TEST.DROP_LAST=False.

Pedrexus commented 2 years ago

Indeed, I tested with a different config and forgot to update the right one. I just retried with DATA.TRAIN.DROP_LAST=True and DATA.TEST.DROP_LAST=True, and it worked fine! No error!

Thanks for all your help!

miriamrebekah commented 2 years ago

> Hi @miriamrebekah, it should be supported. While I investigate, as a temporary solution, would you be able to create one filelist and one dataset_catalog entry that covers both datasets? It should be as simple as concatenating the two filelists, saving the result as a .npy file, and creating a new entry in your dataset_catalog.
>
> @Pedrexus Are you using config.DATA.TRAIN.DROP_LAST=True, as in simclr_8node_resnet.yaml?

Yes, I am doing this as a workaround! Thanks!