drivendataorg / zamba

A Python package for identifying 42 kinds of animals, training custom models, and estimating distance from camera trap videos
https://zamba.drivendata.org/docs/stable/
MIT License
118 stars 27 forks

Validating the species fails if order is different #234

Closed. pjbull closed this issue 2 years ago

pjbull commented 2 years ago

validate_species fails if the species order differs between the dataloaders and the model, even when both contain the same species. Reproduced by training a model on a subset of labels (aardvark and blank).

Passed config

train_config:
  auto_lr_find: false
  backbone_finetune_config:
    backbone_initial_ratio_lr: 0.01
    multiplier: 1
    pre_train_bn: true
    train_bn: false
    unfreeze_backbone_at_epoch: 3
    verbose: true
  batch_size: 1
  data_dir: /storage
  early_stopping_config:
    patience: 5
  labels: /storage/inferencejobs/train-job-uuid/train_labels.csv
  model_cache_dir: /root/model_weight_cache
  model_name: time_distributed
  num_workers: 1
  overwrite: true
  save_dir: /storage/inferencejobs/train-job-uuid/train_output/
video_loader_config:
  crop_bottom_pixels: 50
  early_bias: false
  ensure_total_frames: true
  evenly_sample_total_frames: false
  fps: 1.0
  frame_indices: null
  frame_selection_height: null
  frame_selection_width: null
  i_frames: false
  megadetector_lite_config:
    confidence: 0.25
    fill_mode: score_sorted
    image_height: 224
    image_width: 224
    n_frames: 16
    nms_threshold: 0.45
    seed: 55
    sort_by_time: true
  model_input_height: 240
  model_input_width: 426
  pix_fmt: rgb24
  scene_threshold: null
  total_frames: 16

Output logs before error

2022-09-27 05:03:54.066 | INFO     | zamba.models.config:validate_filepaths_and_labels:458 - Validating labels csv.
2022-09-27 05:03:54.075 | INFO     | zamba.models.config:check_files_exist_and_load:114 - Checking all 10 filepaths exist. Trying fast file checking...

QUEUEING TASKS | :   0%|          | 0/10 [00:00<?, ?it/s]
QUEUEING TASKS | : 100%|██████████| 10/10 [00:00<00:00, 3823.43it/s]

PROCESSING TASKS | :   0%|          | 0/10 [00:00<?, ?it/s]
PROCESSING TASKS | : 100%|██████████| 10/10 [00:00<00:00, 9541.18it/s]

COLLECTING RESULTS | :   0%|          | 0/10 [00:00<?, ?it/s]
COLLECTING RESULTS | : 100%|██████████| 10/10 [00:00<00:00, 139810.13it/s]
2022-09-27 05:03:54.083 | INFO     | zamba.models.config:check_files_exist_and_load:152 - Checking that all videos can be loaded. If you're very confident all your videos can be loaded, you can skip this with `skip_load_validation`, but it's not recommended.

  0%|          | 0/10 [00:00<?, ?it/s]
 20%|██        | 2/10 [00:00<00:00, 11.99it/s]
 40%|████      | 4/10 [00:00<00:00, 12.08it/s]
 60%|██████    | 6/10 [00:00<00:00, 12.13it/s]
 80%|████████  | 8/10 [00:00<00:00, 12.10it/s]
100%|██████████| 10/10 [00:00<00:00, 12.18it/s]
100%|██████████| 10/10 [00:00<00:00, 12.13it/s]
2022-09-27 05:03:55.244 | INFO     | zamba.models.config:preprocess_labels:564 - Preprocessing labels into one hot encoded labels with one row per video.
2022-09-27 05:03:55.250 | INFO     | zamba.models.config:make_split:608 - Dividing videos into train, val, and holdout sets using the following split proportions: {'train': 3, 'val': 1, 'holdout': 1}.
2022-09-27 05:03:55.250 | INFO     | zamba.models.config:make_split:620 - No 'site' column found so videos for each species will be randomly allocated across splits using provided split proportions.
2022-09-27 05:03:55.254 | INFO     | zamba.models.config:make_split:651 - train      5
holdout    3
val        2
Name: split, dtype: int64
2022-09-27 05:03:55.254 | INFO     | zamba.models.config:make_split:655 - Writing out split information to /storage/inferencejobs/train-job-uuid/train_output/splits.csv.
The following configuration will be used for training:

    Config file: /storage/inferencejobs/train-job-uuid/train-job-uuid.yml
    Data directory: /storage
    Labels csv: /storage/inferencejobs/train-job-uuid/train_labels.csv
    Species: 
    - aardvark
    - blank
    Model name: time_distributed
    Checkpoint: None
    Batch size: 1
    Number of workers: 1
    GPUs: 0
    Dry run: False
    Save directory: /storage/inferencejobs/train-job-uuid/train_output
    Weight download region: us

Skipping confirmation and proceeding to train.
2022-09-27 05:03:55.263 | INFO     | zamba.models.model_manager:instantiate_model:72 - Instantiating model: TimeDistributedEfficientNet
2022-09-27 05:03:55.265 | INFO     | zamba.models.model_manager:resume_training:161 - Provided species fully overlap with Zamba species. Resuming training from latest checkpoint.
Downloading: "https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-weights/efficientnetv2_rw_m_agc-3d90cb1e.pth" to /root/.cache/torch/hub/checkpoints/efficientnetv2_rw_m_agc-3d90cb1e.pth
2022-09-27 05:04:04.262 | INFO     | zamba.models.model_manager:log_schedulers:184 - Using learning rate scheduler: MultiStepLR
2022-09-27 05:04:04.262 | INFO     | zamba.models.model_manager:log_schedulers:185 - Using scheduler params: {'gamma': 0.5, 'milestones': [3], 'verbose': True}

Error message

ValueError: Dataloader species and model species do not match.

Train dataset includes:
aardvark, blank, antelope_duiker, hyena, leopard, forest_buffalo, rodent, hare_rabbit, giraffe, reptile, hog, mongoose, bird, large_flightless_bird, small_cat, chimpanzee_bonobo, cheetah, equid, lion,
pangolin, fox, human, hippopotamus, badger, monkey_prosimian, elephant, gorilla, cattle, bat, porcupine, civet_genet, wild_dog_jackal

Val dataset includes:
aardvark, blank, antelope_duiker, hyena, leopard, forest_buffalo, rodent, hare_rabbit, giraffe, reptile, hog, mongoose, bird, large_flightless_bird, small_cat, chimpanzee_bonobo, cheetah, equid, lion,
pangolin, fox, human, hippopotamus, badger, monkey_prosimian, elephant, gorilla, cattle, bat, porcupine, civet_genet, wild_dog_jackal

Test dataset includes:
aardvark, blank, antelope_duiker, hyena, leopard, forest_buffalo, rodent, hare_rabbit, giraffe, reptile, hog, mongoose, bird, large_flightless_bird, small_cat, chimpanzee_bonobo, cheetah, equid, lion,
pangolin, fox, human, hippopotamus, badger, monkey_prosimian, elephant, gorilla, cattle, bat, porcupine, civet_genet, wild_dog_jackal

Model predicts:
aardvark, antelope_duiker, badger, bat, bird, blank, cattle, cheetah, chimpanzee_bonobo, civet_genet, elephant, equid, forest_buffalo, fox, giraffe, gorilla, hare_rabbit, hippopotamus, hog, human,
hyena, large_flightless_bird, leopard, lion, mongoose, monkey_prosimian, pangolin, porcupine, reptile, rodent, small_cat, wild_dog_jackal

ejm714 commented 2 years ago

This comes from this line: https://github.com/drivendataorg/zamba/blob/master/zamba/models/model_manager.py#L178

Previously we sorted the columns in place. To avoid relying on that sorting assumption, we now set the column order explicitly. But here we're assigning the result to a new labels object that never gets used. At this point in the code, when we're adding more columns, we're working directly with the train_config.labels object, which is admittedly a fragile approach. I'm not sure what the right fix is yet.

The error is useful in that it's right -- our dataloaders and model have different orders and that would yield poor results.
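
As a rough illustration of the failure mode (this is not the actual zamba code; the three-species list, the `labels` frame, and all variable names are hypothetical stand-ins), reordering columns into a new variable leaves the original frame untouched, so an order-sensitive comparison still fails even though both sides hold the same species:

```python
import pandas as pd

# Hypothetical stand-ins: a short model species list (in model order) and a
# labels frame that starts with the user's species, with the remaining
# species appended afterwards as all-zero columns.
model_species = ["aardvark", "antelope_duiker", "blank"]
labels = pd.DataFrame({"aardvark": [1, 0], "blank": [0, 1]})
labels["antelope_duiker"] = 0  # missing species lands at the end

# Bug pattern: the reordered frame is bound to a new name that is never used,
# so `labels` keeps its original column order.
reordered = labels[model_species]

dataloader_species = list(labels.columns)
same_set = set(dataloader_species) == set(model_species)   # True
same_order = dataloader_species == model_species           # False

# One way out in this toy case: use the reordered frame from here on.
fixed = labels.reindex(columns=model_species)
fixed_order = list(fixed.columns) == model_species         # True
```

The order-sensitive check matters because a one-hot column at position i is read as the model's class i; with mismatched orders the targets would silently point at the wrong species.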

ejm714 commented 2 years ago

I think the fix is to do this when we're setting up labels in the configs. It's also conceptually clearer not to do any labels modification in instantiate_model. We can set the labels column as a categorical before we one-hot encode, and set the categories to be all the species on the model if we're using the default model labels.
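
A minimal sketch of that idea in pandas, assuming a long-format labels csv with `filepath` and `label` columns; the species list and the data below are illustrative only, and this is not the change that was actually merged:

```python
import pandas as pd

# Hypothetical stand-in for the default model's species, in model order.
model_species = ["aardvark", "antelope_duiker", "badger", "blank"]

# Illustrative long-format labels covering only a subset of the species.
labels = pd.DataFrame(
    {
        "filepath": ["a.mp4", "b.mp4", "c.mp4"],
        "label": ["blank", "aardvark", "blank"],
    }
).set_index("filepath")

# Fix both the set of categories and their order before encoding.
labels["label"] = pd.Categorical(labels["label"], categories=model_species)

# get_dummies on a categorical emits one column per category, in category
# order, including categories that never occur in the data.
one_hot = pd.get_dummies(labels["label"], dtype=int)
```

Because the column order comes from the categorical itself, nothing downstream needs to add or reorder species columns.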