OML-Team / open-metric-learning

Metric learning and retrieval pipelines, models and zoo.
https://open-metric-learning.readthedocs.io/en/latest/index.html
Apache License 2.0

[WIP] Add datasets to config api (pipelines) #585

leoromanovich opened this issue 5 months ago

leoromanovich commented 5 months ago

For now, we don't push you to follow any predefined issue schema, but make sure you've already read our contribution guide: https://open-metric-learning.readthedocs.io/en/latest/from_readme/contributing.html.

leoromanovich commented 5 months ago

@AlekseySh what do you think about implementation like that?

AlekseySh commented 5 months ago

@leoromanovich by the way, we also need to update the postprocessing pipeline. It seems like get_loaders_with_embeddings is already in a format very close to what we expect in the builders registry, right?
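
For context, a builders registry along these lines might look like the following minimal sketch. The registry and wrapper names are illustrative, and the assumed signature get_loaders_with_embeddings(cfg) is an assumption, not the actual OML API:

def build_loaders_with_embeddings(cfg):
  # thin wrapper so the existing function can be registered under a name;
  # assumes get_loaders_with_embeddings already accepts a config dict
  return get_loaders_with_embeddings(cfg)

BUILDERS_REGISTRY = {
  "loaders_with_embeddings": build_loaders_with_embeddings,
}

def get_builder_by_name(name):
  return BUILDERS_REGISTRY[name]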

leoromanovich commented 5 months ago

> @leoromanovich by the way, we also need to update the postprocessing pipeline. It seems like get_loaders_with_embeddings is already in a format very close to what we expect in the builders registry, right?

Looks like it :) I've added some changes, and the tests passed.

leoromanovich commented 4 months ago

@AlekseySh Please check the changes for the reranking builder. What I don't like about the current solution: because a feature extractor is used inside, we need to pass args for the extractor (like precision, num_workers, bs_inference). Not critical, but if we decide the reranking dataset builder approach is good enough, I believe we need to rework the options that are currently used in different places higher up in the config.
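
To illustrate the concern, a reranking builder config might end up duplicating the inference options both at the top level and inside the builder. This is a hypothetical fragment; the builder name and nesting are illustrative, not the actual pipeline schema:

# top-level inference options for the pipeline itself
precision: 32
bs_inference: 256
num_workers: 8

dataset:
  name: reranking_dataset  # hypothetical builder name
  args:
    # duplicated here because the builder runs the extractor internally
    precision: 32
    bs_inference: 256
    num_workers: 8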

AlekseySh commented 4 months ago

Based on an offline discussion:

------------------------
PREDICT.YAML

precision: 32
accelerator: gpu
devices: 1

dataset:
  name: BaseImgDataset
  im_paths: ...  # or im_dir
  transforms_predict:
    name: norm_resize_albu
    args:
      im_size: 224

save_dir: "."

bs: 64
num_workers: 10

extractor:
  name: vit
  args:
    arch: vits16
    normalise_features: False
    use_multi_scale: False
    weights: vits16_cars

hydra:
  run:
    dir: ${save_dir}
  searchpath:
   - pkg://oml.configs
  job:
    chdir: True

-------------------
VALIDATE.YAML

accelerator: gpu
devices: 1
precision: 32

bs_val: 256
num_workers: 8

val_dataset:
  name: image_dataset
  dataframe_name: df_with_bboxes.csv  # df/path_to_df
  args:
    dataset_root: data/CARS196/
    transforms_val:
      name: norm_resize_albu
      args:
        im_size: 224

extractor:
  name: vit
  args:
    arch: vits16
    normalise_features: False
    use_multi_scale: False
    weights: vits16_cars

metric_args:
  metrics_to_exclude_from_visualization: [cmc,]
  cmc_top_k: [1, 5]
  map_top_k: [5]
  precision_top_k: [5]
  fmr_vals: [0.01]
  pcf_variance: [0.5, 0.9, 0.99]
  return_only_overall_category: False
  visualize_only_overall_category: True

hydra:
  searchpath:
   - pkg://oml.configs
  job:
    chdir: True

-----------------------------
REGISTRY.PY

import pandas as pd

REGISTRY_DATASETS = {
  "image_dataset": ImageQGLDataset,
}

def get_dataset_by_cfg(cfg, split_val=None):
  df = None  # not all datasets come with a dataframe
  if split_val and "dataframe_name" in cfg:
    df = pd.read_csv(cfg["dataframe_name"])
    df = df[df.split == split_val]
  return REGISTRY_DATASETS[cfg["name"]](df=df)

-------------------------------
PIPELINES.PY

train_dataset = get_dataset_by_cfg(cfg["train_dataset"], split_val="train")
val_dataset = get_dataset_by_cfg(cfg["valid_dataset"], split_val="validate")

leoromanovich commented 4 months ago

Looks like we can't avoid a small builder because of transforms initialisation and mapping. From this:

REGISTRY_DATASETS = {
  "image_dataset": ImageQGLDataset,
}

def get_dataset_by_cfg(cfg, split_val=None):
  df = None
  if split_val and "dataframe_name" in cfg:
    df = pd.read_csv(cfg["dataframe_name"])
    df = df[df.split == split_val]
  return REGISTRY_DATASETS[cfg["name"]](df=df)

to something like this:

def qg_builder(cfg):
    # builds transforms and the dataset from the config
    transforms = ...
    dataset = QGDataset(...)
    return dataset

REGISTRY_DATASETS = {
  "image_qg_dataset": qg_builder,
}

def get_dataset_by_cfg(cfg, split_val=None):
  if split_val and "dataframe_name" in cfg:
    df = pd.read_csv(cfg["dataframe_name"])
    mapper = {...}
    df = df[df.split == split_val]
    df = df[...].map(mapper)  # the column to remap is elided in this sketch
    cfg["df"] = df  # because not all datasets will have a dataframe, we pack it inside cfg
  return REGISTRY_DATASETS[cfg["name"]](cfg)

upd: we decided offline to build the transforms before dataset initialisation.
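
A minimal sketch of that decision, assuming a transforms factory along the lines of the transforms registry used elsewhere in OML (the exact entry point and config keys are assumptions):

def get_dataset_by_cfg(cfg, split_val=None):
  # transforms are built from the config first...
  transforms = get_transforms_by_cfg(cfg["args"]["transforms_val"])  # assumed factory
  if split_val and "dataframe_name" in cfg:
    df = pd.read_csv(cfg["dataframe_name"])
    df = df[df.split == split_val]
    cfg["df"] = df
  # ...and passed into the builder alongside the packed cfg
  return REGISTRY_DATASETS[cfg["name"]](cfg, transforms=transforms)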