ecmwf / anemoi-training

Apache License 2.0

datashader does not handle NaNs well in ocean plots #162

Open lzampier opened 1 week ago

lzampier commented 1 week ago

What happened?

With the new default configuration in anemoi-training, which sets datashader: True, the colorbar bounds in the callback plots are set incorrectly. This appears to be caused by NaNs in the field.

Here is an example of the incorrect plots: (image: Screenshot 2024-11-25 at 14 29 05)

The plots are correct when datashader is disabled.
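
As an illustration of how NaNs can break colorbar bounds, here is a minimal NumPy sketch. It is not the anemoi-training or datashader code, and the issue does not say where the bounds are actually computed; `nan_safe_bounds` is a hypothetical helper. It assumes the bounds come from a plain min/max over the field, which returns NaN as soon as a single NaN is present.

```python
import numpy as np


def nan_safe_bounds(field):
    """Return (vmin, vmax) computed over finite values only.

    Hypothetical helper for illustration; returns None when the field
    contains no finite values at all.
    """
    field = np.asarray(field, dtype=float)
    finite = field[np.isfinite(field)]
    if finite.size == 0:
        return None
    return float(finite.min()), float(finite.max())


# Toy ocean field: NaN stands in for land points.
field = np.array([np.nan, 0.2, 1.5, np.nan, -0.3])

# A plain min/max propagates NaN, so both bounds become NaN.
naive = (field.min(), field.max())

# Masking non-finite values first gives usable bounds.
safe = nan_safe_bounds(field)
```

If the plotting path does something similar, passing the finite-only bounds as explicit colour limits would be one way to keep the colorbar meaningful.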

What are the steps to reproduce the bug?

Here is an ORAS6-based config for anemoi-training that can be used to reproduce the issue. Please ask @lzampier if you need more details.

# anemoi-training: develop
# anemoi-models: feature/mask-bounding-dependent-ice-variables

defaults:
- data: mod_oce_for_atm
- dataloader: native_grid
- diagnostics: evaluation
- hardware: atos
- graph: encoder_decoder_only
- model: transformer
- training: default
- _self_

### This file is for local experimentation.
##  When you commit your changes, assign the new features and keywords
##  to the correct defaults.
# For example to change from default GPU count:
# hardware:
#   num_gpus_per_node: 1

data:
  resolution: o96
  normalizer:
    min-max: [avg_sivol, avg_siconc, avg_icesalt, avg_sialb, avg_siue, avg_sivn, avg_snvol]
    max:
    none:
    - cos_latitude
    - sin_latitude
    - cos_longitude
    - sin_longitude
    - cos_solar_zenith_angle
    - cos_julian_day
    - cos_local_time
    - sin_julian_day
    - sin_local_time
  frequency: 6h
  timestep: 24h
  diagnostic:
  forcing:
  - cos_latitude
  - sin_latitude
  - cos_longitude
  - sin_longitude
  - cos_solar_zenith_angle
  - cos_julian_day
  - cos_local_time
  - sin_julian_day
  - sin_local_time
  - 10u
  - 10v
  - 2t
  - 2d
  - ssrd
  - strd
  - tp
  - msl
  - lsm
  imputer:
    mean:
      - avg_zos
      - avg_tos
      - avg_sos
      - avg_svn
      - avg_sve
  const_imputer:
    0:
      - avg_sivol
      - avg_siconc
      - avg_icesalt
      - avg_sialb
      - avg_siue
      - avg_sivn
      - avg_snvol

hardware:
  paths:
    data: /home/mlx/ai-ml/datasets/
  files:
    dataset_atm: aifs-ea-an-oper-0001-mars-${data.resolution}-1979-2023-6h-v7.zarr
    dataset_oce: aifs-o6-tpa-ocda-0001-mars-${data.resolution}-2005-2023-6h-v2-ocean-surface-sea-ice.zarr
diagnostics:
  log:
    mlflow:
      enabled: True
      offline: False
      authentication: True
      experiment_name: 'coupled-ocean-atmos'
      run_name: 'mod:oce - for:atm - 24h - 2005-2021'

model:
  num_channels: 256
  bounding: #These are applied in order
    - _target_: anemoi.models.layers.bounding.ReluBounding #[0, infinity)
      variables:
        - avg_sivol
        - avg_snvol
        - avg_icesalt
    - _target_: anemoi.models.layers.bounding.HardtanhBounding #[0, 1]
      variables:
        - avg_siconc
        - avg_sialb
      min_val: 0
      max_val: 1

dataloader:
  limit_batches:
    training: 300
    validation: 300
  dataset: ${hardware.paths.data}/${hardware.files.dataset_oce}
  training:
    dataset:
    - dataset: ${hardware.paths.data}/${hardware.files.dataset_oce}
      start: 2005-01-03
      end: 2021
      select: [avg_svn, avg_sve, avg_siue, avg_sivn, avg_sivol, avg_snvol, avg_siconc, avg_icesalt, avg_sialb, avg_tos, avg_sos, avg_zos, cos_latitude, sin_latitude, cos_longitude, sin_longitude, cos_solar_zenith_angle, cos_julian_day, cos_local_time, sin_julian_day, sin_local_time, lsm]
    - dataset: ${hardware.paths.data}/${hardware.files.dataset_atm}
      start: 2005-01-03
      end: 2021
      select: [10u, 10v, 2t, 2d, ssrd, strd, tp, msl]
    start: 2005-01-03
    end: 2021

  validation:
    dataset:
    - dataset: ${hardware.paths.data}/${hardware.files.dataset_oce}
      start: 2022
      end: 2022
      select: [avg_svn, avg_sve, avg_siue, avg_sivn, avg_sivol, avg_snvol, avg_siconc, avg_icesalt, avg_sialb, avg_tos, avg_sos, avg_zos, cos_latitude, sin_latitude, cos_longitude, sin_longitude, cos_solar_zenith_angle, cos_julian_day, cos_local_time, sin_julian_day, sin_local_time, lsm]
    - dataset: ${hardware.paths.data}/${hardware.files.dataset_atm}
      start: 2022
      end: 2022
      select: [10u, 10v, 2t, 2d, ssrd, strd, tp, msl]
    start: 2022
    end: 2022

  test:
    dataset:
    - dataset: ${hardware.paths.data}/${hardware.files.dataset_oce}
      start: 2023
      end: 2023-12-28
      select: [avg_svn, avg_sve, avg_siue, avg_sivn, avg_sivol, avg_snvol, avg_siconc, avg_icesalt, avg_sialb, avg_tos, avg_sos, avg_zos, cos_latitude, sin_latitude, cos_longitude, sin_longitude, cos_solar_zenith_angle, cos_julian_day, cos_local_time, sin_julian_day, sin_local_time, lsm]
    - dataset: ${hardware.paths.data}/${hardware.files.dataset_atm}
      start: 2023
      end: 2023-12-28
      select: [10u, 10v, 2t, 2d, ssrd, strd, tp, msl]
    start: 2023
    end: 2023

training:
  max_steps: 150000
  lr:
    iterations: 150000 #${training.max_steps}
    min: 3e-7 #Not scaled by #GPU
  variable_loss_scaling:
    default: 1
    sfc:
      avg_svn: 0.5
      avg_sve: 0.5
      avg_siue: 100
      avg_sivn: 100
      avg_sivol: 500
      avg_snvol: 300
      avg_siconc: 200
      avg_icesalt: 30
      avg_sialb: 30
      avg_tos: 100
      avg_sos: 10
      avg_zos: 10
  metrics:
      - avg_zos
      - avg_tos
      - avg_sivol
      - avg_siconc
      - avg_sos
      - avg_svn
      - avg_sve
      - avg_icesalt
      - avg_sialb
      - avg_siue
      - avg_sivn
      - avg_snvol
  # rollout:
  #   epoch_increment: 1
  #   max: 4

Version

current anemoi-training develop (25-11-2024)

Platform (OS and architecture)

atos

Relevant log output

No response

Accompanying data

No response

Organisation

No response

sahahner commented 1 week ago

solved by #152

lzampier commented 10 hours ago

The colour bar error has come back, this time only for the last two columns of the panel:

(image: gnn_pred_val_sample_rstep00_batch0000_rank0_epoch000)
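
To narrow down which columns are affected, it may help to check the NaN fraction of each array before it is plotted. The sketch below is a generic diagnostic, not part of anemoi-training; `nan_report` and the panel names are hypothetical and stand in for whatever arrays feed the panel.

```python
import numpy as np


def nan_report(panels):
    """Map each panel name to the fraction of NaN values in its array."""
    return {
        name: float(np.isnan(np.asarray(values, dtype=float)).mean())
        for name, values in panels.items()
    }


# Hypothetical panel contents, for illustration only.
panels = {
    "target": np.array([0.1, np.nan, 0.3, 0.4]),
    "prediction": np.array([0.1, 0.2, 0.3, 0.4]),
}

report = nan_report(panels)
for name, fraction in report.items():
    print(f"{name}: {fraction:.0%} NaN")
```

A non-zero fraction for the arrays behind the last two columns, and zero elsewhere, would be consistent with NaNs causing the wrong bounds.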