dssg / triage

General Purpose Risk Modeling and Prediction Toolkit for Policy and Social Good Problems

Specification for feature_group_strategies is not working with leave-one-out or leave-one-in #950

Open ElenaVillano opened 2 months ago

ElenaVillano commented 2 months ago

Hi everyone,

I'm running triage on Linux [Red Hat 11.3.1-4] with Python 3.10.6, using the v8 triage version. My database is PostgreSQL 15.7 on x86_64-pc-linux-gnu, compiled by GCC (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit.

Configuration details:

config_version: 'v8'

random_seed: 1472385

temporal_config:
    feature_start_time: '2021-11-01'
    feature_end_time: '2022-12-31'

    label_start_time: '2021-11-01'
    label_end_time: '2022-12-31'

    model_update_frequency: '1month' # windows

    max_training_histories: '6month' # training period
    training_label_timespans: ['4d'] # time span in which the label can occur
    training_as_of_date_frequencies: '1d' # how often the decision is made

    test_durations: '1week'  # how long the model will be used
    test_label_timespans: ['4d']
    test_as_of_date_frequencies: '1d' 

cohort_config: # Cohort = containers that will arrive at the terminal the next day, based on the ETA
    filepath: 'triage/sql/cohorts/cohorte_antes_de_arribo.sql'       
    name: 'arribo_buque'

label_config:  # Label = whether the container will leave within 2 to 4 days
    filepath: 'triage/sql/labels/label_2_4_dias_estadia.sql'
    name: 'e2_4_dias'

feature_aggregations:
  -
    prefix: 'ecvr' # simple variables
    from_obj: 'ontology.entities' 
    knowledge_date_column: 'fecha_eta'

    aggregates_imputation:
        all:
          type: 'mean'

    aggregates:
      - # peso_neto
        quantity: 'peso_neto'
        metrics:
          - 'max'
      - # peso_bruto
        quantity: 'peso_bruto'
        metrics:
          - 'max'

    categoricals_imputation:
      all:
        type: 'null_category' 

    categoricals:
      - # dimension
        column: 'dimension'
        metrics:
          - 'sum' 
        choices: ['20','40','45']
      - # ruta_linea_naviera
        column: 'ruta_linea_naviera'
        metrics:
          - 'sum' 
        choice_query: 'select distinct ruta_linea_naviera from ontology.entities'

    intervals: ['all']

  -
    prefix: 'mercha' # merchandise variables
    from_obj: 'ontology.comportamiento'
    knowledge_date_column: 'fecha_eta'

    categoricals_imputation:
        all:
          type: 'null_category'

    categoricals:
      - # capitulo
        column: 'capitulo'
        metrics:
          - 'sum'
        choice_query: 'select distinct capitulo from ontology.comportamiento'
      - # seccion
        column: 'seccion' 
        metrics:
          - 'sum'
        choice_query: 'select distinct seccion from ontology.comportamiento'

    aggregates_imputation:
        all:
          type: 'mean'

    aggregates:
      - # conteo_capitulo_2sem
        quantity: 
          ccap2s: 'conteo_capitulo_2sem'
        metrics:
          - 'min'
      - # conteo_capitulo_4sem
        quantity: 
          ccap4s: 'conteo_capitulo_4sem'
        metrics:
          - 'min'

    intervals: ['all']

  -
    prefix: 'consig' # consignee variables
    from_obj: 'ontology.comportamiento'
    knowledge_date_column: 'fecha_eta'

    categoricals_imputation:
      all:
        type: 'null_category' 

    categoricals:
      - # consignee top10
        column: 'consignatario'
        metrics:
          - 'sum' 
        choice_query: 'with top50 as(select consignatario, count(consignatario) from ontology.comportamiento group by consignatario order by 2 desc limit 100) select consignatario from top50'

    aggregates_imputation:
        all:
          type: 'mean'

    aggregates:
      - # conteo_consig_2sem
        quantity: 
          ccons2s: 'conteo_consig_2sem'
        metrics:
          - 'min'
      - # conteo_consig_4sem
        quantity: 
          ccons4s: 'conteo_consig_4sem'
        metrics:
          - 'min'

    intervals: ['all']

  -
    prefix: 'liru' # shipping line and container route variables
    from_obj: 'ontology.comportamiento'
    knowledge_date_column: 'fecha_eta'

    aggregates_imputation:
        all:
          type: 'mean'

    aggregates:
      - # conteo_ruta_2sem
        quantity: 
          crut2s: 'conteo_ruta_2sem'
        metrics:
          - 'min'
      - # conteo_ruta_4sem
        quantity: 
          crut4s: 'conteo_ruta_4sem'
        metrics:
          - 'min'

    intervals: ['all']

## all, leave-one-out, leave-one-in, all-combinations
feature_group_strategies: ['leave-one-out']
#feature_group_strategies: ['all-combinations']

grid_config:
  'sklearn.tree.DecisionTreeClassifier':
        criterion: ['gini']
        max_depth: [5,10,~] 
        min_samples_split: [10,50,100] 
  'sklearn.ensemble.RandomForestClassifier':
        n_estimators: [200,300]
        criterion: ['gini']
        max_depth: [5,10]
        max_features: ['sqrt']
        min_samples_split: [10,50]
  'triage.component.catwalk.estimators.classifiers.ScaledLogisticRegression':
        penalty: ['l1','l2']
        C: [0.01, 0.1, 1.0, 10]
  'sklearn.dummy.DummyClassifier':
        strategy: ['stratified']
  'sklearn.ensemble.ExtraTreesClassifier':
        n_estimators: [500]
        criterion: ['gini']
        max_depth: [5,10]
        max_features: ['sqrt']
        min_samples_split: [50,100]
  'triage.component.catwalk.baselines.rankers.BaselineRankMultiFeature':
        rules:
            - [{feature: 'ecvr_entity_id_all_peso_neto_max', low_value_high_score: False}]

scoring:
    testing_metric_groups:
       -
          metrics: [precision@, recall@]
          thresholds:
            percentiles: [10, 20, 25, 30]
            top_n: [1000, 1400, 1750, 2100]

    training_metric_groups:
       -
          metrics: [precision@, recall@]
          thresholds:
            percentiles: [10, 20, 25, 30]
            top_n: [1000, 1400, 1750, 2100]

The configuration above worked fine until I set feature_group_strategies to leave-one-out or leave-one-in. In both cases I get the same error (detailed below). When I use feature_group_strategies: ['all-combinations'] the run completes, but the variables are not grouped as expected and the results are the same as if I had used all.
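
For reference, here is a minimal sketch of the subsets each strategy name suggests for the four prefixes in the config above. It is not triage's implementation: `expected_subsets` is a hypothetical helper, and it assumes each prefix is treated as its own feature group. The posted config does not show a `feature_group_definition` key, so how triage actually forms the groups in this run is not known from the issue.

```python
from itertools import combinations

# Feature-group prefixes taken from the feature_aggregations section above.
# Assumption: one feature group per prefix.
GROUPS = ["ecvr", "mercha", "consig", "liru"]


def expected_subsets(groups, strategy):
    """Return the feature-group subsets a strategy name suggests (illustrative only)."""
    if strategy == "all":
        return [list(groups)]
    if strategy == "leave-one-out":
        # One subset per group, each omitting exactly that group.
        return [[g for g in groups if g != left_out] for left_out in groups]
    if strategy == "leave-one-in":
        # One subset per group, each containing only that group.
        return [[g] for g in groups]
    if strategy == "all-combinations":
        # Every non-empty combination of groups.
        return [
            list(combo)
            for size in range(1, len(groups) + 1)
            for combo in combinations(groups, size)
        ]
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("all", "leave-one-out", "leave-one-in", "all-combinations"):
    print(f"{name}: {len(expected_subsets(GROUPS, name))} subset(s)")
```

Under these assumptions the counts are 1, 4, 4 and 15. A run of all-combinations that produces a single set of matrices would therefore match the "behaves like all" symptom described here.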

Command used:

triage experiment triage/experimentos_arribo/e2_ --n-db-processes 3 --n-processes 8 --no-validate --no-save-predictions

Everything runs smoothly until the matrix building step, where I encounter this error:

2024-09-08 15:17:14 - ERROR Child error
Traceback (most recent call last):
  File "/Ccd/-pyenv/versions/tri-hp/lib/python3.10/site-packages/triage/experiments/multicore.py", line 166, in run_task_with_splatted_arguments
    return task_runner(**task)
  File "/Ccd/pyenv/versions/tri-hp/lib/python3.10/site-packages/triage/component/architect/builders.py", line 321, in build_matrix
    output, labels = self.stitch_csvs(feature_queries, label_query, matrix_store, matrix_uuid)
  File "/Ccd/pyenv/versions/tri-hp/lib/python3.10/site-packages/triage/component/architect/builders.py", line 551, in stitch_csvs
    if len(df_pl.get_column('as_of_date').head(1)[0].split()) > 1:
  File "/Ccd/.pyenv/versions/tri-hp/lib/python3.10/site-packages/polars/dataframe/frame.py", line 6128, in get_column
    return self[name]
exceptions.ColumnNotFoundError: as_of_date

It seems like the as_of_date column is missing or not properly generated during matrix building, specifically when using the leave-one-out or leave-one-in strategies.

I expected the leave-one-out strategy to group variables accordingly and generate matrices without this error, but instead, the process halts when it reaches matrix building. I checked the matrices generated in the process and confirmed that the as_of_date column is indeed present.
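
One way to narrow this down is to check the header of each intermediate file that gets stitched, not only the final matrices. The sketch below uses only the standard library. `missing_index_columns` is a hypothetical helper, and the assumptions that the intermediate files are CSVs with a header row and that the index columns are `entity_id` and `as_of_date` are inferred from the `stitch_csvs` name and the feature name `ecvr_entity_id_all_peso_neto_max`, not confirmed against triage's code.

```python
import csv
import io

# Assumed index columns; adjust if the matrices use different names.
REQUIRED = ("entity_id", "as_of_date")


def missing_index_columns(csv_file, required=REQUIRED):
    """Return the required column names absent from the CSV header row."""
    # An empty file has no header, so every required column is reported.
    header = next(csv.reader(csv_file), [])
    return [col for col in required if col not in header]


# In-memory stand-ins for files on disk, for illustration only.
good = io.StringIO(
    "entity_id,as_of_date,ecvr_entity_id_all_peso_neto_max\n1,2022-01-01,10.5\n"
)
empty = io.StringIO("")

print(missing_index_columns(good))   # []
print(missing_index_columns(empty))  # ['entity_id', 'as_of_date']
```

For real files, the same function can be called with `open(path, newline="")` on each intermediate file. An empty or header-less file would be consistent with the ColumnNotFoundError shown above.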

My questions would be:

Any guidance or suggestions would be greatly appreciated!

Thank you for your help.

nanounanue commented 2 months ago

Adding to this: all-combinations is not working anymore either; it just runs all ...