Use training data from compressed forcing runs

dsgoll123 commented 1 year ago

Currently, we train the RF on the global simulation (i.e. the "resp" entry in varlist.json). This was done for developing the tool [debug mode]. Target: we also want to train on a (variable-size) subset of pixels produced by ORCHIDEE runs that use the compressed forcing [production mode]. The resulting data should then be extrapolated to the original global grid.

It would be nice to maintain the current training for debugging/developing the tool.

vbast commented 1 year ago

I suggest adding two fields to the "resp" entry in varlist.json:

"format", with two possible values: "global" or "compressed". It indicates the format of the "sourcefile" provided (i.e. if format=global, we train the RF on the global simulation in debug mode; if format=compressed, we train the RF on a subset of pixels in production mode).
"targetfile", which contains the path to a global restart file whose variables will be filled with the predicted data over the globe. In debug mode this field is the same as "sourcefile", so it can be dropped (or it can point to a different file). In production mode it is obligatory, as it defines the correct format for the global restart file.

If format=compressed (i.e. in production mode), the evaluation part is skipped, as we do not have a real global model run to compare with.
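
As a rough illustration, the proposal could be handled as in the sketch below. Only the field names "sourcefile", "format" and "targetfile" come from this thread; the layout of the "resp" entry, the file paths, the helper name `resolve_resp` and the default of "global" when "format" is missing are assumptions made for the example.

```python
import json

# Hypothetical "resp" entry; the paths are placeholders, not real files.
VARLIST_TEXT = """
{
  "resp": {
    "sourcefile": "compressed_run/stomate_restart.nc",
    "format": "compressed",
    "targetfile": "global_run/stomate_restart.nc"
  }
}
"""


def resolve_resp(resp):
    """Return (mode, targetfile, run_evaluation) for a "resp" entry."""
    # Assumption: a missing "format" field means the global (debug) case.
    fmt = resp.get("format", "global")
    if fmt == "global":
        # Debug mode: targetfile is optional and defaults to sourcefile.
        return "debug", resp.get("targetfile", resp["sourcefile"]), True
    if fmt == "compressed":
        # Production mode: targetfile is obligatory, evaluation is skipped.
        if "targetfile" not in resp:
            raise ValueError('"targetfile" is required when format=compressed')
        return "production", resp["targetfile"], False
    raise ValueError(f"unknown format: {fmt!r}")


resp = json.loads(VARLIST_TEXT)["resp"]
print(resolve_resp(resp))
```

Returning the evaluation switch together with the mode keeps the "skip evaluation in production" rule in one place.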

dsgoll123 commented 1 year ago

This sounds very good to me. Please go ahead and make the edits.

The introduction of 'targetfile' provides flexibility in case we are later able to drop the pred-var1 file. Currently the targetfile can be set to the pred-var1 file (as the formats are identical, and variables not (yet) handled by ML should be taken from pred-var1).

vbast commented 1 year ago

I added this new functionality and updated the documentation accordingly. One point: 'resp' is linked to the stomate restart file, whereas pred-var1 is linked to the stomate history file (at least in the online examples). So in my view, resp-targetfile cannot be set to the same file as pred-var1.

dsgoll123 commented 1 year ago

Hi Vlad

I had a go and tried the tool on the CNP version. I did a clean start running tasks 1, 2 and 3, then performed an ORC simulation with the aligned forcing. When I run task 4, it crashes. It looks like an issue with the dimensions, but I do not really understand what the problem is.

You can find my configuration here (it's the GitHub code downloaded today; I only modified job, varlist.json and MLacc.def): /home/surface3/dgoll/SPINUP_ML/20221220/SPINacc-main

Cheers Daniel

vbast commented 1 year ago

Hi Daniel,

I think you have reversed the paths: 'sourcefile' should be the one obtained with the compressed (aligned) forcing, i.e. it is the 'source' for training; 'targetfile' should be the one obtained in a standard global run, i.e. it is the 'target' to be filled with the trained/predicted data. Perhaps the terms sourcefile/targetfile should be replaced by other names to avoid confusion.
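
One possible guard against this mix-up, sketched under assumptions: since the compressed run covers only a subset of pixels, the file used for training should never contain more pixels than the global file to be filled. The function name and the pixel counts below are hypothetical, and reading the counts from the restart files is left out.

```python
def check_paths_not_reversed(n_source_pixels, n_target_pixels):
    """Raise if the training file has more pixels than the file to be filled.

    Assumption: the pixel counts are read from the two restart files elsewhere.
    Equal counts are allowed, since in debug mode both paths may be the same.
    """
    if n_source_pixels > n_target_pixels:
        raise ValueError(
            f"sourcefile has {n_source_pixels} pixels but targetfile has only "
            f"{n_target_pixels}; the two paths may be swapped"
        )


# Made-up pixel counts, for illustration only.
check_paths_not_reversed(n_source_pixels=200, n_target_pixels=10000)
```

A check like this would turn an opaque dimension error into a message that points at the configuration.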

For example, they could be "compressed_restartfile" and "global_restartfile". In debug mode the user provides only "global_restartfile"; in production mode, both. The "format" field can then be dropped: the tool will assume that any file provided in "compressed_restartfile" is in compressed form, and if no file is provided there, it is running in debug mode.
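
A minimal sketch of how the mode could be inferred under this alternative naming. The two field names are the ones suggested here; the function name `infer_mode`, the returned dictionary and the example paths are assumptions, and skipping evaluation in production mode follows the earlier comment in this thread.

```python
def infer_mode(resp):
    """Infer debug/production mode from which restart files are given."""
    compressed = resp.get("compressed_restartfile")
    global_file = resp.get("global_restartfile")
    if not global_file:
        # The global restart file is needed in both modes.
        raise ValueError('"global_restartfile" is required')
    if compressed:
        # Production: train on the compressed run, fill the global file.
        return {"mode": "production", "train_on": compressed,
                "fill": global_file, "evaluate": False}
    # Debug: train on and compare against the global run itself.
    return {"mode": "debug", "train_on": global_file,
            "fill": global_file, "evaluate": True}


# Placeholder paths, for illustration only.
print(infer_mode({"global_restartfile": "global_run/stomate_restart.nc"}))
print(infer_mode({
    "compressed_restartfile": "compressed_run/stomate_restart.nc",
    "global_restartfile": "global_run/stomate_restart.nc",
}))
```

With this layout there is no separate "format" switch that could disagree with the files actually supplied.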

Vlad

dsgoll123 commented 1 year ago

Thanks Vlad. My bad, now it runs.

dsgoll123 commented 1 year ago

I have successfully tested tasks 1-4 of the workflow, including the pixel-level and rerun simulations with ORCHIDEE-CNP v1.3, which read in the files produced by the tool.

I will close this issue.

CALIPSO-project / SPINacc

Use training data from compressed forcing runs #19