Closed AndreaGarciaJuan closed 3 years ago
BIC plot using:
I do not understand why the curve does not go up for a large number of classes. Do you have any idea?
Is the number of independent samples too large?
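For reference, the standard definition of the BIC may help in reading the question above. This is the textbook formula, not something taken from the code used for the plot; the symbols p, n and L-hat are introduced here only for illustration.

```latex
\mathrm{BIC} = p \,\ln(n) \;-\; 2 \,\ln(\hat{L})
```

Here p is the number of free parameters of the model (it grows with the number of classes), n is the number of samples, and \hat{L} is the maximized likelihood. For independent samples, \ln(\hat{L}) is a sum of n per-sample terms, so the fit term grows roughly linearly with n while the penalty only grows as \ln(n). If correlated samples are counted as independent, n is overestimated, the fit term can dominate, and the curve may keep decreasing as classes are added. This is one possible reading of the behaviour, not a diagnosis of the plot.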
@AndreaGarciaJuan also, here is a code snippet to run a parallel computation:
import concurrent.futures

import numpy as np
from tqdm import tqdm

LIST_OF_ARGS = np.arange(0, 12)  # arguments to iterate over, for instance the nb of classes for a PCM

def do_this(this_arg):
    """Function to run on a single argument."""
    # This is where you would run a BIC computation, given a nb of classes
    return this_arg**2  # dummy computation for the example

results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    future_to_arg = {executor.submit(do_this, arg): arg for arg in LIST_OF_ARGS}
    futures = concurrent.futures.as_completed(future_to_arg)
    futures = tqdm(futures, total=len(LIST_OF_ARGS))  # progress bar
    for future in futures:
        traj = None
        try:
            traj = future.result()
        except Exception:
            pass  # a failed task leaves traj as None; it is filtered out below
        finally:
            results.append(traj)
results = [r for r in results if r is not None]  # Only keep non-empty results
If you want to pass a dataset (without modifying it) to the function, you can do it this way:
import concurrent.futures

import numpy as np
from tqdm import tqdm

SHARED_DATA = 12  # read-only data passed to every call
LIST_OF_ARGS = np.arange(0, 12)  # arguments to iterate over, for instance the nb of classes for a PCM

def do_this(data, this_arg):
    """Function to run on a single argument."""
    # This is where you would run a BIC computation, given a nb of classes
    return data + this_arg**2  # dummy computation for the example

results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
    future_to_arg = {executor.submit(do_this, SHARED_DATA, arg): arg for arg in LIST_OF_ARGS}
    futures = concurrent.futures.as_completed(future_to_arg)
    futures = tqdm(futures, total=len(LIST_OF_ARGS))  # progress bar
    for future in futures:
        traj = None
        try:
            traj = future.result()
        except Exception:
            pass  # a failed task leaves traj as None; it is filtered out below
        finally:
            results.append(traj)
results = [r for r in results if r is not None]  # Only keep non-empty results
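One thing to keep in mind with the snippets above: `as_completed` yields futures in completion order, so the `results` list is not guaranteed to line up with `LIST_OF_ARGS`. For a BIC curve, each value needs to stay attached to its number of classes. A minimal sketch of one way to do that, using only the standard library, is below. The names `list_of_args`, `results` and `ordered` and the choice of `max_workers=4` are illustrative assumptions, and `do_this` is still the dummy computation.

```python
import concurrent.futures

def do_this(n_classes):
    """Stand-in for a BIC computation for a given number of classes."""
    return n_classes ** 2  # dummy computation, as in the snippets above

list_of_args = list(range(1, 12))  # hypothetical list of numbers of classes

results = {}  # maps each argument to its result
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    future_to_arg = {executor.submit(do_this, arg): arg for arg in list_of_args}
    for future in concurrent.futures.as_completed(future_to_arg):
        arg = future_to_arg[future]  # recover which argument this future belongs to
        try:
            results[arg] = future.result()
        except Exception as exc:
            # Report the failure instead of silently dropping it
            print(f"n_classes={arg} failed: {exc}")

# Sort by argument so the values can be plotted against the number of classes
ordered = [(arg, results[arg]) for arg in sorted(results)]
```

Keying the results by argument also makes it easy to see which numbers of classes failed, since they are simply missing from `results`.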
ok! thank you very much, I will try it tomorrow
Parallelisation works well using multi-threading, also in the VRE. Now I am working on including time correlation as an input when selecting the sub-dataset, and I will create a plot_BIC function in the Plotter class to keep the development notebook clean.
An example of the plot:
I am trying to include time correlation in the BIC calculation. For my dataset in the Mediterranean, when I use the month of December, I have to use a spatial correlation of 40 km to get a minimum (K=10). If I add another month (June, to pick a summer month), the curve does not show a minimum, and I have to increase the spatial correlation to 60 km to get the first minimum (K=10). Here is the figure:
I have tried using randomly chosen months separated by a given number of months, but I have never found a minimum in the curve. I feel that adding another time step to the BIC calculation increases the correlation enormously. I think the best idea (at least for the beta version) is to ask the user to choose 2 time steps in the dataset, and to tell them that if there is no clear minimum they should choose other time steps or increase the spatial correlation. What do you think?
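The "is there a clear minimum?" check described above could be sketched as follows. This is only an illustration: `first_interior_minimum` is a hypothetical helper, not part of the package, and the BIC values are made up. It treats a minimum at either end of the curve as "no minimum", which matches the case where the curve keeps decreasing.

```python
def first_interior_minimum(k_values, bic_values):
    """Return the K at the first local minimum of the BIC curve, or None.

    A point at either end of the curve is never counted, so a curve that
    decreases all the way to the largest K gives None.
    """
    for i in range(1, len(bic_values) - 1):
        lower_than_previous = bic_values[i] < bic_values[i - 1]
        not_higher_than_next = bic_values[i] <= bic_values[i + 1]
        if lower_than_previous and not_higher_than_next:
            return k_values[i]
    return None

# Made-up curves for illustration only
k_values = [2, 4, 6, 8, 10, 12, 14]
with_minimum = [900, 700, 600, 550, 520, 540, 580]
always_decreasing = [900, 800, 700, 650, 620, 600, 590]

best_k = first_interior_minimum(k_values, with_minimum)
no_k = first_interior_minimum(k_values, always_decreasing)
if no_k is None:
    print("No clear minimum: choose other time steps or increase the spatial correlation.")
```

A curve averaged over several runs can be noisy, so in practice the first local minimum may need some smoothing or a tolerance before it is trusted.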
This problem was solved by using a different grid selection for each dataset. The function now takes time_steps as input, so the user can choose the time steps to use for the BIC calculation. If the time steps are too close in time, a warning is raised.
BIC, BIC_min = BIC_calculation(ds=ds, coords_dict=P.coords_dict,
                               corr_dist=corr_dist, time_steps=time_steps,
                               pcm_features=pcm_features, features_in_ds=features_in_ds, z_dim=z_dim,
                               Nrun=Nrun, NK=NK)
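The warning for time steps that are too close could look something like the sketch below. This is not the code inside BIC_calculation: `close_time_step_pairs` and `months_index` are hypothetical helpers, the 'YYYY-MM' label format is taken from the example in this thread, and the 3-month threshold is an assumption chosen only for illustration.

```python
import warnings
from itertools import combinations

def months_index(label):
    """Convert a 'YYYY-MM' label into a running month count."""
    year, month = label.split("-")[:2]
    return int(year) * 12 + int(month)

def close_time_step_pairs(time_steps, min_months=3):
    """Return the pairs of time steps closer than min_months, warning if any exist.

    min_months=3 is an assumed threshold, not a value from the package.
    """
    close = [
        (a, b)
        for a, b in combinations(time_steps, 2)
        if abs(months_index(a) - months_index(b)) < min_months
    ]
    if close:
        warnings.warn(f"Time steps closer than {min_months} months: {close}")
    return close

# Six months apart, so no warning is expected here
pairs = close_time_step_pairs(['2018-01', '2018-07'])
```

Returning the offending pairs, rather than only warning, lets the caller decide whether to continue or to ask for other time steps.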
Here is an example of the BIC using time_steps = ['2018-01','2018-07']:
The minimum is at k=12.
Plot BIC in Develop_PCM_model notebook