sub-sampling using correlation distance

AndreaGarciaJuan commented 2 years ago

[x] distance calculation (sklearn.metrics.pairwise.haversine_distances) for all points first
[x] iterations until all points are far enough
[x] use mask
[x] how long it takes?

AndreaGarciaJuan commented 2 years ago

First try 8 min (2 min for distance matrix calculation) Capture d'écran de 2021-10-08 17-58-39

AndreaGarciaJuan commented 2 years ago

def get_regulargrid_dataset(ds, corr_dist, gridplot=False, season='all'):
    '''Re-sampling od the dataset selecting profiles separated the correlation distance

           Parameters
           ----------
               ds: reference profiles dataset
               corr_dist: correlation distance
               gridplot: plot the grid or not

           Returns
           ------
               Re-sampled dataset

               '''
    # create distance matrix
    from sklearn.metrics.pairwise import haversine_distances
    from math import radians
    lats_in_radians = np.array([radians(_) for _ in ds['lat'].values])
    lons_in_radians = np.array([radians(_) for _ in ds['long'].values])
    coords_in_radians = np.column_stack((lats_in_radians, lons_in_radians))
    dist_matrix = haversine_distances(coords_in_radians)
    dist_matrix = dist_matrix * 6371000/1000  # multiply by Earth radius to get kilometers

    # create mask
    mask_s = np.ones(len(ds['n_profiles'].values))

    #loop
    n_iterations = range(len(ds['n_profiles'].values))
    for i in n_iterations:
        random_p = np.random.choice(ds['n_profiles'].values*mask_s, 1, replace=False)
        print('------------')
        print(random_p)

        if random_p == 0:
            print('point is already masked')
            continue

        # points father than corr_dist = 1
        bool_far_points = (np.array(dist_matrix[:,int(random_p[0])])*mask_s > corr_dist)
        print('n point to delate:')
        print(sum(np.logical_not(bool_far_points)))

        # stop condition
        if np.all(bool_far_points):
            print('no more points to delate')
            print(i)
            break

        # change mask
        mask_s = mask_s*bool_far_points
        print('n points in mask')
        print(sum(mask_s))

gmaze commented 2 years ago

Given the limited size of the domain, is the error that large if you switch the haversine distance for a flat Earth approx. Euclidean distance ? Computation should be faster, especially because sklearn.metrics.pairwise_distances can be parallelised

gmaze commented 2 years ago

Also, is the loop long because it has plenty of iterations or because one iteration is rather long ?

AndreaGarciaJuan commented 2 years ago

Second try: correlation distance: 50 km 2 min for computing new dataset with 3625 profiles

Capture d'écran de 2021-10-13 10-25-37

AndreaGarciaJuan commented 2 years ago

Classification Capture d'écran de 2021-10-13 10-51-22

AndreaGarciaJuan commented 2 years ago

Float 3901915: 38018 profiles. I can do sub-sampling using a float32 distance matrix:

dist_matrix = haversine_distances(coords_in_radians).astype(np.float32)

Capture d'écran de 2021-10-14 13-32-02

But when trying to calculate BIC, RAM memory is not enough for distance matrix: Capture d'écran de 2021-10-19 10-29-08

gmaze commented 2 years ago

did you also try np.float16 ?

gmaze commented 2 years ago

dist_matrix = haversine_distances(coords_in_radians).astype(np.float32)
Here you convert type after the computation, what is the type of the output of haversine_distances as a function of coords_in_radians type ? You may need to convert coords_in_radians first

AndreaGarciaJuan commented 2 years ago

When trying:

lats_in_radians = np.array([radians(_) for _ in ds['lat'].values]).astype(np.float32)
lons_in_radians = np.array([radians(_) for _ in ds['long'].values]).astype(np.float32)
coords_in_radians = np.column_stack((lats_in_radians, lons_in_radians))
dist_matrix = haversine_distances(coords_in_radians)

I obtain: Capture d'écran de 2021-10-19 10-55-59 I think the output of haversine_distances is always float64

If I try:

dist_matrix = haversine_distances(coords_in_radians).astype(np.float16)

I obtain: Capture d'écran de 2021-10-19 11-12-10

AndreaGarciaJuan commented 2 years ago

In the new version of the 'get_regulargrid_dataset' function I calculate the distance matrix only for the profiles around the random profile I choose in each iteration. There is no problems with RAM memory because the matrices are little and it do not take a lot of time to compute:

float in Gulf Stream (8793 profiles): 13s float in Southern Ocean (16435 profiles): 30s float in Agulhas Current (18391 profiles): 30s

Example of results: Capture d'écran de 2022-01-12 11-32-16

AndreaGarciaJuan commented 2 years ago

Calculating the BIC with this subsampling method gives "reasonable" computation time (with 10 runs and testing k=[1:15]: float in Gulf Stream (8793 profiles): 4 min float in Southern Ocean (16435 profiles): 10 min float in Agulhas Current (18391 profiles): 9 min

example (Agulhas Current): Capture d'écran de 2022-01-13 11-27-19

euroargodev / DMQC-PCM

sub-sampling using correlation distance #18