AndreaGarciaJuan closed this issue 2 years ago.
First try: 8 min (2 min of which for the distance matrix calculation)
```python
import numpy as np

def get_regulargrid_dataset(ds, corr_dist, gridplot=False, season='all'):
    '''Re-sampling of the dataset, selecting profiles separated by the correlation distance

    Parameters
    ----------
    ds: reference profiles dataset
    corr_dist: correlation distance (km)
    gridplot: plot the grid or not

    Returns
    -------
    Re-sampled dataset
    '''
    # create distance matrix
    from sklearn.metrics.pairwise import haversine_distances
    from math import radians
    lats_in_radians = np.array([radians(_) for _ in ds['lat'].values])
    lons_in_radians = np.array([radians(_) for _ in ds['long'].values])
    coords_in_radians = np.column_stack((lats_in_radians, lons_in_radians))
    dist_matrix = haversine_distances(coords_in_radians)
    dist_matrix = dist_matrix * 6371000 / 1000  # multiply by Earth radius to get kilometers

    # create mask
    mask_s = np.ones(len(ds['n_profiles'].values))

    # loop
    n_iterations = range(len(ds['n_profiles'].values))
    for i in n_iterations:
        random_p = np.random.choice(ds['n_profiles'].values * mask_s, 1, replace=False)
        print('------------')
        print(random_p)
        if random_p == 0:
            print('point is already masked')
            continue
        # points farther than corr_dist = 1
        bool_far_points = (np.array(dist_matrix[:, int(random_p[0])]) * mask_s > corr_dist)
        print('n points to delete:')
        print(sum(np.logical_not(bool_far_points)))
        # stop condition
        if np.all(bool_far_points):
            print('no more points to delete')
            print(i)
            break
        # change mask
        mask_s = mask_s * bool_far_points
        print('n points in mask')
        print(sum(mask_s))
```
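To make the masking step concrete, here is a minimal sketch on a synthetic 4-profile distance matrix (the numbers are made up): points closer than `corr_dist` to the chosen profile are zeroed out of the mask. Note that the chosen profile itself also ends up masked, since its self-distance is 0.

```python
import numpy as np

# Hypothetical 4-profile distance matrix in km (symmetric, zero diagonal)
dist_matrix = np.array([
    [  0.,  30.,  80., 120.],
    [ 30.,   0.,  60., 100.],
    [ 80.,  60.,   0.,  45.],
    [120., 100.,  45.,   0.],
])
corr_dist = 50.0
mask_s = np.ones(4)

# Pick profile 0: keep only points farther than corr_dist from it
bool_far_points = dist_matrix[:, 0] * mask_s > corr_dist
mask_s = mask_s * bool_far_points
print(mask_s)  # → [0. 0. 1. 1.]: profile 1 (30 km away) is masked, and so is profile 0 itself
```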
Given the limited size of the domain, is the error really that large if you switch the haversine distance for a flat-Earth approximate Euclidean distance? Computation should be faster, especially because sklearn.metrics.pairwise_distances can be parallelised.
Also, is the loop long because it has many iterations, or because a single iteration is slow?
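As a quick sanity check of the flat-Earth suggestion (a sketch with made-up coordinates, not the actual dataset): over a small domain, an equirectangular approximation, with longitude scaled by cos(latitude), stays within a fraction of a kilometre of the haversine distance.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

# Hypothetical positions in radians, spread over a small (~60 km) domain
rng = np.random.default_rng(42)
lat0, lon0 = np.radians(45.0), np.radians(5.0)
lats = lat0 + rng.uniform(-0.005, 0.005, 200)
lons = lon0 + rng.uniform(-0.005, 0.005, 200)

# Haversine distances from the first profile to all the others
dlat = lats - lats[0]
dlon = lons - lons[0]
a = np.sin(dlat / 2)**2 + np.cos(lats) * np.cos(lats[0]) * np.sin(dlon / 2)**2
d_hav = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Flat-Earth (equirectangular) approximation: scale longitude by cos(latitude)
d_flat = EARTH_RADIUS_KM * np.hypot(dlon * np.cos(lat0), dlat)

print(np.abs(d_hav - d_flat).max())  # worst-case error in km over this domain
```

With the projected x = R·lon·cos(lat), y = R·lat coordinates, `sklearn.metrics.pairwise_distances(X, metric='euclidean', n_jobs=-1)` can then compute the matrix in parallel, as suggested.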
Second try: correlation distance of 50 km; 2 min to compute the new dataset with 3625 profiles.
Classification
Float 3901915: 38018 profiles. I can do sub-sampling using a float32 distance matrix:
```python
dist_matrix = haversine_distances(coords_in_radians).astype(np.float32)
```
But when trying to calculate the BIC, there is not enough RAM for the distance matrix:
Did you also try np.float16?
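For scale, the dense 38018 × 38018 matrix alone needs roughly the following (simple itemsize arithmetic, no allocation):

```python
n = 38018  # profiles in float 3901915
for dtype_name, itemsize in [("float64", 8), ("float32", 4), ("float16", 2)]:
    gib = n * n * itemsize / 1024**3
    print(f"{dtype_name}: {gib:.1f} GiB")
# float64 ≈ 10.8 GiB, float32 ≈ 5.4 GiB, float16 ≈ 2.7 GiB
```

So a smaller dtype only delays the problem; it does not remove the O(n²) growth of the full matrix.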
```python
dist_matrix = haversine_distances(coords_in_radians).astype(np.float32)
```
Here you convert the type after the computation. What is the type of the output of `haversine_distances` as a function of the `coords_in_radians` type? You may need to convert `coords_in_radians` first.
When trying:
```python
lats_in_radians = np.array([radians(_) for _ in ds['lat'].values]).astype(np.float32)
lons_in_radians = np.array([radians(_) for _ in ds['long'].values]).astype(np.float32)
coords_in_radians = np.column_stack((lats_in_radians, lons_in_radians))
dist_matrix = haversine_distances(coords_in_radians)
```
I obtain:
I think the output of `haversine_distances` is always float64.
If I try:
```python
dist_matrix = haversine_distances(coords_in_radians).astype(np.float16)
```
I obtain:
In the new version of the `get_regulargrid_dataset` function, I calculate the distance matrix only for the profiles around the random profile chosen in each iteration. There are no problems with RAM because the matrices are small, and they do not take long to compute:
- float in Gulf Stream (8793 profiles): 13 s
- float in Southern Ocean (16435 profiles): 30 s
- float in Agulhas Current (18391 profiles): 30 s
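The per-iteration computation can be sketched as a single haversine row in NumPy that stays in float32 throughout (a sketch, not the exact function; `haversine_row_km` is a name I made up). This avoids both the full n × n matrix and the float64 output of sklearn's `haversine_distances`:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_row_km(coords, i, dtype=np.float32):
    """Distances (km) from profile i to all profiles; coords is (n, 2) of (lat, lon) in radians."""
    c = np.asarray(coords, dtype=dtype)
    dlat = c[:, 0] - c[i, 0]
    dlon = c[:, 1] - c[i, 1]
    a = np.sin(dlat / 2)**2 + np.cos(c[:, 0]) * np.cos(c[i, 0]) * np.sin(dlon / 2)**2
    return (2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))).astype(dtype)

# Two points on the equator, 90 degrees of longitude apart: a quarter circumference
coords = np.array([[0.0, 0.0], [0.0, np.pi / 2]])
d = haversine_row_km(coords, 0)
print(d.dtype, d[1])  # float32, ~10007.5 km
```

One row costs O(n) memory instead of O(n²), which is why the per-float runtimes above stay in the tens of seconds.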
Example of results:
Calculating the BIC with this subsampling method gives "reasonable" computation times (with 10 runs and testing k = [1:15]):
- float in Gulf Stream (8793 profiles): 4 min
- float in Southern Ocean (16435 profiles): 10 min
- float in Agulhas Current (18391 profiles): 9 min
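For reference, the BIC-versus-k loop can be sketched with scikit-learn's `GaussianMixture` on synthetic data (the real runs use the subsampled profiles; the data and cluster layout here are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical stand-in for the subsampled profile features: two separated clusters
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(5, 1, (200, 3))])

ks = range(1, 16)  # k = [1:15] as in the runs above
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X) for k in ks]
best_k = list(ks)[int(np.argmin(bics))]
print(best_k)  # expected to be small (around 2) for two well-separated clusters
```

Each of the 10 runs would repeat this on a different random subsample, so the total cost scales linearly with the number of runs.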
example (Agulhas Current):