ttsesm opened this issue 3 years ago
Hi @YyzHarry, I have a regression dataset as a CSV file containing all numeric values. Please let me know whether https://github.com/YyzHarry/imbalanced-regression/tree/main/imdb-wiki-dir will work for my requirement? Thanks
Hi @ttsesm Yes - that could serve as an example!
Hi @snigdhasen Yes, I believe that is a complete codebase, and you might only need to modify the data loading part (and maybe the network you choose to use).
@YyzHarry I found some time and went through the paper, your blog post, and the links you pointed me to, but I still do not get how you apply the LDS/FDS distribution smoothing in practice.
So I would appreciate it if you could give a step-by-step guide on how this is done. I think this would be helpful for others as well.
For example, in my case I have a dataset of point clouds where for each point I have a feature vector, e.g.:
-0.471780000000000 0.702420000000000 0.291670000000000 156.716000000000 0.800000000000000 0.800000000000000 0.800000000000000 1 0 0 0.0111600000000000 0 0 0 8.47483000000000 0 0
-0.471780000000000 0.826370000000000 0.216670000000000 139.612000000000 0.800000000000000 0.800000000000000 0.800000000000000 1 0 0 0.0111600000000000 0 0 0 8.61834000000000 0 0
0.471780000000000 0.280970000000000 0.458330000000000 195.465000000000 0.800000000000000 0.800000000000000 0.800000000000000 -1 0 0 0.0111600000000000 0 0 0 8.56491000000000 0 0
0.206920000000000 -0.239650000000000 0 670.182010000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 1 0.0110800000000000 0 0 0 8.63796000000000 0 0
0.455220000000000 0.727210000000000 0.500000000000000 107.883000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 -1 0.0110800000000000 0 0 0 8.65391000000000 0 0
-0.231750000000000 -0.801580000000000 0 250.761000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 1 0.0110800000000000 0 0 0 8.37285000000000 0 0
0.471780000000000 0.760260000000000 0.0416700000000000 176.562000000000 0.800000000000000 0.800000000000000 0.800000000000000 -1 0 0 0.0111600000000000 0 0 0 8.35862000000000 0 0
-0.157260000000000 0.735470000000000 0.500000000000000 141.367000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 -1 0.0110800000000000 0 0 0 8.64104000000000 0 0
0.306240000000000 0.305760000000000 0 710.883970000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 1 0.0110800000000000 0 0 0 8.81857000000000 0 0
0.355900000000000 0.280970000000000 0.500000000000000 235.098010000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 -1 0.0110800000000000 0 0 0 8.36165000000000 0 0
-0.281410000000000 0.314020000000000 0.500000000000000 208.985990000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 -1 0.0110800000000000 0 0 0 8.43708000000000 0 0
0.438670000000000 0.636310000000000 0.500000000000000 132.513000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 -1 0.0110800000000000 0 0 0 8.68539000000000 0 0
-0.471780000000000 0.925540000000000 0.308330000000000 108.584000000000 0.800000000000000 0.800000000000000 0.800000000000000 1 0 0 0.0111600000000000 0 0 0 8.79508000000000 0 0
0.389010000000000 0.909010000000000 0.500000000000000 96.3420000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 -1 0.0110800000000000 0 0 0 8.47030000000000 0 0
0.0827700000000000 -0.909010000000000 0 203.560000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 1 0.0110800000000000 0 0 0 8.19117000000000 0 0
0.140710000000000 -0.677630000000000 0.500000000000000 199.156010000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 -1 0.0110800000000000 0 0 0 8.42757000000000 0 0
0.107600000000000 0.256180000000000 0 710.012020000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 1 0.0110800000000000 0 0 0 9.49238000000000 0 0
-0.289690000000000 -0.834640000000000 0 236.399000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 1 0.0110800000000000 0 0 0 8.34452000000000 0 0
0.430390000000000 -0.115690000000000 0 591.968990000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 1 0.0110800000000000 0 0 0 9.08948000000000 0 0
-0.0910400000000000 0.925540000000000 0 152.154010000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 1 0.0110800000000000 0 0 0 8.71381000000000 0 0
0.215200000000000 -0.942070000000000 0.0166700000000000 247.403000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 1 0 0.0111700000000000 0 0 0 8.14043000000000 0 0
0.339350000000000 -0.553670000000000 0.500000000000000 198.897000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 -1 0.0110800000000000 0 0 0 8.21610000000000 0 0
0.471780000000000 0.462770000000000 0.0916700000000000 399.609010000000 0.800000000000000 0.800000000000000 0.800000000000000 -1 0 0 0.0111600000000000 0 0 0 9.02757000000000 0 0
-0.240030000000000 -0.561930000000000 0.500000000000000 253.405000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 -1 0.0110800000000000 0 0 0 8.36224000000000 0 0
-0.314520000000000 -0.190070000000000 0 1255.18604000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 1 0.0110800000000000 0 0 0 11.6615100000000 0 0
-0.430390000000000 0.165270000000000 0.500000000000000 219.422000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 -1 0.0110800000000000 0 0 0 8.10539000000000 0 0
-0.355900000000000 0.859430000000000 0 136.401990000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 1 0.0110800000000000 0 0 0 8.59122000000000 0 0
-0.389010000000000 0.942070000000000 0.141670000000000 176.037000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 -1 0 0.0111700000000000 0 0 0 8.54202000000000 0 0
-0.306240000000000 -0.776790000000000 0.500000000000000 170.912990000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 -1 0.0110800000000000 0 0 0 8.26907000000000 0 0
-0.00828000000000000 0.942070000000000 0.258330000000000 211.325000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 -1 0 0.0111700000000000 0 0 0 8.38170000000000 0 0
0.471780000000000 0.0909000000000000 0.366670000000000 405.196010000000 0.800000000000000 0.800000000000000 0.800000000000000 -1 0 0 0.0111600000000000 0 0 0 8.98865000000000 0 0
-0.157260000000000 -0.578460000000000 0 492.231990000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 1 0.0110800000000000 0 0 0 9.21356000000000 0 0
0.0331100000000000 -0.859430000000000 0 226.514010000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 1 0.0110800000000000 0 0 0 8.01525000000000 0 0
0.00828000000000000 0.752000000000000 0 214.614000000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 1 0.0110800000000000 0 0 0 8.42254000000000 0 0
0.471780000000000 -0.00826000000000000 0.0916700000000000 521.054990000000 0.800000000000000 0.800000000000000 0.800000000000000 -1 0 0 0.0111600000000000 0 0 0 9.74422000000000 0 0
0.264860000000000 -0.231380000000000 0.500000000000000 235.503010000000 0.800000000000000 0.800000000000000 0.800000000000000 0 0 -1 0.0110800000000000 0 0 0.0329700000000000 7.40915000000000 0.844320000000000 0
...
Now I want to regress the values of column 4, but these values are imbalanced and vary over the range 0-10000. For the sample above, for example, I have split my values into groups, grps = [0 100 250 500 750 1000 2000 5000 10000],
and as you can see the majority of my values lie in the range 250-500:
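(In case it is useful, this is roughly how I compute the grouping above with numpy; the targets array here is just a random placeholder so the snippet runs on its own, in practice it would be the column-4 values.)

import numpy as np

# Placeholder targets just to make this runnable; replace with the real column-4 values.
rng = np.random.default_rng(0)
targets = rng.gamma(shape=2.0, scale=200.0, size=5000)

grps = [0, 100, 250, 500, 750, 1000, 2000, 5000, 10000]
counts, _ = np.histogram(targets, bins=grps)
for lo, hi, c in zip(grps[:-1], grps[1:], counts):
    print(f"[{lo:>5d}, {hi:>5d}): {c} samples")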
Now the question is how to apply LDS/FDS based on the values in column 4. Is this done before you load the data into the data loader, or afterwards, while you are doing the training/testing?
Thanks.
P.S. I also attach an example of a point cloud with the corresponding complete feature vectors, just in case it is useful: pcd1.txt
Hi @YyzHarry, any feedback regarding my question above, and possibly a step-by-step guide on how to apply LDS/FDS?
@ttsesm Sorry for the late reply!
Now the question is how to apply LDS/FDS based on the values in column 4. Is this done before you load the data into the data loader, or afterwards, while you are doing the training/testing?
This is done after you load the data. For LDS, basically you first get the histogram as you show here for the labels, then we apply smoothing to estimate another "effective" density. After this, typically, LDS is used with loss re-weighting --- you have a weight for each sample to balance the loss. In our case, the implementation for the aforementioned steps can be found here.
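To make those steps a bit more concrete, here is a rough, self-contained sketch (this is not the exact code from our repo; the bin layout, kernel settings, and the function name are only illustrative):

import numpy as np
from scipy.ndimage import convolve1d

def lds_weights(labels, num_bins=100, ks=5, sigma=2.0):
    # 1) empirical label density: histogram over equal-width bins
    edges = np.linspace(labels.min(), labels.max(), num_bins + 1)
    bin_idx = np.clip(np.digitize(labels, edges) - 1, 0, num_bins - 1)
    emp_density = np.bincount(bin_idx, minlength=num_bins).astype(float)

    # 2) smooth the histogram with a symmetric (here Gaussian) kernel -> "effective" density
    half = (ks - 1) // 2
    kernel = np.exp(-0.5 * (np.arange(-half, half + 1) / sigma) ** 2)
    kernel /= kernel.sum()
    eff_density = convolve1d(emp_density, weights=kernel, mode='constant')

    # 3) re-weight each sample by the inverse of its effective density, rescaled to mean 1
    w = 1.0 / np.maximum(eff_density[bin_idx], 1e-8)
    return w * len(w) / w.sum()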
For FDS, it is done during training --- it is just a module like BatchNorm, inserted into your neural network (see example here). And after each training epoch, you will update the running statistics and smoothed statistics (example here). FDS does not depend on how your labels are distributed (it does not need the histogram for computation), but you need to first define the number of bins (see the initialization of the FDS module here; bucket_num is how many bins you need).
Hope these help. Let me know if you have further questions.
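If it helps, here is a schematic of where the FDS module sits and when its statistics are updated (this is not the exact API of our fds.py; please check the linked files for the real signatures):

import torch.nn as nn

# Schematic only: `fds_module` stands for the FDS calibration layer from fds.py.
class RegressorWithFDS(nn.Module):
    def __init__(self, encoder, feature_dim, fds_module):
        super().__init__()
        self.encoder = encoder        # any backbone producing (batch, feature_dim) features
        self.fds = fds_module
        self.head = nn.Linear(feature_dim, 1)

    def forward(self, x, targets=None, epoch=None):
        feats = self.encoder(x)
        if self.training and targets is not None:
            # calibrate the features bin-by-bin with the smoothed statistics
            feats = self.fds.smooth(feats, targets, epoch)
        return self.head(feats), feats

# After every training epoch (schematic):
#   1. run the encoder over the whole training set and collect (features, labels)
#   2. update the FDS running statistics and the smoothed statistics with them,
#      as done in the linked training script.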
@YyzHarry thanks a lot for the feedback, it was indeed helpful. So as I understand it, with LDS you create a weight for each target (label) value, which you then use to balance the loss in a way like the following (this is also what I got from the pseudo code in the supplementary material of the paper; below I use L1Loss as an example):
def forward(self, x, y, weights):
    errors = torch.abs(x - y)
    return torch.mean(errors * weights)
I played a bit with the LDS, based also on the link that you provided and I created the following running toy example in order to obtain the weights:
import os
import logging
import numpy as np
from scipy.ndimage import convolve1d
from torch.utils import data
import pandas as pd

from utils import get_lds_kernel_window


def _prepare_weights(labels, reweight, max_target=121, lds=False, lds_kernel='gaussian', lds_ks=5, lds_sigma=2):
    assert reweight in {'none', 'inverse', 'sqrt_inv'}
    assert reweight != 'none' if lds else True, \
        "Set reweight to 'sqrt_inv' (default) or 'inverse' when using LDS"

    value_dict = {x: 0 for x in range(max_target)}
    # labels = self.df['age'].values
    for label in labels:
        value_dict[min(max_target - 1, int(label))] += 1
    if reweight == 'sqrt_inv':
        value_dict = {k: np.sqrt(v) for k, v in value_dict.items()}
    elif reweight == 'inverse':
        value_dict = {k: np.clip(v, 5, 1000) for k, v in value_dict.items()}  # clip weights for inverse re-weight
    num_per_label = [value_dict[min(max_target - 1, int(label))] for label in labels]
    if not len(num_per_label) or reweight == 'none':
        return None
    print(f"Using re-weighting: [{reweight.upper()}]")

    if lds:
        lds_kernel_window = get_lds_kernel_window(lds_kernel, lds_ks, lds_sigma)
        print(f'Using LDS: [{lds_kernel.upper()}] ({lds_ks}/{lds_sigma})')
        smoothed_value = convolve1d(
            np.asarray([v for _, v in value_dict.items()]), weights=lds_kernel_window, mode='constant')
        num_per_label = [smoothed_value[min(max_target - 1, int(label))] for label in labels]

    weights = [np.float32(1 / x) for x in num_per_label]
    scaling = len(weights) / np.sum(weights)
    weights = [scaling * x for x in weights]
    return weights


def main():
    data = pd.read_csv("./pcd1.txt", header=None, delimiter=',', low_memory=False).to_numpy(dtype='float')
    labels = data[:, 3]

    weights = _prepare_weights(labels, reweight='sqrt_inv', lds=True, lds_kernel='gaussian', lds_ks=5, lds_sigma=2)
    return


if __name__ == '__main__':
    print('Start!!!!')
    main()
    print('End!!!!')
    os._exit(0)
which seems to work fine.
I have a couple of questions, though, for which I couldn't find the answer (or I might have overlooked it):
1. What is the difference between the two re-weighting options, i.e. inverse and sqrt_inv, and why should I choose one over the other?
2. Except for the gaussian kernel, I noticed that there are also the triang and laplace options. Do these make any major difference to the calculated weights, and is there any specific reason to choose one over the other?
3. What is the effect of the max_target hyperparameter, and why do you have 121 as the default value?
4. Is clipping necessary (for the inverse option as well as the max_target parameter)? For example, in my case my target values may vary from 0 to 25000, where the amount of values above 1500 is quite small. My guess is that for these values the weight will be quite high, so clipping them to a lower value wouldn't have an effect, or would it?

What is the difference between the two re-weighting options, i.e. inv and sqrt_inv, and why should I choose one over the other?
Actually, we use sqrt_inv by default for certain tasks (like age estimation). The details of these baselines can be found on Page 6 of the paper. Both sqrt_inv and inv belong to the category of cost-sensitive re-weighting methods; the reason to sometimes use the square-root inverse is that after plain inverse re-weighting, some weights might be very large (e.g., consider 5,000 images for age 30 and only 1 image for age 100; after inverse re-weighting, the weight ratio could be extremely high). This could cause optimization problems. Again, the choice also depends on the task you are tackling.
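As a quick numeric illustration of that ratio argument (the counts are made up):

import numpy as np

counts = {30: 5000, 100: 1}   # hypothetical: 5,000 images at age 30, 1 image at age 100
inv = {k: 1.0 / v for k, v in counts.items()}
sqrt_inv = {k: 1.0 / np.sqrt(v) for k, v in counts.items()}

print(inv[100] / inv[30])          # ratio of 5000x between the rare and frequent bin
print(sqrt_inv[100] / sqrt_inv[30])  # ratio of ~71x, much easier to optimize with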
Except for the gaussian kernel, I noticed that there are also the triang and laplace options. Do these make any major difference to the calculated weights, and is there any specific reason to choose one over the other?
These just provide more choices. In Appendix E.1 of our paper, we studied some choices of kernel types. Overall, they should give similar results, but some might be better in certain tasks.
What is the effect of the max_target hyperparameter, and why do you have 121 as the default value?
The number is just based on the label distribution of this particular age dataset. Since the number of samples with age larger than 120 is very small, we can just aggregate them and assign the same weight. The reason is as you said: by applying re-weighting, we do not want the weight to be too high and cause optimization issues.
Is clipping necessary?
Your understanding is correct. This is related to the questions above.
Regarding FDS and the number of bins, as I understood it this depends on the extremes of your values. Is that correct? For age, for example, you consider the ages 0-99, so you have 100 bins. In my case, I guess that since my values vary from 0 up to 25000, my number of bins should cover that range, right?
Yes, your understanding is correct. As for your case, it also depends on the minimum resolution you care about (i.e., the bin size). For age, the minimum resolution we care about is 1 year, so there are 100 bins if we consider the ages 0-99. If the minimum resolution that matters to you is 10, your number of bins could accordingly be 2500. A smaller number of bins will make the statistics estimation more accurate, as more samples are considered in each bin. Again, the choice should depend on the task you are tackling.
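As a small sketch of that bucketing logic (the range 0-25000 and the resolution of 10 are only example values):

import numpy as np

def target_to_bin(targets, t_min=0.0, t_max=25000.0, resolution=10.0):
    # the number of bins follows from the target range and the resolution you care about
    num_bins = int(np.ceil((t_max - t_min) / resolution))          # 2500 here
    idx = np.floor((np.asarray(targets) - t_min) / resolution).astype(int)
    return np.clip(idx, 0, num_bins - 1), num_bins

bins, num_bins = target_to_bin([3.0, 4999.9, 25000.0])
print(bins, num_bins)   # -> [   0  499 2499] 2500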
Hi @YyzHarry, thanks for the feedback and your time. I will try to play a bit with the different settings and I will let you know if I have any further questions.
Hi @YyzHarry, thanks for your GitHub link. The following image is the label/target distribution of my regression dataset. I tried your Boston dataset Colab notebook and applied it to my dataset.
I am getting the following output, from which I understand that the MSE value is not gradually decreasing; in fact it fluctuates a lot. Please let me know: do I need to add some extra lines of code/customization?
Hi @snigdhasen It seems the loss is gradually decreasing (though very slowly). I guess the value in the parentheses is the average value for the MSE/L1 loss.
@YyzHarry Thanks. Yes, that's the average loss. But the MSE is too high, around 0.99. Can you suggest any customization to reduce the loss here? The L1 loss is OK, around 0.39.
Hi @YyzHarry, I want to use LDS/FDS to estimate job processing time, but the time distribution has a relatively large range and I care more about the samples with small times, so I want to use the log of the duration as the unit for the bin size. Can I do this? What are the requirements for symmetric kernels, and what kind of adjustments would need to be made to the hyperparameters?
Hi @zhaosongyi - this is an interesting point. In our work, we use a symmetric kernel since we assume the distance with respect to an anchor point in the target space should not depend on the sign (e.g., for age estimation, a 10-year-old and a 14-year-old have the same distance to a 12-year-old). Another implicit benefit of symmetric kernels is that they are theoretically guaranteed to make the distribution "smoother" (i.e., it has a lower Lipschitz constant).
Going back to your case, when you apply a log transformation to the target labels (and if we still assume the distance for the original processing-time labels does not depend on the sign), I guess you might want to try an asymmetric kernel. A simple implementation with a Gaussian kernel could be a combination of two half Gaussians with different \sigma, where you have a larger \sigma for the left half and a smaller one for the right half.
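A possible sketch of such an asymmetric window (the kernel size and the two sigmas are illustrative; the result can be plugged in wherever a symmetric LDS window is used):

import numpy as np

def asymmetric_gaussian_window(ks=9, sigma_left=3.0, sigma_right=1.0):
    assert ks % 2 == 1, "use an odd kernel size so there is a central anchor bin"
    half = ks // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # wider Gaussian to the left of the anchor, narrower one to the right
    sigmas = np.where(offsets < 0, sigma_left, sigma_right)
    window = np.exp(-0.5 * (offsets / sigmas) ** 2)
    return window / window.sum()

print(asymmetric_gaussian_window())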
@YyzHarry Hi, I applied only LDS on my dataset but I am not seeing any improvement in training or validation. Do I need to apply both FDS and LDS on a Boston-like dataset? @ttsesm if this method worked for you, please ping me at maneeshsagar97@gmail.com
Hi @YyzHarry, I have a question about using only LDS on my dataset; however, errors are always reported when it runs. I would like to know whether there are specific format requirements for the input data. I'm dealing with spatiotemporal data with longitude and latitude; I don't know whether that will work?
Hi, I have a question about training on my custom dataset. My dataset has target values in the range (0, 4), and the bin size is 0.1. The training loss seems fine, however the validation loss is crazy. Debugging shows that the output of the model is very big, like 2e+15.
Can you give me some idea about what is happening?
I'm trying to run the train.py in the nyud2-dir directory you provided, and I'm getting negative weight values, which causes the final calculated loss to be negative as well.
I would also like to ask what the meaning of TRAIN_BUCKET_NUM is. How is this value calculated?
Hi @YyzHarry,
I am trying to adapt the example from https://github.com/YyzHarry/imbalanced-regression/tree/main/agedb-dir to my custom model and data. Thus, I would like to ask whether this would be feasible and, if yes, whether there is any example showing explicitly how to do that.
Thanks.