OxfordML / GPz

GPz 2.0: Heteroscedastic Gaussian processes for uncertain and incomplete data

Strange Results - New Simulated Data Set. #4

Closed: andrewczgithub closed this issue 3 years ago

andrewczgithub commented 5 years ago

Hi @csibrahim @bperezorozco @a5a ,

I am trying the algorithm on a new data set and I am getting some strange results. The data set has been simulated, and I am using the demo_photoz script as a template. I am getting a large RMSE and I am not sure where I am going wrong in my analysis. I have attached the MATLAB code below. I was wondering if you could take a look at my logic and advise on improvements accordingly.

Sincere thanks in advance. Best, Andrew

```matlab
clear all; clc;
rng(1);                           % fix random seed
addpath GPz/                      % path to GPz
addpath(genpath('minFunc_2012/')) % path to minfunc

%%%%%%%%%%%%%% Model options %%%%%%%%%%%%%%%%

m = 1000;               % number of basis functions to use [required]

method = 'VD';          % select a method, options = GL, VL, GD, VD, GC and VC [required]

heteroscedastic = true; % learn a heteroscedastic noise process, set to false if only interested in point estimates [default=true]

normalize = true;       % pre-process the input by subtracting the means and dividing by the standard deviations [default=true]

maxIter = 500;          % maximum number of iterations [default=200]
maxAttempts = 50;       % maximum iterations to attempt if there is no progress on the validation set [default=infinity]

trainSplit = 0.2;       % percentage of data to use for training
validSplit = 0.2;       % percentage of data to use for validation
testSplit  = 0.6;       % percentage of data to use for testing

inputNoise = true;      % false = use mag errors as additional inputs, true = use mag errors as additional input noise

csl_method = 'balanced'; % cost-sensitive learning option [default='normal']:
                         % 'balanced':   weigh rare samples more heavily during training
                         % 'normalized': assign an error cost for each sample = 1/(z+1)
                         % 'normal':     no weights assigned, all samples are equally important

binWidth = 0.1;          % the width of the bin for 'balanced' cost-sensitive learning [default=range(output)/100]

%%%%%%%%%%%%%% Prepare data %%%%%%%%%%%%%%

% new data set
n = 500; % sample size
Mdl = regARIMA('MA',{1.4,0.8},'AR',0.5,'Intercept',3,...
    'Variance',1,'Beta',[2;-1.5],'D',1);

rng(1); % for reproducibility
X = randn(n,2);

Y = simulate(Mdl,n,'X',X);

figure;
plot(Y);
title 'Simulated Responses';
axis tight;

[n,d] = size(X);
filters = d/2;

% you can also select the size of each sample
% [training,validation,testing] = sample(n,10000,10000,10000);

% get the weights for cost-sensitive learning
omega = getOmega(Y,csl_method,binWidth);

if(inputNoise)
    % treat the mag-errors as input noise variance
    Psi = X(:,filters+1:end).^2;
    X(:,filters+1:end) = [];
else
    % treat the mag-errors as additional inputs
    X(:,filters+1:end) = log(X(:,filters+1:end));
    Psi = [];
end

% split data into training, validation and testing
[training,validation,testing] = sample(n,trainSplit,validSplit,testSplit);

%%%%%%%%%%%%%% Fit the model %%%%%%%%%%%%%%

% initialize the model
model = init(X,Y,method,m,'omega',omega,'training',training,'heteroscedastic',heteroscedastic,'normalize',normalize,'Psi',Psi);

% train the model
model = train(model,X,Y,'omega',omega,'training',training,'validation',validation,'maxIter',maxIter,'maxAttempts',maxAttempts,'Psi',Psi);

%%%%%%%%%%%%%% Compute Metrics %%%%%%%%%%%%%%

% use the model to generate predictions for the test set
[mu,sigma,nu,beta_i,gamma] = predict(X,model,'Psi',Psi,'selection',testing);

% mu     = the best point estimate
% nu     = variance due to data density
% beta_i = variance due to output noise
% gamma  = variance due to input noise
% sigma  = nu+beta_i+gamma

% compute metrics
errors = Y(testing)-mu;

hist(errors)
plot(mu)
hold on
plot(Y(testing))

% root mean squared error, i.e. sqrt(mean(errors.^2))
rmse = sqrt(metrics(Y(testing),mu,sigma,@(y,mu,sigma) (y-mu).^2));

% mean log likelihood: mean(-0.5*errors.^2./sigma - 0.5*log(sigma) - 0.5*log(2*pi))
mll = metrics(Y(testing),mu,sigma,@(y,mu,sigma) -0.5*(y-mu).^2./sigma - 0.5*log(sigma) - 0.5*log(2*pi));

% fraction of data where |z_spec-z_phot|/(1+z_spec)<0.15
fr15 = metrics(Y(testing),mu,sigma,@(y,mu,sigma) 100*(abs(y-mu)./(y+1)<0.15));

% fraction of data where |z_spec-z_phot|/(1+z_spec)<0.05
fr05 = metrics(Y(testing),mu,sigma,@(y,mu,sigma) 100*(abs(y-mu)./(y+1)<0.05));

% bias, i.e. mean(errors)
bias = metrics(Y(testing),mu,sigma,@(y,mu,sigma) y-mu);

% print metrics for the entire data
fprintf('RMSE\t\tMLL\t\tFR15\t\tFR05\t\tBIAS\n')
fprintf('%f\t%f\t%f\t%f\t%f\n',rmse(end),mll(end),fr15(end),fr05(end),bias(end))

%%%%%%%%%%%%%% Display Results %%%%%%%%%%%%%%%%

% reduce the sample for efficient plotting
[x,y,color,counts] = reduce(Y(testing),mu,sigma,200);

figure; scatter(x,y,12,log(color),'s','filled');  title('Uncertainty'); xlabel('Spectroscopic Redshift'); ylabel('Photometric Redshift'); colormap jet;
figure; scatter(x,y,12,log(counts),'s','filled'); title('Density');     xlabel('Spectroscopic Redshift'); ylabel('Photometric Redshift'); colormap jet;

% plot the change in metrics as functions of data percentage
x = [1 5:5:100];
ind = round(x*length(rmse)/100);

figure; plot(x,rmse(ind),'o-'); xlabel('Percentage of Data'); ylabel('RMSE');
figure; plot(x,mll(ind),'o-');  xlabel('Percentage of Data'); ylabel('MLL');
figure; plot(x,fr05(ind),'o-'); xlabel('Percentage of Data'); ylabel('FR05');
figure; plot(x,fr15(ind),'o-'); xlabel('Percentage of Data'); ylabel('FR15');
figure; plot(x,bias(ind),'o-'); xlabel('Percentage of Data'); ylabel('BIAS');

% plot mean and standard deviation of different scores as functions of
% spectroscopic redshift using 20 bins
[centers,means,stds] = bin(Y(testing),Y(testing)-mu,20);
figure; errorbar(centers,means,stds,'s'); xlabel('Spectroscopic Redshift'); ylabel('Bias');

[centers,means,stds] = bin(Y(testing),sqrt(nu),20);
figure; errorbar(centers,means,stds,'s'); xlabel('Spectroscopic Redshift'); ylabel('Model Uncertainty');

[centers,means,stds] = bin(Y(testing),sqrt(beta_i),20);
figure; errorbar(centers,means,stds,'s'); xlabel('Spectroscopic Redshift'); ylabel('Noise Uncertainty');
```
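As the comments in the script note, `sigma` is the total predictive variance, the sum of the density, output-noise and input-noise components. A quick sanity check of that decomposition (a sketch, assuming the outputs of the `predict` call above are in scope):

```matlab
% sigma should equal the sum of the three variance components, up to
% floating-point round-off
assert(max(abs(sigma - (nu + beta_i + gamma))) < 1e-8);
```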

csibrahim commented 5 years ago

I am not sure what your objective is or what data is being simulated, but bear in mind that this script is optimized for photometric-redshift applications. You might have to play with the 'binWidth' parameter: I suspect that a value of 0.1 is too small for your application, which makes the CSL approach assign weights that are too high, and this is not taken into account when displaying the results. When weighted accordingly it should perform better; alternatively, consider switching 'csl_method' from 'balanced' to 'normal'.
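Both suggested changes are one-line edits to the options block of the script above; a minimal sketch (variable names follow the demo script):

```matlab
% Option 1: let the bin width scale with the output range (the default)
binWidth = range(Y)/100;   % instead of the hard-coded 0.1

% Option 2: disable cost-sensitive weighting entirely
csl_method = 'normal';     % all samples weighted equally

% either way, recompute the weights before calling init/train
omega = getOmega(Y,csl_method,binWidth);
```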

andrewczgithub commented 5 years ago

Hi @csibrahim, thank you! With your advice I was able to solve the issue. Just some questions regarding the experimental set-up: I am using the model for a university experiment, and I am wondering how to use it for prediction on a new data set.

I am using the Data_GlobalIdx1 dataset, which ships with MATLAB and contains multiple time series of the major market indices.

I am using your model to predict the S&P 500 based on the other major indices. However, prediction on the test set assumes we have the future values of the other indices, doesn't it?

Hence my model would be incorrect?

Would you possibly be able to give some advice and guidance on the topic?

Many thanks, Best, Andrew
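One standard way to avoid conditioning on future values is to predict the target at time t from the other indices at time t-1 (or earlier lags), so the model only ever sees information that would have been available at prediction time. A sketch of building such a lagged design matrix (assuming, as in the script below, that `Data` is ordered oldest-to-newest with the S&P 500 in column 6):

```matlab
lag = 1;                          % predict from the previous day's closings
Y = Data(lag+1:end, 6);           % target index at time t
X = Data(1:end-lag, [1:2 5]);     % other indices at time t-lag
[n,d] = size(X);                  % n shrinks by 'lag' rows
```

For a genuine forecasting evaluation, the training/validation/testing split should also be chronological rather than random, so that the test period lies entirely after the training period.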

```matlab
load Data_GlobalIdx1 % import daily index closings

figure
plot(dates, ret2price(price2ret(Data)))
datetick('x')
xlabel('Date')
ylabel('Index Value')
title('Relative Daily Index Closings')
legend(series, 'Location', 'NorthWest')

figure
corrplot(DataTable)

Y = Data(:,6);
X = Data(:,[1:2 5]);

[n,d] = size(X);
filters = floor(d/2);

% get the weights for cost-sensitive learning
omega = getOmega(Y,csl_method,binWidth);

if(inputNoise)
    % treat the mag-errors as input noise variance
    Psi = X(:,filters+1:end).^2;
    X(:,filters+1:end) = [];
else
    % treat the mag-errors as additional inputs
    X(:,filters+1:end) = log(X(:,filters+1:end));
    Psi = [];
end

% have a look at feature selection of the dataset
% this can be done in a number of different ways

% split data into training, validation and testing
[training,validation,testing] = sample(n,trainSplit,validSplit,testSplit);

%%%%%%%%%%%%%% Fit the model %%%%%%%%%%%%%%

tic

% initialize the model
model = init(X,Y,method,m,'omega',omega,'training',training,'heteroscedastic',heteroscedastic,'normalize',normalize,'Psi',Psi);

% train the model
model = train(model,X,Y,'omega',omega,'training',training,'validation',validation,'maxIter',maxIter,'maxAttempts',maxAttempts,'Psi',Psi);

%%%%%%%%%%%%%% Compute Metrics %%%%%%%%%%%%%%

% use the model to generate predictions for the test set
[mu,sigma,nu,beta_i,gamma] = predict(X,model,'Psi',Psi,'selection',testing);

toc
```