Shark-ML / Shark

The Shark Machine Learning Library. See more:
http://shark-ml.github.io/Shark/
GNU Lesser General Public License v3.0

shark::Exception size mismatch: x().size() == v().size() #258

Closed. axiqia closed this issue 5 years ago.

axiqia commented 5 years ago

I want to use Random Forest regression for inference on just one sample, so I tried to construct the data as below, but I get an exception.

    RFTrainer trainer;
    RFClassifier model;
    trainer.train(model, data);

    int param[] = {1,1,0,1,0,1,1,1,3686400,3686400,3686400,3686400,1,0,0,1,1,1,3686400,3686400,3686400,3686400,1};
    std::vector<RealVector> onetest(param, param+23);
    Data<RealVector> points = createDataFromRange(onetest);

    Data<RealVector> predictions = model(points);

terminate called after throwing an instance of 'shark::Exception'
  what():  size mismatch: x().size() == v().size()
[1]    17439 abort (core dumped)  ./ExampleProject

Details of the training data: 11383 data points, input dimension 23. Is this the right way to construct data for prediction?

Ulfgard commented 5 years ago
    int param[] = {1,1,0,1,0,1,1,1,3686400,3686400,3686400,3686400,1,0,0,1,1,1,3686400,3686400,3686400,3686400,1};
    std::vector<RealVector> onetest(param, param+23);

This creates a set of 23 vectors; the values in param[] are interpreted as the sizes of those vectors. Just create a single RealVector:

    RealVector single(23);
    std::copy(param, param+23, single.begin());
    unsigned int prediction = model(single);
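
For completeness, a minimal sketch of both single-sample paths (direct vector evaluation and a one-element dataset), assuming the Shark 3.x API already used in this thread; the helper name predictOne and the include paths are illustrative only, and since this is regression the result is a RealVector rather than an unsigned int:

    // Sketch only, assuming the Shark 3.x types used elsewhere in this thread
    // (RealVector, Data<RealVector>, createDataFromRange, RFClassifier).
    #include <shark/Data/Dataset.h>
    #include <shark/Models/Trees/RFClassifier.h>
    #include <algorithm>
    #include <vector>

    using namespace shark;

    // Evaluate a trained random-forest regressor on one 23-dimensional sample.
    RealVector predictOne(RFClassifier const& model, double const* param)
    {
        RealVector single(23);
        std::copy(param, param + 23, single.begin());

        // Direct evaluation of a single input vector ...
        RealVector direct = model(single);

        // ... or the batch interface with a one-element dataset, as in the original code.
        std::vector<RealVector> onetest(1, single);
        Data<RealVector> points = createDataFromRange(onetest);
        Data<RealVector> predictions = model(points);   // same value, element 0

        return direct;
    }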
axiqia commented 5 years ago
>     int param[] = {1,1,0,1,0,1,1,1,3686400,3686400,3686400,3686400,1,0,0,1,1,1,3686400,3686400,3686400,3686400,1};
>     std::vector<RealVector> onetest(param, param+23);
>
> This creates a set of 23 vectors; the values in param[] are interpreted as the sizes of those vectors. Just create a single RealVector:
>
>     RealVector single(23);
>     std::copy(param, param+23, single.begin());
>     unsigned int prediction = model(single);

It does work, thank you so much. It should be pointed out that the model return type is shark::blas::vector<double> in this context. But I have another question. After many tests, I found that the prediction value is never bigger than 20, so I printed the labels of dataTest; all of them are between 10 and 20, which is really different from the original labels. What happened? The following code is written with reference to this.

    RegressionDataset data;
    importCSV(data, argv[1], FIRST_COLUMN, 1, ',');
    cout << "data labels" << endl;
    cout << data.labels() << endl;

    // keep the last 20% of the elements as the test set
    RegressionDataset dataTest = splitAtElement(data, static_cast<std::size_t>(0.8*data.numberOfElements()));
    cout << "test labels" << endl;
    cout << dataTest.labels() << endl;

    cout << "data_after" << endl;
    cout << data.labels() << endl;

// labels of the test data split from the imported dataset
// (the first column is the line number, the third column is the label)
  14235 [1](16.064)
  14236 [1](10.816)
  14237 [1](11.552)
  14238 [1](11.52)
  14239 [1](11.392)
  14240 [1](13.216)
  14241 [1](11.264)
  14242 [1](12.288)
  14243 [1](12.896)
  14244 [1](10.944)
  14245 [1](11.488)
  14246 [1](12.032)
  14247 [1](23.872)
  14248 [1](15.2)
  14249 [1](16.736)
  14250 [1](10.592)
  14251 [1](10.912)
  14252 [1](12.448)
  14253 [1](14.848)
  14254 [1](15.936)
  14255 [1](16.192)
  14256 [1](10.368)
  14257 [1](10.592)
  14258 [1](12.608)
  14259 [1](14.912)
  14260 [1](15.168)
  14261 [1](15.584)
  14262 [1](11.072)
  14263 [1](10.752)
  14264 [1](13.216)
// labels of the data imported from the csv
  17099 [1](49.824)
  17100 [1](173.472)
  17101 [1](54.784)
  17102 [1](49.44)
  17103 [1](173.632)
  17104 [1](48)
  17105 [1](38.368)
  17106 [1](93.536)
  17107 [1](36.64)
  17108 [1](28.512)
  17109 [1](89.12)
  17110 [1](34.4)
  17111 [1](39.328)
  17112 [1](89.472)
  17113 [1](38.624)
  17114 [1](30.496)
  17115 [1](90.112)
  17116 [1](36.416)
  17117 [1](28.64)
  17118 [1](89.024)
  17119 [1](33.536)
  17120 [1](32.768)
  17121 [1](89.536)
  17122 [1](28.064)
  17123 [1](30.752)
  17124 [1](58.048)
  17125 [1](23.104)

I am really sorry to bother you so many times.

Ulfgard commented 5 years ago

"It needs to be pointed out that the model return type is shark::blas::vector in this context." yeah thought you were doing classification and did not realize you were not using shark 4.0

is this now a different issue, just about splitAtElement? I have trouble understanding your print out. are those supposed to be the same values? you can test for yourself by checking whether the last elements of data before splitting are the same as the elements in test after splitting.

If this is the case, there is no bug in shark. I would give you the hint to use data.shuffle() before splitting, because your training dataset might have some type of order.
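
A minimal sketch of that order (shuffle, then split), reusing the importCSV / splitAtElement calls from the snippet above; the 0.8 fraction is just the split already used there:

    // Sketch only: shuffle before splitting so the held-out part is not just the
    // (possibly ordered) tail of the CSV. Same Shark 3.x calls as in the earlier snippet.
    #include <shark/Data/Dataset.h>
    #include <shark/Data/Csv.h>
    #include <iostream>

    using namespace shark;

    int main(int argc, char** argv)
    {
        RegressionDataset data;
        importCSV(data, argv[1], FIRST_COLUMN, 1, ',');   // label in the first column

        data.shuffle();   // randomize element order before splitting

        RegressionDataset dataTest = splitAtElement(
            data, static_cast<std::size_t>(0.8 * data.numberOfElements()));

        // 'data' now keeps the first 80% (training), 'dataTest' the remaining 20%.
        std::cout << data.numberOfElements() << " train / "
                  << dataTest.numberOfElements() << " test" << std::endl;
        return 0;
    }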

axiqia commented 5 years ago

"It needs to be pointed out that the model return type is shark::blas::vector in this context." yeah thought you were doing classification and did not realize you were not using shark 4.0

is this now a different issue, just about splitAtElement? I have trouble understanding your print out. are those supposed to be the same values? you can test for yourself by checking whether the last elements of data before splitting are the same as the elements in test after splitting.

If this is the case, there is no bug in shark. I would give you the hint to use data.shuffle() before splitting, because your training dataset might have some type of order.

I got it, and it is my fault. Thank you so much :)