USi-IPEM / Demo-Cell-Classification

Classification
Apache License 2.0

Save vector in separate CSV file #9

Closed arpm92 closed 3 years ago

arpm92 commented 3 years ago

It would be useful for further applications to have the input vector of the SVM as a CSV file.

It would roughly follow this structure:

item; x ; y ; z ; conv_speed ; qc

where each row corresponds to one sample.
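
A minimal sketch of what such an export could look like, using only the Python standard library. The sample values and the `write_samples` helper are made up for illustration; only the column names and the semicolon separator come from the structure above.

```python
import csv
import io

# Hypothetical samples as (x, y, z, conv_speed, qc); values are illustrative only.
samples = [
    (0.12, 0.40, 1.05, 0.5, 1),
    (0.10, 0.38, 1.02, 0.5, 0),
]

def write_samples(rows, handle):
    """Write one row per sample, prefixed with a running item index."""
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    writer.writerow(["item", "x", "y", "z", "conv_speed", "qc"])
    for item, row in enumerate(rows):
        writer.writerow([item, *row])

# Write to an in-memory buffer; pass an open file handle to write to disk instead.
buffer = io.StringIO()
write_samples(samples, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```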

v0lta commented 3 years ago

I have added the plotting to https://github.com/manubrain/Demo-Cell-Classification/blob/master/classifier/data_loader.py. Running the file visualizes all input samples. Is this sufficient? If not, I will add the file output.

arpm92 commented 3 years ago

It would be good to have them in a separate file. That would be useful for another implementation we are building with MS Azure.

v0lta commented 3 years ago

26aa9ea2c5e9beeddeb8bb86de19d4712a8db5f2 does this.

arpm92 commented 3 years ago

26aa9ea does this.

If I am not mistaken, it creates the files 'x' and 'y' in a folder 'input'. Am I right?

If so, I think the format differs from the one described above:

item; x ; y ; z ; conv_speed ; qc

v0lta commented 3 years ago

You are right, it is almost the same format:

x: no | drop_white_pos_y | drop_white_pos_z | drop_white_pos_x | drop_black_pos_x | drop_black_pos_y | drop_black_pos_z | max_belt

y: no | quality

arpm92 commented 3 years ago

Would it be possible to set:

x: no | drop_white_pos_y | drop_white_pos_z | drop_white_pos_x | drop_black_pos_x | drop_black_pos_y | drop_black_pos_z | max_belt

y: no | quality | Distance x | Distance y | Distance Abs

There is also a problem with the max_belt. It is set to 0.

Would it be possible to have it all in one CSV? Applications in Azure ML Studio work better with all the information in one file.

no|drop_white_pos_y | drop_white_pos_z | drop_white_pos_x | drop_black_pos_x | drop_black_pos_y | drop_black_pos_z | max_belt | quality | Distance x | Distance y | Distance Abs
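
A rough sketch of how the two tables could be joined into one file with this column order. The distance definitions are assumptions (white minus black drop position per axis, and the Euclidean norm of the x and y differences for Distance Abs), and `merge_rows` is a hypothetical helper, not a function from data_loader.py.

```python
import csv
import io
import math

def merge_rows(x_rows, y_rows):
    """Join x and y rows on 'no' and append the distance columns."""
    quality = {row["no"]: row["quality"] for row in y_rows}
    merged = []
    for row in x_rows:
        # Assumed definitions: white minus black drop position per axis.
        dx = row["drop_white_pos_x"] - row["drop_black_pos_x"]
        dy = row["drop_white_pos_y"] - row["drop_black_pos_y"]
        merged.append({
            **row,
            "quality": quality[row["no"]],
            "Distance x": dx,
            "Distance y": dy,
            "Distance Abs": math.hypot(dx, dy),
        })
    return merged

# Illustrative values only.
x_rows = [{
    "no": 0,
    "drop_white_pos_y": 6.0, "drop_white_pos_z": 0.0, "drop_white_pos_x": 4.0,
    "drop_black_pos_x": 1.0, "drop_black_pos_y": 2.0, "drop_black_pos_z": 0.0,
    "max_belt": 0.5,
}]
y_rows = [{"no": 0, "quality": 1}]

merged = merge_rows(x_rows, y_rows)

# Write everything into a single CSV (in memory here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(merged[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(merged)
all_in_one = buffer.getvalue()
print(all_in_one)
```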

v0lta commented 3 years ago

8fa039669c1385a69a0d27e01d2f22b19e331212 does this. I think we currently may not have faster belt speeds in the data because of #15.

arpm92 commented 3 years ago

After fixing #15 and changing the path_lst, the conv belt speed still remains at 0 (all_in_one.csv).

v0lta commented 3 years ago

It's normalized; see also: https://github.com/manubrain/Demo-Cell-Classification/blob/bc223b6fed61855a438c8294b8d1da1d6e802a70/classifier/data_loader.py#L218

v0lta commented 3 years ago

Zero should mean average speed. This typically works well if all input sequences contain more or less all values with the same distribution. I did it because our inputs differ vastly in scale, and this is an attempt to equalize them. I am not sure how many data points there are in each file, or whether the initial assumption holds. If it does not, we should define a better normalization function.
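
To illustrate why a column can end up at zero everywhere, here is a small sketch. It assumes the normalization is a per-file z-score (subtract the file's mean, divide by its standard deviation); the `eps` guard and the sample values are assumptions for illustration, not taken from data_loader.py.

```python
import statistics

def zscore(values, eps=1e-8):
    """Standardize a list of values with its own mean and standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # eps avoids a division by zero when all values are identical.
    return [(v - mean) / (std + eps) for v in values]

# Two hypothetical files, each recorded at a single belt speed.
file_a = [0.5, 0.5, 0.5]
file_b = [1.0, 1.0, 1.0]

# Normalizing each file on its own removes the difference between the files.
per_file = zscore(file_a) + zscore(file_b)
print(per_file)
```

If each file holds only one speed, every value equals its file's mean, so the speed information is lost after normalization.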

v0lta commented 3 years ago

Do you know how many different speeds we have in the data set?

arpm92 commented 3 years ago

Do you know how many different speeds we have in the data set?

If my assumption is correct, we should have just one value per dataset for the conv speed. Perhaps normalization is not the right way to go for conv_speed.

v0lta commented 3 years ago

You are correct. I will get to this tonight.

v0lta commented 3 years ago

I have fixed the normalization in 8b37561b59099a3165df8f0bb505a8ddc208e972; it now considers the global mean and standard deviation over the training data. Additionally, normalization can now be set to false in the DataLoader: https://github.com/manubrain/Demo-Cell-Classification/blob/8b37561b59099a3165df8f0bb505a8ddc208e972/classifier/data_loader.py#L20
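
A minimal sketch of the idea, not the code from the commit: the statistics are computed once over the training values and then reused for the test split. The function names, the `eps` guard, and the sample speeds are assumptions for illustration.

```python
import statistics

def fit_normalizer(train_values):
    """Compute the global mean and standard deviation from the training data only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def normalize(values, mean, std, eps=1e-8):
    """Apply previously fitted statistics; eps guards against a zero std."""
    return [(v - mean) / (std + eps) for v in values]

# Hypothetical belt speeds pooled from several files.
train_speeds = [0.5, 0.5, 1.0, 1.0]
test_speeds = [0.5, 1.0]

mean, std = fit_normalizer(train_speeds)
train_norm = normalize(train_speeds, mean, std)
test_norm = normalize(test_speeds, mean, std)
print(mean, std, train_norm, test_norm)
```

With pooled statistics, slow and fast files map to different normalized values instead of all collapsing to zero.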

v0lta commented 3 years ago

Regarding the output files, the code now produces x_all.csv (no normalization), x_train.csv (normalized training input), x_test.csv (normalized test split), y_all.csv (no normalization), y_train.csv (training labels), and y_test.csv (test label split), as well as all_in_one.csv, which contains all values without normalization. If normalization is deactivated, it is removed from the train and test splits as well. You may want to do that if the code you are going to use does its own standardization.

arpm92 commented 3 years ago

Thank you for your support. I am reviewing the output in all_in_one.csv; however, some of the conv belt speeds are still zero. Is it because the sample does not have any conv_belt speed, or is there another problem?

Some data points also have values outside the normal range. Do you know whether this occurs only at export, or whether it might also affect the training data?

arpm92 commented 3 years ago

By analyzing the data from all_in_one.csv I noticed something that might be a problem.

When Quality is 1, da should satisfy da > 1.5 and equal sqrt((dx)^2+(dy)^2). However, that does not hold in the data. Could it be that dx, dy and dz are taken from the normalized vector?
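
A sketch of the kind of check I mean, under some assumptions: the columns are named as requested earlier (quality, Distance x, Distance y, Distance Abs), the file is comma-separated, and the rows below are made-up examples rather than real data. `find_inconsistent` is a hypothetical helper.

```python
import csv
import io
import math

def find_inconsistent(rows, threshold=1.5, tol=1e-6):
    """Return the 'no' of rows where da != sqrt(dx^2 + dy^2), or quality is 1 but da <= threshold."""
    bad = []
    for row in rows:
        dx = float(row["Distance x"])
        dy = float(row["Distance y"])
        da = float(row["Distance Abs"])
        mismatch = not math.isclose(da, math.hypot(dx, dy), abs_tol=tol)
        below = float(row["quality"]) == 1 and da <= threshold
        if mismatch or below:
            bad.append(row["no"])
    return bad

# Made-up rows; replace the StringIO with open("all_in_one.csv", newline="") for the real file.
sample = io.StringIO(
    "no,quality,Distance x,Distance y,Distance Abs\n"
    "0,1,3.0,4.0,5.0\n"
    "1,1,1.0,1.0,0.2\n"
    "2,0,0.3,0.4,0.5\n"
)
inconsistent = find_inconsistent(csv.DictReader(sample))
print(inconsistent)
```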

v0lta commented 3 years ago

Thank you for your support. I am reviewing the output in all_in_one.csv; however, some of the conv belt speeds are still zero. Is it because the sample does not have any conv_belt speed, or is there another problem?

Some data points also have values outside the normal range. Do you know whether this occurs only at export, or whether it might also affect the training data?

Currently the code substitutes zero for missing values.