akharroubi opened this issue 1 year ago
I'll be working on that this week. Thanks @akharroubi.
I've been exploring how to handle missing values in the RF classifier, and I think there are a few options:
In scikit-learn, there is a class sklearn.impute.SimpleImputer that replaces missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value. There is also sklearn.impute.KNNImputer, which completes missing values using k-Nearest Neighbors.
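A minimal sketch of both imputers, using a toy NumPy matrix with NaNs rather than our actual features:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy feature matrix with missing values (NaN), e.g. geometric features per point
X = np.array([[1.0, np.nan, 3.0],
              [2.0, 5.0, np.nan],
              [np.nan, 4.0, 6.0],
              [1.5, 4.5, 5.0]])

# Replace each NaN with the median of its column
median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X)

# Replace each NaN using the values of the k nearest samples (based on the non-missing features)
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)
```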
I'm also working on resolving memory saturation with large datasets. For reading the data, I'm now using chunked reading as implemented in laspy.
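A minimal sketch of that chunked reading, assuming laspy 2.x; the file path and chunk size below are placeholders:

```python
import numpy as np
import laspy

# Read the point cloud in chunks instead of loading everything into memory.
with laspy.open("points.las") as reader:
    for chunk in reader.chunk_iterator(1_000_000):
        # Each chunk exposes the same dimensions as a full read
        xyz = np.vstack((chunk.x, chunk.y, chunk.z)).T
        # ... compute/collect the features for this chunk here ...
```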
For training the model, I think batch learning can be useful. As explained here, the RandomForestClassifier has a warm_start parameter: if it's set to True, "the classifier reuses the solution of the previous call to fit and adds more estimators to the ensemble; otherwise, it just fits a whole new forest".
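A minimal sketch of that batch-learning loop with warm_start, using random placeholder data instead of our real features and labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder batches; in practice each batch would come from one laspy chunk
rng = np.random.default_rng(0)
batches = [(rng.random((1000, 5)), rng.integers(0, 2, 1000)) for _ in range(3)]

# warm_start=True keeps the trees already fitted and only grows new ones
clf = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0)

for i, (X_batch, y_batch) in enumerate(batches):
    if i > 0:
        # add 50 more trees for every additional batch
        clf.set_params(n_estimators=clf.n_estimators + 50)
    clf.fit(X_batch, y_batch)

print(len(clf.estimators_))  # 150 trees after three batches
```

Since the forest keeps growing with every batch, the number of trees added per batch has to be chosen with the final memory footprint in mind.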
For the NaN values generated by CloudCompare (when choosing a fixed radius), I see 2 possible solutions:
Filter these values before reading the file, interpolate them from neighboring points, or do the classification without them and interpolate the classification afterwards (see the sketch below).
Or, if there are no points within a radius r, switch the method to feature calculation based on nearest neighbors.
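A minimal sketch of the "classify without them, then interpolate" option, assuming the point coordinates are available, clf is an already-trained RF, and classify_with_nan_fallback is just a hypothetical helper name:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def classify_with_nan_fallback(xyz, features, clf, k=5):
    # xyz: (n, 3) point coordinates; features: (n, m) CloudCompare features,
    # with NaN where no neighbors fell inside the fixed radius.
    valid = ~np.isnan(features).any(axis=1)
    labels = np.empty(len(xyz), dtype=int)  # assumes integer class codes

    # 1) classify only the points with complete features
    labels[valid] = clf.predict(features[valid])

    # 2) interpolate the remaining points from their nearest classified neighbors
    if (~valid).any():
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(xyz[valid], labels[valid])
        labels[~valid] = knn.predict(xyz[~valid])
    return labels
```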