akharroubi opened this issue 1 year ago
I'll be working on that this week. Thanks @akharroubi.
I've been exploring how to handle missing values in the RF classifier, and I think there are a few options:
In scikit-learn, there is a class sklearn.impute.SimpleImputer that replaces missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value. There is also sklearn.impute.KNNImputer, which completes missing values using k-Nearest Neighbors.
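A minimal sketch of both imputers, using a toy NumPy matrix with NaNs rather than our actual features:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy feature matrix with missing values (NaN), e.g. geometric features per point
X = np.array([[1.0, np.nan, 3.0],
              [2.0, 5.0, np.nan],
              [np.nan, 4.0, 6.0],
              [1.5, 4.5, 5.0]])

# Replace each NaN with the median of its column
median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X)

# Replace each NaN using the values of the k nearest samples (based on the non-missing features)
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)
```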
I'm also working on resolving memory saturation with large datasets. For reading the data, I'm now using chunked reading as implemented in laspy.
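A minimal sketch of that chunked reading, assuming laspy 2.x; the file path and chunk size below are placeholders:

```python
import numpy as np
import laspy

# Read the point cloud in chunks instead of loading everything into memory.
with laspy.open("points.las") as reader:
    for chunk in reader.chunk_iterator(1_000_000):
        # Each chunk exposes the same dimensions as a full read
        xyz = np.vstack((chunk.x, chunk.y, chunk.z)).T
        # ... compute/collect the features for this chunk here ...
```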
For training the model, I think batch learning can be useful. As explained here, the RandomForestClassifier has a warm_start parameter: if it's set to True, "the classifier reuses the solution of the previous call to fit and adds more estimators to the ensemble; otherwise, it just fits a whole new forest".
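A minimal sketch of that batch-learning loop with warm_start, using random placeholder data instead of our real features and labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder batches; in practice each batch would come from one laspy chunk
rng = np.random.default_rng(0)
batches = [(rng.random((1000, 5)), rng.integers(0, 2, 1000)) for _ in range(3)]

# warm_start=True keeps the trees already fitted and only grows new ones
clf = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0)

for i, (X_batch, y_batch) in enumerate(batches):
    if i > 0:
        # add 50 more trees for every additional batch
        clf.set_params(n_estimators=clf.n_estimators + 50)
    clf.fit(X_batch, y_batch)

print(len(clf.estimators_))  # 150 trees after three batches
```

Since the forest keeps growing with every batch, the number of trees added per batch has to be chosen with the final memory footprint in mind.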
For the NaN values generated by CloudCompare (when choosing a fixed radius), I see 2 possible solutions:
Filter these values before reading the file, interpolate them from neighboring points, or do the classification without them and interpolate the classification afterwards (see the sketch below).
Or, if there are no points within a radius r, switch the method to feature calculation based on nearest neighbors.
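A minimal sketch of the "classify without them, then interpolate" option, assuming the point coordinates are available, clf is an already-trained RF, and classify_with_nan_fallback is just a hypothetical helper name:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def classify_with_nan_fallback(xyz, features, clf, k=5):
    # xyz: (n, 3) point coordinates; features: (n, m) CloudCompare features,
    # with NaN where no neighbors fell inside the fixed radius.
    valid = ~np.isnan(features).any(axis=1)
    labels = np.empty(len(xyz), dtype=int)  # assumes integer class codes

    # 1) classify only the points with complete features
    labels[valid] = clf.predict(features[valid])

    # 2) interpolate the remaining points from their nearest classified neighbors
    if (~valid).any():
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(xyz[valid], labels[valid])
        labels[~valid] = knn.predict(xyz[~valid])
    return labels
```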