Ji-Zhang / datacleanbot

MIT License

error #9

Open windar427 opened 5 years ago

windar427 commented 5 years ago

In autoclean:

def autoclean(Xy, dataset_name, features):
    """Auto-cleans data.

    The following aspects are automatically cleaned:
    show important features;
    show statistical information;
    discover the data type for each feature;
    identify the duplicated rows;
    unify the inconsistent column names;
    handle missing values;
    handle outliers.

    Parameters
    ----------
    Xy : array-like
        Complete data.

    dataset_name : string

    features : list
        List of feature names.

    Returns
    -------
    Xy_cleaned : array-like
        Cleaned data.
    """
    X = Xy[:, :-1]
    y = Xy[:, -1]
    show_important_features(X, y, dataset_name, features)
    show_statistical_info(Xy)
    discover_types(Xy)
    Xy = clean_duplicated_rows(Xy)
    features = unify_name_consistency(features)
    features_new, Xy_filled = handle_missing(features, Xy)
    Xy_cleaned = handle_outlier(features_new, Xy_filled)
    return Xy_cleaned
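
For reference, a minimal call might look like this (toy data and names are hypothetical; it assumes Xy is a NumPy array with the target in the last column):

    import numpy as np

    # Hypothetical toy dataset: two features plus the target in the last column.
    features = ['f1', 'f2', 'target']
    Xy = np.array([[1.0, 2.0, 0.0],
                   [3.0, 4.0, 1.0],
                   [3.0, 4.0, 1.0]])  # duplicated row on purpose

    Xy_cleaned = autoclean(Xy, 'toy_dataset', features)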

1. X should be X = Xy.iloc[:, :-1] and y should be y = Xy.iloc[:, -1].
2. When the data is not yet clean, should we change the run order and move show_important_features to the end?
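
A quick illustration of the difference (hypothetical data): positional slicing with [:, :-1] works on a NumPy array but raises a TypeError on a pandas DataFrame, which needs iloc instead.

    import numpy as np
    import pandas as pd

    Xy_arr = np.array([[1, 2, 0], [3, 4, 1]])
    Xy_df = pd.DataFrame(Xy_arr, columns=['f1', 'f2', 'target'])

    X = Xy_arr[:, :-1]       # works: NumPy positional slicing
    X = Xy_df.iloc[:, :-1]   # works: positional slicing on a DataFrame
    # X = Xy_df[:, :-1]      # raises TypeError: DataFrames don't support this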

Ji-Zhang commented 5 years ago

Hi @windar427

  1. Could you please explain a bit more why iloc should be used? I am a bit confused, as I think they do the same thing.
  2. Yes, that is a good idea. I did think about this, but the point of show_important_features is to give the user a general impression of the dataset beforehand, so it deliberately ignores problems such as missing values.
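
One way to support both input types (a sketch, not necessarily how datacleanbot should fix it; _split_xy is a hypothetical helper): coerce to a NumPy array before slicing.

    import numpy as np

    def _split_xy(Xy):
        # np.asarray leaves an ndarray unchanged and converts a DataFrame
        # to its underlying values, so positional slicing works for both.
        Xy = np.asarray(Xy)
        return Xy[:, :-1], Xy[:, -1]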
windar427 commented 5 years ago

Thanks.

  1. I use pandas to load the data, and pandas uses iloc for positional slicing.
  2. I found another error, in:

    if len(np.unique(y)) > 100 or len(np.unique(y)) > 0.1 * y.shape[0]:
        print("regression")
        print("meta features cannot be extracted as the target is not categorical")
    # if classification
    else:
        # print("classification")
        metafeatures_clf = {}

        # compute clustering performance metafeatures
        (metafeatures_clf['silhouette'],
         metafeatures_clf['calinski_harabaz'],
         metafeatures_clf['davies_bouldin']) = compute_clustering_metafeatures(X)

        # compute landmarking metafeatures
        metafeatures_clf['naive_bayes'], metafeatures_clf['naive_bayes_time'] = \
            pipeline(X, y, GaussianNB())
        metafeatures_clf['linear_discriminant_analysis'], metafeatures_clf['linear_discriminant_analysis_time'] = \
            pipeline(X, y, LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto'))
        metafeatures_clf['one_nearest_neighbor'], metafeatures_clf['one_nearest_neighbor_time'] = \
            pipeline(X, y, KNeighborsClassifier(n_neighbors=1))
        metafeatures_clf['decision_node'], metafeatures_clf['decision_node_time'] = \
            pipeline(X, y, DecisionTreeClassifier(criterion='entropy', splitter='best',
                                                  max_depth=1, random_state=0))
        metafeatures_clf['random_node'], metafeatures_clf['random_node_time'] = \
            pipeline(X, y, DecisionTreeClassifier(criterion='entropy', splitter='random',
                                                  max_depth=1, random_state=0))
        metafeatures = list(metafeatures_clf.values())

    return metafeatures

For a regression task, metafeatures is never assigned, so the return statement fails with an UnboundLocalError.
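
The failure mode in miniature: a local variable assigned in only one branch raises UnboundLocalError when the other branch runs.

    def extract(is_classification):
        if is_classification:
            metafeatures = [1, 2, 3]
        return metafeatures  # UnboundLocalError when is_classification is False

    extract(False)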

Ji-Zhang commented 5 years ago

Yes, that is the case. datacleanbot can only handle supervised classification tasks for now; classification labels are required to compute the metafeatures.

windar427 commented 5 years ago

So that case should either be ignored, or metafeatures should be set to None.

Ji-Zhang commented 5 years ago

> So that case should either be ignored, or metafeatures should be set to None.

Ah yes. Thanks! I will update that.
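
A sketch of the likely change (the actual commit may differ): initialize metafeatures before the branch, so the regression path returns None instead of failing.

    metafeatures = None  # default when the target is not categorical (regression)
    if len(np.unique(y)) > 100 or len(np.unique(y)) > 0.1 * y.shape[0]:
        print("regression")
        print("meta features cannot be extracted as the target is not categorical")
    else:
        metafeatures_clf = {}
        # ... compute clustering and landmarking metafeatures as above ...
        metafeatures = list(metafeatures_clf.values())

    return metafeatures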

windar427 commented 5 years ago

In abda there is no module called abda.bin.spstd_model_ha1.py.

Ji-Zhang commented 5 years ago

Only a small part of the Bayesian model is used in datacleanbot.