Ji-Zhang / datacleanbot

MIT License

error #9

Open windar427 opened 5 years ago

windar427 commented 5 years ago

In autoclean:

def autoclean(Xy, dataset_name, features):
    """Auto-cleans data.

    The following aspects are automatically cleaned:
    show important features;
    show statistical information;
    discover the data type for each feature;
    identify the duplicated rows;
    unify the inconsistent column names;
    handle missing values;
    handle outliers.

    Parameters
    ----------
    Xy : array-like
        Complete data.

    dataset_name : string

    features : list
        List of feature names.

    Returns
    -------
    Xy_cleaned : array-like
        Cleaned data.
    """
    X = Xy[:, :-1]
    y = Xy[:, -1]
    show_important_features(X, y, dataset_name, features)
    show_statistical_info(Xy)
    discover_types(Xy)
    Xy = clean_duplicated_rows(Xy)
    features = unify_name_consistency(features)
    features_new, Xy_filled = handle_missing(features, Xy)
    Xy_cleaned = handle_outlier(features_new, Xy_filled)
    return Xy_cleaned
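
For reference, a minimal call might look like this (toy data and names are hypothetical; it assumes Xy is a NumPy array with the target in the last column):

    import numpy as np

    # Hypothetical toy dataset: two features plus the target in the last column.
    features = ['f1', 'f2', 'target']
    Xy = np.array([[1.0, 2.0, 0.0],
                   [3.0, 4.0, 1.0],
                   [3.0, 4.0, 1.0]])  # duplicated row on purpose

    Xy_cleaned = autoclean(Xy, 'toy_dataset', features)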

1. X should be X = Xy.iloc[:, :-1] and y should be y = Xy.iloc[:, -1].
2. When the data is not yet clean, should we change the run order and move show_important_features to the end?
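
A quick illustration of the difference (hypothetical data): positional slicing with [:, :-1] works on a NumPy array but raises a TypeError on a pandas DataFrame, which needs iloc instead.

    import numpy as np
    import pandas as pd

    Xy_arr = np.array([[1, 2, 0], [3, 4, 1]])
    Xy_df = pd.DataFrame(Xy_arr, columns=['f1', 'f2', 'target'])

    X = Xy_arr[:, :-1]       # works: NumPy positional slicing
    X = Xy_df.iloc[:, :-1]   # works: positional slicing on a DataFrame
    # X = Xy_df[:, :-1]      # raises TypeError: DataFrames don't support this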

Ji-Zhang commented 5 years ago

Hi @windar427

  1. Could you please explain a bit more why iloc should be used? I am a bit confused, as I think they do the same thing.
  2. Yes, that is a good idea. I did think about this, but the point of show_important_features is to give the user a general impression of the dataset beforehand, so it deliberately ignores problems such as missing values.
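
One way to support both input types (a sketch, not necessarily how datacleanbot should fix it; _split_xy is a hypothetical helper): coerce to a NumPy array before slicing.

    import numpy as np

    def _split_xy(Xy):
        # np.asarray leaves an ndarray unchanged and converts a DataFrame
        # to its underlying values, so positional slicing works for both.
        Xy = np.asarray(Xy)
        return Xy[:, :-1], Xy[:, -1]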
windar427 commented 5 years ago

Thanks.

  1. I use pandas to load the data, and pandas uses iloc for positional slicing.
  2. I found another error, in:

    if len(np.unique(y)) > 100 or len(np.unique(y)) > 0.1 * y.shape[0]:
        print("regression")
        print("meta features cannot be extracted as the target is not categorical")
    # if classification
    else:
        # print("classification")
        metafeatures_clf = {}

        # compute clustering performance metafeatures
        (metafeatures_clf['silhouette'],
         metafeatures_clf['calinski_harabaz'],
         metafeatures_clf['davies_bouldin']) = compute_clustering_metafeatures(X)

        # compute landmarking metafeatures
        metafeatures_clf['naive_bayes'], metafeatures_clf['naive_bayes_time'] = \
            pipeline(X, y, GaussianNB())
        metafeatures_clf['linear_discriminant_analysis'], metafeatures_clf['linear_discriminant_analysis_time'] = \
            pipeline(X, y, LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto'))
        metafeatures_clf['one_nearest_neighbor'], metafeatures_clf['one_nearest_neighbor_time'] = \
            pipeline(X, y, KNeighborsClassifier(n_neighbors=1))
        metafeatures_clf['decision_node'], metafeatures_clf['decision_node_time'] = \
            pipeline(X, y, DecisionTreeClassifier(criterion='entropy', splitter='best',
                                                  max_depth=1, random_state=0))
        metafeatures_clf['random_node'], metafeatures_clf['random_node_time'] = \
            pipeline(X, y, DecisionTreeClassifier(criterion='entropy', splitter='random',
                                                  max_depth=1, random_state=0))
        metafeatures = list(metafeatures_clf.values())

    return metafeatures

For a regression task, metafeatures is never assigned, so the return statement fails with an UnboundLocalError.
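
The failure mode in miniature: a local variable assigned in only one branch raises UnboundLocalError when the other branch runs.

    def extract(is_classification):
        if is_classification:
            metafeatures = [1, 2, 3]
        return metafeatures  # UnboundLocalError when is_classification is False

    extract(False)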

Ji-Zhang commented 5 years ago

Yes, that is the case. datacleanbot can only handle supervised classification tasks for now; classification labels are required to compute the metafeatures.

windar427 commented 5 years ago

So that case should either be ignored, or metafeatures should be set to None.

Ji-Zhang commented 5 years ago

> So that case should either be ignored, or metafeatures should be set to None.

Ah yes. Thanks! I will update that.
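
A sketch of the likely change (the actual commit may differ): initialize metafeatures before the branch, so the regression path returns None instead of failing.

    metafeatures = None  # default when the target is not categorical (regression)
    if len(np.unique(y)) > 100 or len(np.unique(y)) > 0.1 * y.shape[0]:
        print("regression")
        print("meta features cannot be extracted as the target is not categorical")
    else:
        metafeatures_clf = {}
        # ... compute clustering and landmarking metafeatures as above ...
        metafeatures = list(metafeatures_clf.values())

    return metafeatures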

windar427 commented 5 years ago

In abda there is no module called abda.bin.spstd_model_ha1.py.

Ji-Zhang commented 5 years ago

Only a small part of the Bayesian model is used in datacleanbot.