cstjean / ScikitLearn.jl

Julia implementation of the scikit-learn API https://cstjean.github.io/ScikitLearn.jl/dev/

Libstdcxx patch, remove load_boston examples #118

Closed tylerjthomas9 closed 1 year ago

tylerjthomas9 commented 1 year ago

This PR has three changes.

1

Note: Julia v1.8.3 and v1.6 tests will fail on some of the clustering methods due to a bug in the latest scikit-learn version that they can install (v1.1.1). To get around this bug, I artificially limited the scikit-learn version to `v1.0`. This is far from an elegant solution, but I think it is better than installing broken methods.
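As a rough illustration of what such a version cap means, here is a minimal Python sketch. It is not how this PR applies the limit; the helper names `parse_version` and `is_supported` and the exact cap `(1, 1)` are assumptions made up for the example.

```python
import re

# Assumed cap for illustration: accept anything below 1.1 (i.e. the 1.0.x line).
MAX_EXCLUSIVE = (1, 1)


def parse_version(text):
    """Turn a version string such as '1.0.2' into a tuple of ints, (1, 0, 2).

    Only the leading digits of each dot-separated piece are used, so a
    pre-release such as '1.1rc1' is treated like '1.1'.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_supported(version_text):
    """Return True when the version falls below the assumed cap."""
    return parse_version(version_text) < MAX_EXCLUSIVE


print(is_supported("1.0.2"))  # a 1.0.x release
print(is_supported("1.1.1"))  # the release reported as broken above
```

Tuple comparison keeps the check short: `(1, 0, 2) < (1, 1)` holds, while `(1, 1, 1) < (1, 1)` does not.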

2

The second change modifies the libstdcxx patch. The current version does not work with Julia 1.8.4 or Julia 1.9. I am hoping this will be a longer-lasting fix; I borrowed it from CondaPkg.jl.

3

The third change removes the examples and tests that use `load_boston`. scikit-learn has removed this dataset with the following message:

`load_boston` has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source::

    import pandas as pd
    import numpy as np

    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the
Ames housing dataset. You can load the datasets as follows::

    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing()

for the California housing dataset and::

    from sklearn.datasets import fetch_openml
    housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle.
"Racist data destruction?"
<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>

[2] Harrison Jr, David, and Daniel L. Rubinfeld.
"Hedonic housing prices and the demand for clean air."
Journal of environmental economics and management 5.1 (1978): 81-102.
<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>
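
The slicing in the quoted snippet (`[::2]` and `[1::2]`) is needed because each record in the raw file spans two physical rows. A minimal sketch of that step, using plain Python lists and made-up toy values in place of the real file (the row widths here are shortened for illustration and are an assumption, not the real layout):

```python
# Toy stand-in for the parsed file: each record spans two physical rows.
# Even-indexed rows hold the first features of a record; odd-indexed rows
# hold the remaining two features followed by the target, then padding.
raw_rows = [
    [1.0, 2.0, 3.0, 4.0],      # record 0, first part
    [5.0, 6.0, 24.0, None],    # record 0, second part: 2 features, target
    [7.0, 8.0, 9.0, 10.0],     # record 1, first part
    [11.0, 12.0, 21.6, None],  # record 1, second part: 2 features, target
]

first_parts = raw_rows[::2]    # rows 0, 2, ...
second_parts = raw_rows[1::2]  # rows 1, 3, ...

# Join each first part with the first two values of its second part,
# mirroring np.hstack([values[::2, :], values[1::2, :2]]).
data = [a + b[:2] for a, b in zip(first_parts, second_parts)]

# The third value of each second part is the target,
# mirroring values[1::2, 2].
target = [b[2] for b in second_parts]

print(data)
print(target)
```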
codecov-commenter commented 1 year ago

Codecov Report

Base: 67.37% // Head: 67.52% // Increases project coverage by +0.15% :tada:

Coverage data is based on head (3c0eb17) compared to base (69608df). Patch coverage: 67.64% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #118      +/-   ##
==========================================
+ Coverage   67.37%   67.52%   +0.15%
==========================================
  Files          13       13
  Lines         754      773      +19
==========================================
+ Hits          508      522      +14
- Misses        246      251       +5
```

| [Impacted Files](https://codecov.io/gh/cstjean/ScikitLearn.jl/pull/118?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None) | Coverage Δ | |
|---|---|---|
| [src/Skcore.jl](https://codecov.io/gh/cstjean/ScikitLearn.jl/pull/118/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None#diff-c3JjL1NrY29yZS5qbA==) | `77.77% <67.64%> (+0.94%)` | :arrow_up: |
| [src/grid\_search.jl](https://codecov.io/gh/cstjean/ScikitLearn.jl/pull/118/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=None#diff-c3JjL2dyaWRfc2VhcmNoLmps) | `80.14% <0.00%> (-1.20%)` | :arrow_down: |
