exercism / python

Exercism exercises in Python.
https://exercism.org/tracks/python
MIT License

K-means clusters using raw python #2999

Closed SSahas closed 2 years ago

SSahas commented 2 years ago

Hello, I am Sahas. I would like to create a Python exercise that implements the k-means clustering algorithm in raw Python. Please tell me whether this is worth considering, whether it is too much for an exercise, or whether it needs any changes.

K-means algorithm: The k-means clustering algorithm computes centroids and repeats the computation until the centroids stop changing. The number of clusters is assumed to be known in advance. It is also known as a flat clustering algorithm. The letter 'K' in k-means denotes the number of clusters the method finds in the data.
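As a rough illustration of the loop described above, here is a minimal sketch that uses only the standard library. It is not the code from the linked repository; the function name `k_means`, its parameters, and the choice to seed the starting centroids from the input points are all assumptions made for this example.

```python
import math
import random


def k_means(points, k, max_iterations=100, seed=0):
    """Cluster points (tuples of floats) into k groups. Illustrative sketch only."""
    rng = random.Random(seed)           # assumption: fixed seed so repeated runs agree
    centroids = rng.sample(points, k)   # assumption: start from k of the input points

    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)

        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for index, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                )
            else:
                new_centroids.append(centroids[index])  # empty cluster: keep old centroid

        if new_centroids == centroids:  # no centroid moved, so stop
            break
        centroids = new_centroids

    # clusters holds the result of the most recent assignment step.
    return centroids, clusters
```

Calling `k_means(points, k=2)` on two well-separated groups of points returns one centroid near each group, along with the points assigned to each. `math.dist` requires Python 3.8 or later.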

[Image: clusters GIF]

The code is available here:

SSahas / exercism-problem

The code mainly uses:

github-actions[bot] commented 2 years ago

🤖   🤖

Hi! 👋🏽 👋 Welcome to the Exercism Python Repo!

Thank you for opening an issue! 🐍  🌈 ✨


  ◦ If you'd also like to make a PR to fix the issue, please have a quick look at the Pull Requests doc.
    We 💙 PRs that follow our Exercism & Track contributing guidelines!


💛  💙  While you are here... If you decide to help out with other open issues, you have our gratitude 🙌 🙌🏽.
Anything tagged with [help wanted] and without [Claimed] is up for grabs.
Comment on the issue and we will reserve it for you. 🌈 ✨

BethanyG commented 2 years ago

Hi @SSahas 👋🏽

Thanks for filing this issue, and for stepping forward to (possibly) design an exercise for Exercism!

TL;DR: See the specifications for practice exercises and the specifications for concept exercises. Additionally, we use pytest as the test runner for the track, so all tests would need to use unittest syntax and be runnable via pytest. For additional considerations, see the Python Contributing Docs.

While having an algorithm implementation like this might be interesting, I do have some concerns:

  1. This is an implementation of the algorithm with sample data, but to be meaningful to students we've found that solving a specific problem is more engaging and leads to better learning. K-means can be used for spam filtering, fraud detection, audience segmentation, signal processing, image segmentation, and recommendations - among other things. What problem would you center this on, and what would the data and problem statement for it look like?

  2. This isn't "pure Python" or "raw Python" in a "traditional" sense -- your implementation uses NumPy, pandas, Jupyter, and Matplotlib. That isn't bad -- but it does mean the use of libraries beyond the Python standard lib. Since current Exercism tooling for the website only supports core Python, we'd need to do some work to support the loading of external libraries such as NumPy and pandas. And even with that work, we wouldn't support the use of Jupyter Notebooks or JupyterLab (they include a whole web stack and other complex considerations), and might not be able to support Matplotlib in a very effective way, due to its visual nature.

  3. In addition to website tooling, we have the issue of walking students through what they would need to set up to work on the problem via the CLI. There are certainly ways to do this, but it is additional work.

  4. Running the steps of K-means repeatedly to reach optimum partitioning may not fit within the performance needs of our current platform. We'd need code and tests to execute in a maximum of ~10s before timing out. There are also the cases where a k-means implementation never reaches optimum, so we need to be careful of that in the construction of the data set.

  5. K-means partitioning is not deterministic. Outcomes vary depending on the number and position of the starting centroids and the number of iterations the algorithm goes through. That presents some challenges for student verification, testing, and feedback. I'd want to see what tests looked like for this problem, and run them over multiple solutions before we released anything on the platform/to students.

  6. As it stands now, this isn't a programming or algorithm challenge as much as it is one of deciding how to apply or tune k-means. It also feels as though we'd have to point students at a lot of "prep" documentation, or have a lot of explanation as a setup to this coding challenge. While I am not opposed to an ML or Data Science branch for the Python track, I don't know that I would start with k-means as a first problem, so I'd want some background from you on where you see this problem fitting into the current Python track, and what the supporting documentation/explanation for it would look like.
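To make points 4 and 5 concrete, here is one possible shape for tests, offered as a sketch and not as a track decision. It assumes a hypothetical exercise function `k_means(points, centroids, max_iterations)` that takes the starting centroids as an argument (which removes the random start) and caps the number of iterations (which bounds the runtime). The tests use unittest syntax, which pytest can collect and run, and they compare sorted centroids so that cluster order does not matter.

```python
import math
import unittest


def k_means(points, centroids, max_iterations=50):
    """Hypothetical exercise function: starting centroids are passed in."""
    centroids = list(centroids)
    for _ in range(max_iterations):     # hard cap keeps the runtime bounded
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if updated == centroids:        # converged
            break
        centroids = updated
    return centroids


class KMeansTest(unittest.TestCase):
    # Two well-separated groups, chosen so the expected answer is unambiguous.
    POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
              (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]

    def test_fixed_start_gives_fixed_result(self):
        result = k_means(self.POINTS, [(1.0, 1.0), (9.0, 9.0)])
        expected = [(4 / 3, 4 / 3), (28 / 3, 28 / 3)]
        for got, want in zip(sorted(result), expected):
            for got_axis, want_axis in zip(got, want):
                self.assertAlmostEqual(got_axis, want_axis)

    def test_centroid_order_does_not_matter(self):
        forward = k_means(self.POINTS, [(1.0, 1.0), (9.0, 9.0)])
        reverse = k_means(self.POINTS, [(9.0, 9.0), (1.0, 1.0)])
        self.assertEqual(sorted(forward), sorted(reverse))
```

A data set like this sidesteps the cases where k-means settles on a poor partition, but real test data would still need to be checked against several student solutions.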

I also think that the R, Julia, C, JS, Ruby and Go languages (among others) have some pretty powerful tools for both ML and data science, so limiting this problem to a Python-only implementation feels wrong to me. So I think the best strategy would be to discuss this as a more generic practice exercise, rather than a Python concept exercise.

So - I am not saying no outright, but I would like some more details. Looking forward to hearing them. 😄

SSahas commented 2 years ago

Hello @BethanyG,

Thanks for your response 😄. I am glad you are showing interest. I am not sure, but I will try to resolve these issues as soon as possible. However, I cannot implement this algorithm in R, Julia, Ruby, C, JS, Go, etc., because I have not learned those languages. So maybe it should be a practice exercise.

SSahas commented 2 years ago

Hey @BethanyG, I think this is hard and will take time 😅. So I think we should stop this.

BethanyG commented 2 years ago

@SSahas - I'll close for now. But feel free to re-open, should you want to work on this!