JackyXu-Cool / Team-2130-Machine-Learning-Roulette

Georgia Tech Junior Design Project

Upload Stage Adjustment #49

Closed JackyXu-Cool closed 1 year ago

JackyXu-Cool commented 1 year ago

In this PR, I made "Select ML model" the first step of the upload process and "Upload dataset" the second step. However, the number of datasets needed won't yet adjust based on the selected ML model. I will focus on that in the next PR.

Questions about the datasets needed:

  1. KMeans: just one training dataset and one Y-label dataset. (The Y-label is optional.)
  2. Decision Tree: one training dataset, one testing dataset, one training Y-label dataset, and one testing Y-label dataset.
  3. Naive Bayes: one training dataset and one Y-label dataset. (Both are required.)
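To make the upload step adjust per model, the requirements above could be captured in a simple lookup table. This is just a sketch; the names (`REQUIRED_UPLOADS`, `uploads_needed`) are hypothetical, not part of the actual backend:

```python
# Hypothetical mapping from model name to the datasets its upload step asks for.
# The dataset labels ("X", "y", etc.) are illustrative, not the real schema.
REQUIRED_UPLOADS = {
    "kmeans": {"required": ["X"], "optional": ["y"]},
    "decision_tree": {"required": ["X_train", "X_test", "y_train", "y_test"], "optional": []},
    "naive_bayes": {"required": ["X", "y"], "optional": []},
}

def uploads_needed(model):
    """Return (number of required uploads, number of optional uploads) for a model."""
    spec = REQUIRED_UPLOADS[model]
    return len(spec["required"]), len(spec["optional"])
```

The front end could then render the right number of upload slots from this table instead of hard-coding a count.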

@ruokun-niu Do I understand the logic correctly? I will start working on this if it looks good to you.

ruokun-niu commented 1 year ago

Thanks @JackyXu-Cool for doing this! I have added my comments below:

> KMeans: We need just one training dataset and one Y-label dataset. (Y-label is optional)

This is correct. If the user wants to calculate the accuracy (and perhaps some other metrics), they will need to upload a Y-label. Otherwise, clustering can be done with just the X dataset.

Hierarchical clustering: similar to KMeans.

> Naive Bayes: training dataset and y-label (required)

Yes, this is correct. Technically we need all four datasets (xtrain, xtest, ytrain, ytest), but I took a look at @lhyelinn's code and realized that she has already performed the data splitting herself (namely https://github.com/JackyXu-Cool/Team-2130-Machine-Learning-Roulette/blob/master/mlr_backend/naivebayes/naivebayes.py#L74). This allows users to upload just two datasets: X and the Y-label. I think this is a pretty neat feature.

> Decision Tree: one training dataset, one testing dataset, one training Y-label dataset, and one testing y-label dataset.

As of right now, we need all four datasets. However, after taking a look at Hyelin's code (sparks of inspiration, yay!), I think we can do something similar with data splitting. There are two approaches we can go with:

  1. Keep everything as is: ask the users to upload four datasets if they want to use dtree.
  2. Ask the users to upload two datasets (the X dataset and the Y-label) and split them into training and testing data ourselves. I see two advantages here. First, we keep everything consistent: for any ML algorithm, the user will only ever need to upload at most two datasets. (A varying number of uploads can be confusing for users and is also harder to keep track of in the database.) Second, we can add a new optional input parameter, training %, that lets the user decide how much of the dataset is used for training versus testing. This may help the user diagnose overfitting and underfitting. For example, if a user enters 70 for this parameter, 70% of their uploaded dataset will be used for training and the rest for testing. I think this is a neat and easy feature to add. Alternatively, we can skip the parameter and split on a pre-determined value.
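The training-% idea above could be sketched roughly as follows. This is a minimal illustration, not the actual backend code; the function name and signature are hypothetical:

```python
import random

def split_dataset(X, y, train_pct=70, seed=0):
    """Split X and y into train/test sets by a user-supplied training percentage.

    train_pct is the percentage of rows used for training (e.g. 70 -> 70% train).
    A fixed seed keeps the shuffle reproducible across requests.
    """
    if not 0 < train_pct < 100:
        raise ValueError("train_pct must be strictly between 0 and 100")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    cut = round(len(X) * train_pct / 100)
    train_idx, test_idx = indices[:cut], indices[cut:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test
```

With `train_pct=70` and 10 rows, this yields 7 training rows and 3 testing rows; the pre-determined-value fallback mentioned above is just this function called with a fixed `train_pct`.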

@lhyelinn @hloneal @honeal3 Can you double-check that what I said about your algorithms is correct? Also, cc @mmmmartyzhao @Timiport to get their input on the idea of adding a new parameter for data splitting.

Feel free to ping me anytime regarding this :smiling_face_with_three_hearts:

Timiport commented 1 year ago

I think adding a parameter for data splitting is the better idea. It will allow a uniform front-end layout for every ML algorithm.