North-Seattle-College / ad440-winter2022-thursday-repo

North Seattle College AD 440 Winter 2022 Cloud Practicum class repository
Apache License 2.0

Run a clustering algorithm on the exported data to understand the data #15

Closed toddysm closed 2 years ago

toddysm commented 2 years ago

Description

Implement unsupervised clustering using a Python clustering library. The resulting clusters will help the team understand and use the data from Floop's database more efficiently.

Open questions

  1. What data from Firestore is going to be evaluated?
    • Just Comment Thread data, or Student Engagement Metrics?
  2. How will the data be structured?
    • Check with #3 on how data is exported
  3. What cluster groups will be needed?
  4. Which clustering algorithm is best for the given data, structure, and desired clusters?

Resources

Time Estimate

CalebMcOlin commented 2 years ago

Steps needed:

  1. Compare each pair of items in the list of strings using the Levenshtein distance. The Python library python-Levenshtein can be used for this.
  2. Place the flattened results into a list. Make sure to pair the original string with the calculated distance.
  3. Use the data from the list to graph the points.
  4. Use a clustering method to group the data into clusters of similar items. See above for clustering algorithm options.
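
Steps 1 and 2 can be sketched as follows. This is a minimal illustration in plain Python rather than the python-Levenshtein library, so that it runs without extra dependencies; the function names and the sample comments are assumptions made for the example, not data from the Floop export.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def pairwise_distances(strings):
    """Return a flat list of (string_a, string_b, distance) tuples."""
    return [(a, b, levenshtein(a, b)) for a, b in combinations(strings, 2)]


# Hypothetical sample comments; the real input would come from the export.
comments = ["well done", "well don", "good job", "good jobs"]
pairs = pairwise_distances(comments)
```

For the four sample strings this yields six pairs, the first being `("well done", "well don", 1)`. Keeping both original strings in each tuple makes it easy to trace a distance back to the comments that produced it.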

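NOTE: Damerau-Levenshtein might be more applicable to our situation, because it also counts a transposition (a swap of two adjacent characters) as a single edit. Typos such as "wlel done" and "well done" would therefore be clustered closer together with Damerau-Levenshtein than with standard Levenshtein. Reordered words, such as "well done job" and "job well done", are not adjacent-character swaps, so they would not benefit much and would likely need a word-level comparison instead.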
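
To make the transposition rule concrete, here is a small sketch of the restricted variant of Damerau-Levenshtein (often called optimal string alignment). It is written in plain Python as an assumption for illustration, not taken from the project's code, and `osa_distance` is a hypothetical name. The transposition applies to two adjacent characters, not to whole words.

```python
def osa_distance(a: str, b: str) -> int:
    """Levenshtein distance extended with adjacent-character transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Two adjacent characters swapped: count as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


typo_distance = osa_distance("wlel done", "well done")        # swapped "l" and "e"
reorder_distance = osa_distance("well done job", "job well done")  # moved word
```

Here `typo_distance` is 1, where standard Levenshtein would give 2. The reordered sentence still scores well above 1, since moving a whole word is not a single adjacent swap.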

CalebMcOlin commented 2 years ago

Example using hierarchical clustering.
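
As a rough illustration of the idea, the sketch below combines Levenshtein distance with single-linkage agglomerative clustering in plain Python. It is not the proof-of-concept script from the PR; the `cluster` function, the `threshold` parameter, and the sample comments are all assumptions made for the example. A library such as SciPy (`scipy.cluster.hierarchy.linkage` and `fcluster`) provides the same kind of clustering and would be the more practical choice for real data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def cluster(strings, threshold):
    """Single-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters and stops once the
    closest remaining pair is farther apart than `threshold`.
    """
    clusters = [[s] for s in strings]

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(levenshtein(x, y) for x in c1 for y in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = linkage(clusters[i], clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        if dist > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical sample comments; the real input would come from the export.
comments = ["great work", "good job", "great works", "good jobs",
            "see me after class"]
groups = cluster(comments, threshold=2)
```

With a threshold of 2, the two "great work" variants end up in one group, the two "good job" variants in another, and the unrelated comment stays on its own. The threshold controls how coarse the grouping is and would need tuning against the actual comment data.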

CalebMcOlin commented 2 years ago

Time Spent

| Date | Activity | Time Spent |
| --- | --- | --- |
| 1/21/22 | Researching clustering | 1 hour |
| 1/22/22 | Researching clustering (cont.) | 2 hours |
| 1/24/22 | Researching converting strings to clusterable data (Levenshtein, Jaro-Winkler, etc.) | 2 hours |
| 1/25/22 | Initial test to combine Levenshtein and hierarchical clustering | 1 hour |
| 1/29/22 | Coded proof of concept script / created PR | 5 hours |

Estimated: 10-12 hours
Actual: 11 hours