Trogluddite / loombreaker

Tools for building Topic-Specific Web Indexes (CS-480 Capstone)
MIT License

build cosine similarity comparator (or investigate other quality heuristics) #28

Open Trogluddite opened 10 months ago

Trogluddite commented 10 months ago

This is considered a stretch goal for the CS-480 project, but it's important in the longer term.

We need a way to understand how similar the Markov-produced text is to the source documents.

This will help us measure the quality of the results our bot produces, and will probably also be useful for citation ranking.

I propose that we adopt a cosine similarity approach, in which the Markov-produced result is compared to each of a set of source documents.
An example can be found in the loombreaker storyboard, under 'Example Cosine Similarity Comparison': https://github.com/Trogluddite/loombreaker/blob/main/storyboards/component_storyboard_description.md

An article about cosine similarity with some Python examples can be found here: https://towardsdatascience.com/a-complete-beginners-guide-to-document-similarity-algorithms-75c44035df90
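
To make the proposal concrete, here is a minimal sketch of what the comparator might look like. It is not taken from the loombreaker codebase or the storyboard: the function names (`tokenize`, `cosine_similarity`, `rank_sources`) and the sample texts are hypothetical, and it assumes plain term-frequency vectors with no stop-word removal or TF-IDF weighting, which we would likely want to revisit.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only terms present in both texts contribute to the dot product.
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def rank_sources(generated, sources):
    """Return (name, score) pairs sorted from most to least similar.

    `sources` is assumed to be a dict mapping a document name to its text.
    """
    scores = [(name, cosine_similarity(generated, body))
              for name, body in sources.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    generated_text = "looms weave thread into cloth"
    source_docs = {
        "weaving.txt": "a loom is used to weave thread into cloth",
        "cooking.txt": "simmer the onions until they are soft",
    }
    for name, score in rank_sources(generated_text, source_docs):
        print(f"{name}: {score:.3f}")
```

Scores fall between 0 (no shared terms) and 1 (identical term distributions), so the sorted output could feed citation ranking directly. Raw counts favor common words, so a weighting scheme may be a better fit once we have real source documents to test against.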