Open Trogluddite opened 10 months ago
This is considered a stretch goal for the CS-480 project, but it's important in the longer term.
We need a way to understand how similar the Markov-produced text is to the source documents.
This will be useful to measure the quality of the results our bot produces, and will probably be useful for citation ranking.
I propose that we adopt a cosine similarity approach, in which the Markov-produced result is compared to each of a set of source documents. An example can be found in the loombreaker storyboard, under 'Example Cosine Similarity Comparison': https://github.com/Trogluddite/loombreaker/blob/main/storyboards/component_storyboard_description.md
An article about cosine similarity with some Python examples can be found here: https://towardsdatascience.com/a-complete-beginners-guide-to-document-similarity-algorithms-75c44035df90
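To make the proposal concrete, here is a rough sketch of what the comparison could look like. It is not taken from the storyboard or the linked article: it assumes a plain bag-of-words model with raw term counts (no TF-IDF weighting, no stop-word removal), and the names `tokenize`, `cosine_similarity`, and `rank_sources` are hypothetical, not existing loombreaker functions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the raw term-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only terms present in both texts contribute to the dot product.
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty text has no direction, so treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sources(generated, sources):
    """Score the generated text against each source document.

    `sources` is assumed to map a document name to its full text.
    Returns (name, score) pairs, most similar first.
    """
    scores = [
        (name, cosine_similarity(generated, body))
        for name, body in sources.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_sources(markov_output, source_docs)` would give one score per source document, which is the shape we would probably want for citation ranking. Raw counts let common words dominate the score, so a weighted variant may be worth trying once the basic version works.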