Document Similarity measurement: Cosine distance

We calculate a topic distribution for each document (1) and a topic distribution for the given input text with keywords (2). To recommend documents, we need to find the rows of the document/topic matrix (1) that are closest to the input distribution (2). We used cosine distance in Milestone_1_W_Relevant_Data and Milestone_1 to find similar documents.
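As a starting point for the matrix examples, here is a minimal sketch of the comparison. Cosine similarity between two vectors a and b is (a · b) / (|a| |b|), and cosine distance is 1 minus that value. The matrix `doc_topics` and the vector `query` are made-up illustrative values (3 documents, 4 topics), not data from our notebooks.

```python
import numpy as np

# Hypothetical document/topic matrix (1): 3 documents (rows) x 4 topics (columns).
doc_topics = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])

# Hypothetical topic distribution of the input text (2).
query = np.array([0.6, 0.2, 0.1, 0.1])

# Dot product of the query with every row, divided by the product of the norms.
row_norms = np.linalg.norm(doc_topics, axis=1)
similarity = (doc_topics @ query) / (row_norms * np.linalg.norm(query))

# Cosine distance: 0 means same direction, larger means less similar.
distance = 1.0 - similarity

# The most similar document has the smallest distance (= largest similarity).
best = int(np.argmin(distance))
print(best, similarity, distance)
```

In this toy example document 0 is the closest match, because both it and the input put most of their weight on the first topic. The uniform document 2 ranks second and document 1 last.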

Document how this logic works with clear matrix examples:

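import numpy as np

def produce_rec(top_vec, topic_array, doc_topic_df, rand = 15):
    # Perturb the input with uniform noise scaled by rand / ||top_vec|| (30 = number of topics)
    top_vec = top_vec + np.random.rand(30,)/(np.linalg.norm(top_vec)) * rand
    co_dists = compute_dists(top_vec, topic_array)
    # argmax is the best match only if compute_dists returns similarities (higher = closer)
    return doc_topic_df.loc[np.argmax(co_dists)]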
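`compute_dists` is not shown in this issue. The sketch below is one possible implementation, written under the assumption that it returns cosine similarity scores (higher = closer), which is what the `np.argmax` call in `produce_rec` implies. If it actually returned cosine distances, `np.argmin` would be needed instead. The example arrays are hypothetical.

```python
import numpy as np

def compute_dists(top_vec, topic_array):
    """Assumed behaviour: cosine similarity of top_vec with each row of topic_array."""
    dots = topic_array @ top_vec                      # one dot product per document
    norms = np.linalg.norm(topic_array, axis=1) * np.linalg.norm(top_vec)
    return dots / norms                               # shape: (n_documents,)

# Hypothetical 3 documents x 3 topics matrix and input vector.
topic_array = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])
top_vec = np.array([0.2, 0.7, 0.1])

scores = compute_dists(top_vec, topic_array)
best_row = int(np.argmax(scores))  # position of the most similar document
print(best_row, scores)
```

Here the input is dominated by the second topic, so the second document (position 1) gets the highest score. Topic distributions sum to 1, so the norms are never zero and the division is safe for this kind of input.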

TODOS:

[ ] Document how cosine distance works with clear matrix examples
[ ] Research other similarity measures that could be used in our structure to make a recommendation
