TUM-IDP-WS-20 / doc


Document Similarity measurement: Cosine distance #23

Open farukcankaya opened 3 years ago

farukcankaya commented 3 years ago

We calculate a topic distribution for each document (1) and a topic distribution for the given input text with keywords (2). To find similar documents, we compare the input distribution (2) against each row of the document/topic distribution matrix (1). In Milestone_1_W_Relevant_Data and Milestone_1 we used cosine distance for this comparison.

Document how this logic works with clear matrix examples:

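import numpy as np

def produce_rec(top_vec, topic_array, doc_topic_df, rand=15):
    # Perturb the input topic vector with uniform noise (length 30, so 30 topics are assumed), scaled by rand / ||top_vec||.
    top_vec = top_vec + np.random.rand(30,) / np.linalg.norm(top_vec) * rand
    # compute_dists is not shown here; argmax picks the closest document only if it returns similarities, not distances.
    co_dists = compute_dists(top_vec, topic_array)
    return doc_topic_df.loc[np.argmax(co_dists)]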
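
As a starting point for the requested matrix example, here is a minimal sketch of the comparison step. It is not the code from the milestone notebooks: `cosine_similarities` is a hypothetical helper, and the 3 documents x 4 topics matrix and the query vector are made-up values chosen only for illustration.

```python
import numpy as np

def cosine_similarities(query, doc_topic):
    """Cosine similarity between one topic vector and every row of doc_topic."""
    query = np.asarray(query, dtype=float)
    doc_topic = np.asarray(doc_topic, dtype=float)
    dots = doc_topic @ query  # one dot product per document row
    norms = np.linalg.norm(doc_topic, axis=1) * np.linalg.norm(query)
    return dots / norms

# (1) Hypothetical document/topic matrix: rows are documents, columns are topics.
doc_topic = np.array([
    [0.70, 0.10, 0.10, 0.10],  # doc 0: mostly topic 0
    [0.10, 0.70, 0.10, 0.10],  # doc 1: mostly topic 1
    [0.25, 0.25, 0.25, 0.25],  # doc 2: uniform over topics
])

# (2) Hypothetical topic distribution of the input text.
query = np.array([0.60, 0.20, 0.10, 0.10])

sims = cosine_similarities(query, doc_topic)  # roughly [0.98, 0.47, 0.77]
dists = 1.0 - sims                            # cosine distance = 1 - cosine similarity

best = int(np.argmax(sims))  # most similar document; same index as np.argmin(dists)
print(best, sims.round(2))
```

Note the direction of the comparison: the most similar document has the largest similarity but the smallest distance. The `np.argmax(co_dists)` call in `produce_rec` therefore returns the closest document only if `compute_dists` actually returns similarities.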

TODOS:

farukcankaya commented 3 years ago

Ways to find similarity between documents: