PatentsView / PatentsView-Disambiguation


Missing input in \pv\disambiguation\assignee\run_clustering.py #3

Closed by Markhzz 3 years ago

Markhzz commented 3 years ago

Hi Monath,

I'm sorry to bother you! I'm a beginner trying to learn your disambiguation program, and I noticed that \pv\disambiguation\assignee\run_clustering.py expects an input file, 'data/assignee/permid/permid_vectorizer.pkl', that is missing. This file is then used in model.py:

    name_tfidf = SKLearnVectorizerFeatures(flgs.assignee_name_model,
                                           'name_tfidf',
                                           lambda x: clean(split(x.normalized_most_frequent)))

Would you mind sharing this file, or describing what it contains? I'm sorry if my question is a bit naive. Thank you so much for your help!

Best, Mark

Markhzz commented 3 years ago

Hi Monath,

Would you mind also describing the data set permid_entity_info.pkl and why it is used for must_not_links? Thank you so much!!

Best, Mark

nmonath commented 3 years ago

Hi Mark, sorry for the late reply!

The file you are looking for is available in the resources folder of the save_states branch (sorry that this isn't on main yet): https://github.com/PatentsView/PatentsView-Disambiguation/tree/save_states. The file is called permid_vectorizer.pkl. It is a scikit-learn TF-IDF vectorizer for strings.
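For readers following along, here is a minimal sketch of how a pickled scikit-learn TF-IDF vectorizer is typically loaded and applied to strings. The settings of the real permid_vectorizer.pkl are not stated in this thread, so the character n-gram configuration and the toy names below are assumptions; the sketch builds a stand-in vectorizer in memory instead of reading the actual file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for permid_vectorizer.pkl. The analyzer and n-gram range are
# assumptions for illustration, not the settings of the real file.
toy_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
toy_vectorizer.fit([
    "international business machines",
    "ibm corp",
    "general electric co",
])
blob = pickle.dumps(toy_vectorizer)

# With the real file on disk, loading would look like:
#   with open("data/assignee/permid/permid_vectorizer.pkl", "rb") as f:
#       vectorizer = pickle.load(f)
vectorizer = pickle.loads(blob)

# transform() maps each string to a sparse TF-IDF row vector.
features = vectorizer.transform(["ibm corporation", "general electric"])
print(features.shape)
```

Unpickling a vectorizer generally works best with the same scikit-learn version that was used to save it, so a version mismatch is worth checking if loading the real file fails or emits warnings.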

permid_entity_info.pkl contains preprocessed information about each PermID entity. The must-not-link constraints say that if assignee name X is linked to PermID 123 and assignee name Y is linked to PermID 555, then X cannot be clustered with Y. Note that not every assignee name is linked to a PermID; we do the linking with a high-precision rule.
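The constraint described above can be sketched in a few lines. This is an illustration only, not the repository's implementation: the `name_to_permid` mapping and the `must_not_link` helper are hypothetical names, and treating unlinked names as unconstrained is an assumption based on the note that not every name is linked.

```python
from typing import Dict

# Hypothetical high-precision links from assignee name to PermID.
# Names missing from this mapping have no PermID link.
name_to_permid: Dict[str, str] = {
    "X": "123",
    "X Inc": "123",
    "Y": "555",
}


def must_not_link(a: str, b: str, links: Dict[str, str]) -> bool:
    """Return True if both names are linked and their PermIDs differ."""
    permid_a = links.get(a)
    permid_b = links.get(b)
    # Assumption: a name without a PermID link imposes no constraint.
    if permid_a is None or permid_b is None:
        return False
    return permid_a != permid_b


print(must_not_link("X", "Y", name_to_permid))      # different PermIDs
print(must_not_link("X", "X Inc", name_to_permid))  # same PermID
print(must_not_link("X", "Z", name_to_permid))      # "Z" is unlinked
```

A clustering step could call such a check before merging two groups and skip any merge that would put a constrained pair in the same cluster.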

Hope this helps!

Markhzz commented 3 years ago

Hi Monath,

Thank you so much for your detailed explanation!! That's really helpful!!

Have a great day!!

Best, Mark