Hi Monath,
Would you mind also describing the data set: permid_entity_info.pkl and why it is used for must_not_links? Thank you so much!!
Best, Mark
Hi Mark, sorry for the late reply!
The file you are looking for is called permid_vectorizer.pkl. It is available in the resources folder of the save_states branch (sorry that this isn't on main yet): https://github.com/PatentsView/PatentsView-Disambiguation/tree/save_states. It is a pickled scikit-learn TF-IDF vectorizer for strings.
permid_entity_info.pkl contains preprocessed information about each PermID entity. The must-not-link constraints say that if assignee name X is linked to PermID 123 and assignee name Y is linked to PermID 555, then X cannot be clustered with Y. Note that not every assignee name is linked to a PermID; we do the linking with a high-precision rule, so only confident matches produce constraints.
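In case it helps, here is a minimal sketch of how a pickled TF-IDF vectorizer of this kind can be loaded and applied to a string. It is not the code from the repo: the toy names and the character n-gram settings are assumptions made only so the example runs without the real file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the real file: fit a small vectorizer and pickle it in memory.
# The analyzer and n-gram range are assumptions, not the settings used in the repo.
toy_names = ["acme corp", "acme corporation", "globex inc"]
fitted = TfidfVectorizer(analyzer="char", ngram_range=(1, 3)).fit(toy_names)
blob = pickle.dumps(fitted)

# Loading from disk works the same way, for example:
# with open("data/assignee/permid/permid_vectorizer.pkl", "rb") as f:
#     vectorizer = pickle.load(f)
vectorizer = pickle.loads(blob)

# transform() takes a list of strings and returns a sparse matrix,
# one row per input string and one column per learned n-gram.
features = vectorizer.transform(["acme corp."])
print(features.shape)
```

Unpickling generally works best with the same scikit-learn version that was used to create the file.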
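To make the rule concrete, here is a small sketch of the check. It is only an illustration: the function `must_not_link` and the dictionary `name_to_permid` are hypothetical names, not identifiers from the repo, and the real linking rule is not shown.

```python
from typing import Dict


def must_not_link(name_x: str, name_y: str, name_to_permid: Dict[str, str]) -> bool:
    """Return True if both names are linked to PermIDs and the PermIDs differ."""
    pid_x = name_to_permid.get(name_x)
    pid_y = name_to_permid.get(name_y)
    # A name without a PermID link imposes no constraint.
    return pid_x is not None and pid_y is not None and pid_x != pid_y


# Hypothetical links; names that were not confidently linked are simply absent.
name_to_permid = {"X": "123", "Y": "555", "X Inc": "123"}

print(must_not_link("X", "Y", name_to_permid))      # different PermIDs
print(must_not_link("X", "X Inc", name_to_permid))  # same PermID
print(must_not_link("X", "Z", name_to_permid))      # Z has no PermID link
```

Only the first pair is blocked from being clustered together; the other two pairs are left for the clustering model to decide.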
Hope this helps!
Hi Monath,
Thank you so much for your detailed explanation!! That's really helpful!!
Have a great day!!
Best, Mark
Hi Monath,
I'm sorry to bother you! I'm a beginner trying to learn your disambiguation program, and I noticed that the code in \pv\disambiguation\assignee\run_clustering.py expects an input that is missing: 'data/assignee/permid/permid_vectorizer.pkl'. This input is then used in model.py.
Would you mind sharing this file, or describing what it contains? I'm sorry if my question is a bit naive. Thank you so much for your help!
Best, Mark