Open VPetukhov opened 3 years ago
I think you flipped `term_citations` and `term_occurrences` in your bulleted list, but otherwise, your understanding is correct.
And to answer your question, the "title" part of the file name refers to where the candidate phrases came from: every candidate phrase appears in at least one title somewhere in the corpus. The scores themselves are computed using occurrences in the title, abstract, or body. So, in your example, `elastic weight consolidation` occurs in a title and is therefore considered for scoring, but "occurrence" for the purpose of computing the score means that `elastic weight consolidation` occurs anywhere in the title, abstract, or body of a given paper. The idea here is that most important concepts end up mentioned in some title, but not necessarily in the title of the paper that introduced the concept.
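To make the distinction concrete, here is a minimal sketch with toy data. It assumes each `*_nps.json` file maps a noun phrase to the ids of the papers that mention it, as described in the notes in this thread. The helper names `candidate_phrases` and `occurrence_papers` and the paper ids are made up for illustration and are not taken from the repository's code.

```python
# Toy stand-ins for title_nps.json, abstract_nps.json and body_nps.json.
# Assumed layout: noun phrase -> list of paper ids (the ids are invented).
title_nps = {"elastic weight consolidation": ["paper_b"]}
abstract_nps = {"elastic weight consolidation": ["paper_b", "paper_c"]}
body_nps = {
    "elastic weight consolidation": ["paper_a", "paper_b"],
    "some body-only phrase": ["paper_a"],
}


def candidate_phrases(title_nps):
    """Hypothetical helper: phrases seen in at least one title are candidates."""
    return set(title_nps)


def occurrence_papers(phrase, *nps_dicts):
    """Hypothetical helper: papers where the phrase occurs in any given section."""
    papers = set()
    for nps in nps_dicts:
        papers.update(nps.get(phrase, []))
    return papers


candidates = candidate_phrases(title_nps)
occurrences = {
    phrase: occurrence_papers(phrase, title_nps, abstract_nps, body_nps)
    for phrase in candidates
}
print(occurrences)
```

In this toy corpus, `paper_a` mentions the phrase only in its body yet still counts as an occurrence, while the body-only phrase is never scored because it appears in no title.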
I'll leave this open for now, as I agree that the files should be documented in the readme.
Thanks for the explanations and a quick answer!
Hi, thanks for the interesting work! I was reusing the results, and it took quite some time to understand the format. Could you please add some description of the files to the README? I gathered some notes myself (below), so maybe just posting them would be useful for others.
The only part I'm still puzzled about is the `title_citation_scores.json` file. Based on the name and the code, I assumed that it covers only scientific concepts present in titles. However, the 100th element of `title_citation_scores.json` is 'elastic weight consolidation', and its top article is 20e9d860d95531772987f0e34043f543a1953b92 (arXiv id 1612.00796). That paper is titled "Overcoming catastrophic forgetting in neural networks", and 'elastic weight consolidation' is present only in its body. Indeed, checking `title_nps.json` for all articles from the 100th cluster finds only one, 042be44eb9b09bc5219d3d86a4052d019ddaf390 (1712.03847). So is it right that `title_citation_scores.json` contains the full-body citation scores? The code to check it is:
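A check of this kind can be sketched roughly as follows. This is a minimal illustration, not the snippet originally posted. It assumes `title_nps.json` maps each noun phrase to the ids of papers whose titles contain it, and it uses a small in-memory stand-in instead of the real file. The function name `papers_with_phrase_in_title` and the variables `concept_phrases` and `concept_paper_ids` are hypothetical.

```python
# Toy stand-in for title_nps.json. Assumed layout: noun phrase -> paper ids.
title_nps = {
    "elastic weight consolidation": [
        "042be44eb9b09bc5219d3d86a4052d019ddaf390",
    ],
}

# Hypothetical inputs: the noun phrases of one concept and the ids of the
# papers listed for that concept in title_citation_scores.json.
concept_phrases = ["elastic weight consolidation"]
concept_paper_ids = [
    "20e9d860d95531772987f0e34043f543a1953b92",
    "042be44eb9b09bc5219d3d86a4052d019ddaf390",
]


def papers_with_phrase_in_title(title_nps, phrases, paper_ids):
    """Return the papers from paper_ids whose title contains any of the phrases."""
    in_title = set()
    for phrase in phrases:
        in_title.update(title_nps.get(phrase, []))
    return [paper_id for paper_id in paper_ids if paper_id in in_title]


matches = papers_with_phrase_in_title(title_nps, concept_phrases, concept_paper_ids)
print(matches)
```

With the real files, the dictionary would be loaded with `json.load` instead of being written inline.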
My description of the files:
- `body_nps.json` is a dictionary of noun phrases, which stores the ids of the papers that mention them
- `abstract_nps.json` and `title_nps.json` have the same info, but extracted from abstracts and titles, respectively
- `arxiv_to_s2_mapping.json` allows getting arXiv ids from s2 ids, which are stored in `body_nps.json`
- `normalization.json` has all possible variants for each of the noun phrases
- `s2_id_to_citing_ids.json` has info on which papers cited the given one
- `s2_id_to_references.json` has the list of reference ids per paper
- `title_citation_scores.json` is a list of scientific concepts and info on them in the following format:
  - `[0]`: list of noun phrases belonging to the concept
  - `[1]`: maximal score of the concept across all papers
  - `[2]`: table of info about papers that use this concept, sorted by score descending. Rows have:
    - `term_occurrences`
    - `term_citations`
    - `log(term_citations + 1) * (term_citations / term_occurrences)`
- `title_cnlc_scores.json` and `title_loor_scores.json` are the entities and scores obtained with other methods (`cnlc` and `loor`)
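As a worked illustration of the score expression in the list above, here is a small sketch. It assumes the natural logarithm (the base is not stated in the notes) and treats zero occurrences as a zero score, which is a choice made only for this example. The function name `citation_score` is hypothetical.

```python
import math


def citation_score(term_citations, term_occurrences):
    """Compute log(term_citations + 1) * (term_citations / term_occurrences)."""
    if term_occurrences == 0:
        # Assumption for this sketch: no occurrences means a score of zero.
        return 0.0
    # Assumption: natural logarithm; the notes do not state the base.
    return math.log(term_citations + 1) * (term_citations / term_occurrences)


# Example: 9 citations out of 10 occurrences.
print(citation_score(term_citations=9, term_occurrences=10))
```

The ratio rewards terms whose occurrences mostly come with a citation, while the log factor grows slowly with the raw citation count.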