Documenting output format

VPetukhov commented 3 years ago

Hi, thanks for the interesting work! I was reusing the results, and it took quite some time to understand the format. Could you please add some description of the files to the README? I gathered some notes myself (below), so maybe just posting them would be useful for others.

The only part I'm still puzzled by is the title_citation_scores.json file. Based on the name and the code, I assumed that it contains only scientific concepts that are present in titles. However, the element at index 100 of title_citation_scores.json is 'elastic weight consolidation', and its top article is 20e9d860d95531772987f0e34043f543a1953b92 (arXiv id 1612.00796). The title of that paper is "Overcoming catastrophic forgetting in neural networks", and 'elastic weight consolidation' appears only in the body. Indeed, looking up the noun phrases of that cluster in title_nps.json finds only one paper, 042be44eb9b09bc5219d3d86a4052d019ddaf390 (1712.03847). So is it right that title_citation_scores.json contains the full-body citation scores?

The code to check it is:

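import json

# Noun phrases found in titles (noun phrase -> ids of papers)
with open("./title_nps.json") as f:
    title_nps = json.load(f)

# Scored concepts
with open("./title_citation_scores.json") as f:
    title_citation_scores = json.load(f)

# The concept at index 100: [noun phrases, max score, per-paper rows]
print(title_citation_scores[100])

# Papers whose titles contain any noun phrase of this concept
print([title_nps[k] for k in title_citation_scores[100][0] if k in title_nps])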

My description of the files:

body_nps.json is a dictionary that maps each noun phrase to the ids of the papers that mention it
abstract_nps.json and title_nps.json have the same info, but extracted from abstracts and titles, respectively
arxiv_to_s2_mapping.json lets you get arxiv ids from the s2 ids that are stored in body_nps.json
normalization.json has all possible variants for each of the noun phrases
s2_id_to_citing_ids.json has info on what papers cited the given one
s2_id_to_references.json has the list of reference ids per paper
title_citation_scores.json is a list of scientific concepts and info on them in the following format:
- [0] list of noun phrases, belonging to the concept
- [1] maximal score of the concept across all papers
- [2] table of info about papers that use this concept, sorted by score descending. Rows have:
  - S2 paper id
  - number of sampled papers that cite this one and have the same concept (term_occurrences)
  - number of papers that have the same concept (term_citations)
  - total score (log(term_citations + 1) * (term_citations / term_occurrences))
title_cnlc_scores.json and title_loor_scores.json contain the entities and scores obtained with the other methods (cnlc and loor)

dakinggg commented 3 years ago

I think you flipped term_citations and term_occurrences in your bulleted list. It should be:

- S2 paper id
- number of sampled papers that cite this one and have the same concept (term_citations)
- number of papers that have the same concept (term_occurrences)
- total score (log(term_citations + 1) * (term_citations / term_occurrences))

but otherwise, your understanding is correct.
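
For reference, the corrected row layout and the score can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names `PaperRow`, `forecite_score` and `parse_row` are hypothetical, the natural logarithm is assumed (the base is not stated in this thread), the zero-occurrence guard is an addition for safety, and the example row is made up rather than taken from the released files.

```python
import math
from typing import NamedTuple


class PaperRow(NamedTuple):
    """One row of the per-concept paper table, in the corrected column order."""
    s2_id: str
    term_citations: int    # sampled citing papers that also contain the concept
    term_occurrences: int  # papers that contain the concept
    score: float


def forecite_score(term_citations: int, term_occurrences: int) -> float:
    """log(term_citations + 1) * (term_citations / term_occurrences).

    Assumption: natural log. The guard for zero occurrences is not from
    the thread; it only avoids a division by zero in this sketch.
    """
    if term_occurrences == 0:
        return 0.0
    return math.log(term_citations + 1) * (term_citations / term_occurrences)


def parse_row(row) -> PaperRow:
    """Unpack a 4-element row: [S2 id, term_citations, term_occurrences, score]."""
    s2_id, term_citations, term_occurrences, score = row
    return PaperRow(s2_id, term_citations, term_occurrences, score)


# Made-up row, for illustration only
example = parse_row(["hypothetical_s2_id", 30, 40, forecite_score(30, 40)])
print(example.term_citations / example.term_occurrences)  # 0.75
```

Recomputing the score from the two counts this way is a quick check that the columns are being read in the intended order.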

And to answer your question, the "title" part of the file name refers to where the candidate phrases came from. So the candidate phrases all appear in at least one title somewhere in the corpus. The scores themselves are computed using occurrence in the title, abstract, or body. So, for your example, elastic weight consolidation occurs in a title and is thus considered for scoring, but "occurrence" for the purpose of computing the score is about whether elastic weight consolidation occurs anywhere in the title/abstract/body of each paper. The idea here is that most important concepts end up mentioned in some title, but not necessarily the title of the paper that introduced the concept.
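
A minimal sketch of that distinction, assuming each of the three *_nps.json files maps a noun phrase to a list of paper ids (as described earlier in this thread). The function name `occurrences_for_title_candidates` and the toy dictionaries are hypothetical and only illustrate the idea; this is not the ForeCite code itself.

```python
def occurrences_for_title_candidates(title_nps, abstract_nps, body_nps):
    """For every phrase that appears in at least one title, collect the ids
    of all papers that mention it in the title, abstract, or body."""
    occurrences = {}
    for phrase in title_nps:  # candidates come from titles only
        paper_ids = set(title_nps[phrase])
        paper_ids.update(abstract_nps.get(phrase, []))
        paper_ids.update(body_nps.get(phrase, []))
        occurrences[phrase] = paper_ids
    return occurrences


# Toy data (made up): paper_a mentions the phrase only in its body
title_nps = {"elastic weight consolidation": ["paper_b"]}
abstract_nps = {
    "elastic weight consolidation": ["paper_b"],
    "some other phrase": ["paper_c"],  # never in a title, so not a candidate
}
body_nps = {"elastic weight consolidation": ["paper_a", "paper_b"]}

result = occurrences_for_title_candidates(title_nps, abstract_nps, body_nps)
print(sorted(result["elastic weight consolidation"]))  # ['paper_a', 'paper_b']
```

In the toy data, paper_a counts as an occurrence even though the phrase is missing from its own title, which mirrors the elastic weight consolidation example above.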

I'll leave this open for now, as I agree that the files should be documented in the readme.

VPetukhov commented 3 years ago

Thanks for the explanations and a quick answer!

allenai / ForeCite

Documenting output format #10