allenai / ForeCite

Apache License 2.0

Documenting output format #10

Open VPetukhov opened 3 years ago

VPetukhov commented 3 years ago

Hi, thanks for the interesting work! I was reusing the results, and it took quite some time to understand the format. Could you please add some description of the files to the README? I gathered some notes myself (below), so maybe just posting them would be useful for others.

The only part I'm still puzzled about is the title_citation_scores.json file. Based on the name and the code, I assumed that it contains only scientific concepts that are present in titles. However, the element at index 100 of title_citation_scores.json is 'elastic weight consolidation', and its top article is 20e9d860d95531772987f0e34043f543a1953b92 (arXiv ID 1612.00796). That article's title is "Overcoming catastrophic forgetting in neural networks", and 'elastic weight consolidation' is present only in the body. Indeed, looking up all articles from that cluster in title_nps.json finds only one, 042be44eb9b09bc5219d3d86a4052d019ddaf390 (1712.03847). So is it right that title_citation_scores.json contains the full-body citation scores?

The code to check it is:

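import json

# Load the title noun phrases and the citation scores from the released files.
with open("./title_nps.json") as f:
    title_nps = json.load(f)

with open("./title_citation_scores.json") as f:
    title_citation_scores = json.load(f)

# Inspect the entry at index 100.
print(title_citation_scores[100])

# Look up the title noun phrases for each key of that entry found in title_nps.
print([title_nps[k] for k in title_citation_scores[100][0] if k in title_nps])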

My description of the files:

dakinggg commented 3 years ago

I think you flipped term_citations and term_occurrences in your bulleted list; it should be

but otherwise, your understanding is correct.

And to answer your question, the "title" part of the file name refers to where the candidate phrases came from: every candidate phrase appears in at least one title somewhere in the corpus. The scores themselves are computed using occurrences in the title, abstract, or body. So, for your example, elastic weight consolidation occurs in a title and is therefore considered for scoring, but "occurrence" for the purpose of computing the score means that elastic weight consolidation occurs anywhere in the title/abstract/body of a given paper. The idea here is that most important concepts end up mentioned in some title, but not necessarily in the title of the paper that introduced the concept.
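
To make the distinction concrete, here is a minimal sketch of the two steps on a toy corpus. Everything in it is an assumption for illustration: the paper IDs, the texts, the hard-coded phrase list, and the helper names `title_candidates` and `papers_mentioning` are made up, and plain substring matching stands in for real noun phrase extraction. It is not the ForeCite code and it does not compute the actual score, only which phrases become candidates and which papers count as mentioning them.

```python
# Toy corpus: all paper IDs and texts are made up for illustration.
papers = {
    "paper_a": {
        "title": "Overcoming forgetting in neural networks",
        "abstract": "We propose elastic weight consolidation.",
        "body": "Elastic weight consolidation slows learning on important weights.",
    },
    "paper_b": {
        "title": "A study of elastic weight consolidation",
        "abstract": "We analyse a recent method.",
        "body": "The method is compared with dropout.",
    },
    "paper_c": {
        "title": "Image classification at scale",
        "abstract": "We train large models.",
        "body": "Gradient clipping is used throughout.",
    },
}

# Hypothetical phrase list; a real pipeline would extract noun phrases instead.
phrases = ["elastic weight consolidation", "gradient clipping"]

FIELDS = ("title", "abstract", "body")


def occurs_in(phrase, text):
    """Case-insensitive substring match (a stand-in for phrase matching)."""
    return phrase.lower() in text.lower()


def title_candidates(papers, phrases):
    """Keep phrases that appear in at least one title anywhere in the corpus."""
    return [
        phrase
        for phrase in phrases
        if any(occurs_in(phrase, paper["title"]) for paper in papers.values())
    ]


def papers_mentioning(papers, phrase):
    """IDs of papers where the phrase occurs in the title, abstract, or body."""
    return sorted(
        paper_id
        for paper_id, paper in papers.items()
        if any(occurs_in(phrase, paper[field]) for field in FIELDS)
    )


candidates = title_candidates(papers, phrases)
mentions = {phrase: papers_mentioning(papers, phrase) for phrase in candidates}

print(candidates)
print(mentions)
```

In this toy setup, "gradient clipping" is dropped because it never appears in a title, while "elastic weight consolidation" is kept because of paper_b's title, and paper_a still counts as mentioning it even though the phrase is only in its abstract and body.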

I'll leave this open for now, as I agree that the files should be documented in the readme.

VPetukhov commented 3 years ago

Thanks for the explanations and a quick answer!