Incorrect number of references identified on paper retrieval

serenalotreck commented 1 year ago

Describe the bug

Some papers with many references are returned as having none when the paper is obtained through the API.

To Reproduce

import requests

r = requests.post(
    'https://api.semanticscholar.org/graph/v1/paper/batch',
    params={'fields': 'referenceCount,citationCount,title'},
    json={"ids": ["c9695c29c051499f52754f4657d9d559b9898d1a"]}
).json()

r[0]['referenceCount'] is 0, even though the paper has 150 references. (The batch endpoint returns a list, so the paper is the first element of r.)

Expected behavior

r[0]['referenceCount'] should be equal to 150.

Additional context

I've noticed that the reference count is sometimes too high for other papers, which seems to be the result of parts of a single citation getting broken up into multiple references. However, these usually don't have paperIds associated with them, so it's easy to get rid of them. In contrast, this issue is more problematic but anecdotally less pervasive.
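A minimal sketch of that filtering step, assuming each entry in a paper's 'references' list is a dict whose 'paperId' is None or absent for the fragmentary entries. The helper name drop_unlinked_references and the sample data are hypothetical, not part of the API.

```python
def drop_unlinked_references(references):
    """Keep only reference entries that carry a non-empty paperId."""
    return [ref for ref in references if ref.get("paperId")]


# Toy data shaped like the 'references' field; the values are made up.
sample = [
    {"paperId": "abc123", "title": "A complete reference"},
    {"paperId": None, "title": "fragment of a citation"},
    {"title": "entry with no paperId key"},
]
kept = drop_unlinked_references(sample)
print(len(kept))
```

Note that this only removes the extra fragments; it cannot recover references that are missing altogether.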

cfiorelli commented 1 year ago

@serenalotreck Sorry a bit behind here. Could you screenshot/show me where you see the 150 ref count? Thank you!

serenalotreck commented 1 year ago

From the PDF of the paper, the references section:

cfiorelli commented 1 year ago

@serenalotreck Got it, thank you!

cfiorelli commented 1 year ago

@serenalotreck The paper detail page (pdp) has references now.

serenalotreck commented 1 year ago

@cfiorelli Thank you! Was this specifically fixed for this paper only, or was there some more general fix implemented?

cfiorelli commented 1 year ago

@serenalotreck The fix was specific to this paper, as the impact of the underlying cause is understood to be very limited.

serenalotreck commented 1 year ago

Hi @cfiorelli,

I think this issue may be more pervasive, so I would like to reopen this issue (let me know if I should be opening a new one).

I was doing some quality checks on my dataset. One of these was to make sure that the nodes in the citation network I created with the dataset are truly unique, and that nodes that should overlap become one node (i.e., when two papers cite the same paper). To do this, I chose two papers that I knew had many overlapping references, and manually extracted the titles of the references that both papers have in common. One of the papers was the one mentioned in this issue, and the other has paperId = 393cc126bd647a8435072e788a2a033561c6fa97.

There are three problems going on here that I think may be related.

First, in the paper mentioned in this issue, there are more references than there should be, because one or more citations are being incorrectly broken up into separate references with no paperId; there are 159 references in the 'references' attribute, but only 150 in the paper. This issue is less problematic, as I can just filter out anything that doesn't have a paperId; however, it seems to happen all throughout my dataset, so I wanted to mention it in case it's related.

Second, in the paper mentioned in this issue, there are many missing references still; I think the erroneous partial references are filling in the numbers to get to 159.

Third, in the other paper, there is a massive number of missing references. The object I retrieve from the API only has 14 references, while the paper has almost 3 full pages of references (they aren't numbered, and I haven't manually counted them to get an exact number).
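To put rough numbers on the three problems above for a single paper, something like the following could be used. This is only a sketch: reference_summary and its expected_count parameter are hypothetical names, the expected count has to come from manually reading the published paper, and the toy data is made up.

```python
def reference_summary(paper, expected_count):
    """Compare the API's reference list with a manually determined count.

    expected_count is the number read off the published paper; it is
    supplied by the caller, not returned by the API.
    """
    refs = paper.get("references") or []
    with_id = sum(1 for ref in refs if ref.get("paperId"))
    return {
        "listed": len(refs),                    # entries in 'references'
        "with_paper_id": with_id,               # entries linked to a paper
        "without_paper_id": len(refs) - with_id,  # likely fragments
        "shortfall": expected_count - with_id,  # linked entries still missing
    }


# Toy paper object; the entries are made up for illustration.
toy_paper = {
    "references": [
        {"paperId": "p1", "title": "Linked reference"},
        {"paperId": None, "title": "Fragment"},
        {"paperId": "p2", "title": "Another linked reference"},
    ]
}
summary = reference_summary(toy_paper, expected_count=4)
print(summary)
```

A positive shortfall together with a nonzero without_paper_id would match the pattern described for the first paper, where fragments pad the total while real references are missing.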

Here is the code that can be used to reproduce the issue. I have manually checked for some of the missing titles and am reasonably confident that there is not a bug in this code re: references missing from the first paper.

import pandas as pd
import requests

r = requests.post(
    'https://api.semanticscholar.org/graph/v1/paper/batch',
    params={'fields': 'title,abstract,references'},
    json={"ids": ["c9695c29c051499f52754f4657d9d559b9898d1a", "393cc126bd647a8435072e788a2a033561c6fa97"]}
).json()
paper1, paper2 = r[0], r[1]

reference_titles = [
    'Drying without dying',
    'Physiological aspects of desiccation tolerance',
    'A footprint of desiccation tolerance in the genome of Xerophyta viscosa',
    'Desiccation tolerance in the vegetative tissues of the fern Mohria caffrorum is seasonally regulated',
    'The evolution of desiccation tolerance in angiosperm plants: a rare yet common phenomenon',
    'Role of ABA and ABI3 in desiccation tolerance',
    'Transcriptional and metabolic changes in the desiccation tolerant plant Craterostigma plantagineum during recurrent exposures to dehydration',
    'A systems‐based molecular biology analysis of resurrection plants for crop and forage improvement in arid environments',
    'A sister group contrast using untargeted global metabolomic analysis delineates the biochemical regulation underlying desiccation tolerance in Sporobolus stapfianus',
    '“To dryness and beyond” – preparation for the dried state and rehydration in vegetative desiccation-tolerant plants',
    'Desiccation tolerance in bryophytes: a reflection of the primitive strategy for plant survival in dehydrating habitats?',
    'Desiccation-tolerance in bryophytes: a review',
    'Ecology of desiccation tolerance in bryophytes: a conceptual framework and methodology',
    'Leaf metabolite profile of the Brazilian resurrection plant Barbacenia purpurea Hook',
    'Massive tandem proliferation of ELIPs supports convergent evolution of desiccation tolerance across land plants',
    'Desiccation tolerance evolved through gene duplication and network rewiring in Lindernia',
    'Trehalose accumulation triggers autophagy during plant desiccation',
    'Sporobolus stapfianus: insights into desiccation tolerance in the resurrection grasses from linking transcriptomics to metabolomics',
    'Comparative metabolic profiling between desiccation‐sensitive and desiccation‐tolerant species of Selaginella reveals insights into the resurrection trait',
    'Molecular responses to dehydration and desiccation in desiccation-tolerant angiosperm plants',
    'Global transcriptome analysis reveals acclimation-primed processes involved in the acquisition of desiccation tolerance in Boea hygrometrica'
]

# Look at the extraneous titles present in paper 1
p1_ref_titles = [p['title'] for p in paper1['references']]
print(p1_ref_titles) ## Examples of bad title: 'Intertwined signatures of desiccaw', '2014.Klebsormidium flaccidum genome'

# Show that the number of references is incorrect
print(len(paper1['references'])) ## 159
print(len(paper2['references'])) ## 14

# Show that the shared references are not present in paper 2 and many are missing from paper 1
# Defensive: treat a missing or null title as an empty string
p1_ref_titles = [p.get('title') or '' for p in paper1['references']]
p2_ref_titles = [p.get('title') or '' for p in paper2['references']]
# 1 if the manually confirmed title is a substring of any API title, else 0.
# any() also handles an empty reference list, which the nested loops did not.
p1_pa = [int(any(ref_title in t for t in p1_ref_titles)) for ref_title in reference_titles]
p2_pa = [int(any(ref_title in t for t in p2_ref_titles)) for ref_title in reference_titles]
title_df = pd.DataFrame({'title': reference_titles, 'paper1_pres_abs': p1_pa, 'paper2_pres_abs': p2_pa})
print(title_df) ## The paper 1 column has many 0s, the paper 2 column is all 0s
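Several of the titles in reference_titles contain typographic characters (for example '‐', '–', and curly quotes), and an exact substring test is also case-sensitive. Whether that accounts for any of the 0s has not been checked here. The sketch below shows one way to rule it out by normalizing both sides before comparing; normalize_title and title_present are hypothetical helper names.

```python
import unicodedata


def normalize_title(title):
    """Lower-case a title and reduce it to space-separated alphanumeric words."""
    if not title:  # covers a missing or null title
        return ""
    text = unicodedata.normalize("NFKC", title).lower()
    # Replace every non-alphanumeric character (hyphens, quotes, ...) with a space.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text)
    return " ".join(cleaned.split())


def title_present(ref_title, api_titles):
    """Return True if the normalized ref_title occurs in any normalized API title."""
    needle = normalize_title(ref_title)
    # An empty needle would match everything, so treat it as not present.
    return bool(needle) and any(needle in normalize_title(t) for t in api_titles)
```

For example, title_present(ref_title, p1_ref_titles) could replace the `ref_title in auto_title` test. If the columns still contain the same 0s after normalization, formatting differences are not the explanation.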

I haven't done too much more manual verification of citation numbers, but I am very suspicious that this is a widespread problem, especially since I have been seeing the erroneous extra citations across the dataset. Looking forward to your thoughts.

Thanks!

EDIT: Just want to clarify that the reference_titles variable contains the titles that I have manually confirmed appear in the citations of both papers in their published forms.

cfiorelli commented 1 year ago

@serenalotreck - For the moment I'm reopening this. I will update once I get a chance to connect internally on this. Thanks for the update & details!

serenalotreck commented 1 year ago

Hi @cfiorelli, just wanted to follow up and see what the status is here!

cfiorelli commented 1 year ago

Thanks for pinging for an update. I don't see any issues with your analysis, and thank you for sharing the code example. I needed to review this on my side and am preparing to escalate again for additional internal support.

cfiorelli commented 1 year ago

@serenalotreck Confirming internal team is re-reviewing the report you provided. Thank you!

cfiorelli commented 11 months ago

@serenalotreck - After analysis we've found this data is working as intended but may have experienced some settling, and possibly improvements as a result of work that has been done since we originally discussed.

As far as the original paper goes, it looks fine now.

pdp shows 160: https://www.semanticscholar.org/paper/Desiccation-Tolerance%3A-Avoiding-Cellular-Damage-and-Oliver-Oliver/c9695c29c051499f52754f4657d9d559b9898d1a#cited-papers

public api shows 160:

$> curl 'https://api.semanticscholar.org/graph/v1/paper/c9695c29c051499f52754f4657d9d559b9898d1a?fields=references' -s | jq '.references|length'
160
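For anyone checking from Python rather than curl and jq, the same request can be sketched as follows. It uses only the standard library so it runs without extra packages; the function names and the timeout value are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.semanticscholar.org/graph/v1/paper/"


def count_references(paper):
    """Return the length of a paper object's 'references' list (0 if absent)."""
    return len(paper.get("references") or [])


def fetch_reference_count(paper_id, timeout=30):
    """Request one paper with fields=references and count the entries."""
    query = urllib.parse.urlencode({"fields": "references"})
    url = f"{BASE_URL}{urllib.parse.quote(paper_id)}?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        paper = json.load(response)
    return count_references(paper)
```

Calling fetch_reference_count('c9695c29c051499f52754f4657d9d559b9898d1a') issues the same request as the curl command. No output is shown here because the count can change over time, and it includes any entries without a paperId.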

As a result of this report we have uncovered some ideas to consider for 2024 improvements. Stay tuned!

allenai / s2-folks

Incorrect number of references identified on paper retrieval #142