allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

Annotations and Bibliography entry spans data integrity #156

Closed cfiorelli closed 7 months ago

cfiorelli commented 8 months ago

cc Jessica Lam slack report

Describe the bug

1) Many annotation spans in the bulk download (obtained on 11 Oct 2023) appear to be repeated. It's easy enough to fix on my end; I'm just wondering whether there's something else going on.

2) For many papers, the bibliography entry spans in the bulk download do not correspond to the references reported by the Papers API. For example, PaperId ea4de8e24447b3debfbe9e9c697ab2b66f6663b6 is referenced by doi:10.24940/theijst/2021/v9/i7/st2107-011 both in the PDF and according to the Papers API, but there is no bibliography entry span for the cited paper in the bulk download. Is this because the bibliography entry could not be detected?

To Reproduce

from collections import Counter
import glob
import json

filepath = glob.glob("data/s2orc/*")[0]
with open(filepath, 'r') as f:
    for i, line in enumerate(f):
        _s2orc = json.loads(line)
        for key, val in _s2orc['content']['annotations'].items():
            if isinstance(val, str):
                try:
                    val = json.loads(val)
                except json.JSONDecodeError:
                    # Leave values that are not JSON-encoded as plain strings.
                    pass

                _s2orc['content']['annotations'][key] = val

        print(i)
        print(_s2orc['corpusid'])
        print(_s2orc['externalids'])
        print()

        text = _s2orc['content']['text']
        for key in ['title', 'bibentry']:
            print(key)

            spans = _s2orc['content']['annotations'][key]
            if spans:
                span2count = Counter([(span['start'], span['end']) for span in spans])
                print(', '.join(f"{span}: {count}" for span, count in span2count.most_common()))

            print()
        print("-" * 150)

        if i == 10:
            break

Expected behavior

1) Annotation spans are distinct.
2) Bibliography entry spans correspond to the references reported by the Papers API.

Additional context

Example output from the script above:

0
262154195
{'arxiv': None, 'mag': '2909426167', 'acl': None, 'pubmed': '28989627', 'pubmedcentral': '5620991', 'dblp': None, 'doi': '10.1039/c7sc01347g'}

title
(1, 120): 3, (697, 816): 3

bibentry
(42224, 42444): 3

1
262155081
{'arxiv': None, 'mag': '2118671970', 'acl': None, 'pubmed': '12793882', 'pubmedcentral': '270690', 'dblp': None, 'doi': '10.1186/cc2196'}

title
(1, 56): 2, (867, 922): 2

bibentry

2
262157887
{'arxiv': None, 'mag': None, 'acl': None, 'pubmed': None, 'pubmedcentral': None, 'dblp': None, 'doi': '10.1177/20514158211013240'}

title
(1, 139): 12, (1060, 1198): 12

bibentry
(256119, 256428): 12, (256430, 258756): 12, (258758, 259129): 12

3
262214637
{'arxiv': None, 'mag': None, 'acl': None, 'pubmed': None, 'pubmedcentral': None, 'dblp': None, 'doi': '10.9771/rvh.v12i1.47857'}

title
(1, 175): 2, (295, 469): 2

bibentry
(26837, 27077): 2, (27079, 27343): 2, (27345, 27566): 2, (27568, 27989): 2, (27991, 28157): 2, (28159, 28376): 2, (28378, 28632): 2, (28634, 28780): 2, (28782, 29201): 2, (29203, 29389): 2, (29391, 29540): 2, (29542, 29609): 2, (29611, 29846): 2, (29848, 29963): 2, (29965, 30018): 2, (30020, 30468): 2, (30470, 30950): 2, (30952, 31123): 2, (31125, 31329): 2, (31331, 31615): 2, (31617, 31772): 2, (31774, 31940): 2, (31942, 32223): 2

4
262150823
{'arxiv': None, 'mag': None, 'acl': None, 'pubmed': None, 'pubmedcentral': None, 'dblp': None, 'doi': '10.1021/acs.orglett.1c03278.s001'}

title
(1, 96): 1, (391, 486): 1

bibentry
(15659, 16070): 1, (16072, 16405): 1, (16407, 16884): 1, (16886, 17195): 1, (17197, 17327): 1, (17329, 17431): 1, (17433, 17657): 1, (17659, 17883): 1, (17885, 18108): 1, (18110, 18130): 1, (18132, 18276): 1, (18278, 18435): 1, (18437, 18687): 1, (18689, 19010): 1, (19012, 19149): 1, (19151, 19464): 1, (19466, 19486): 1, (19488, 19745): 1, (19747, 19823): 1, (19825, 19870): 1, (19872, 19924): 1, (19926, 20130): 1, (20132, 20388): 1, (20390, 20563): 1, (20565, 20957): 1, (20959, 21188): 1, (21190, 21372): 1, (21374, 21578): 1

5
262151280
{'arxiv': None, 'mag': None, 'acl': None, 'pubmed': None, 'pubmedcentral': None, 'dblp': None, 'doi': '10.24940/theijst/2021/v9/i7/st2107-011'}

title
(1, 141): 2, (497, 637): 2

bibentry
(16806, 17225): 2, (17227, 17774): 2, (17776, 18148): 2, (18150, 18518): 2, (18520, 18856): 2, (18858, 19160): 2, (19162, 19370): 2, (19372, 19927): 2, (19929, 20246): 2, (20248, 20491): 2, (20493, 20658): 2, (20660, 21046): 2, (21048, 21301): 2, (21303, 21512): 2, (21514, 21723): 2, (21725, 22387): 2, (22389, 22627): 2

rodneykinney commented 8 months ago

It looks like a bug in our pipeline is producing these duplicate annotations. In a small sample, I found duplicates in 5% of papers, with annotations sometimes duplicated as many as 10 times. The annotations are otherwise correct, so the best course is to remove the duplicates in your own code. We can fix the bug, but it may take a while for the affected papers to be reprocessed, and we won't change any already-released datasets.
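
For anyone who wants to do that cleanup on their side, here is a minimal sketch. It is not an official fix: `dedupe_spans` is a hypothetical helper name, and it assumes each span is a dict with `start` and `end` keys (as in the reproduction script above) and that two spans with the same `(start, end)` pair are true duplicates, so any other fields on the later copies are dropped.

```python
def dedupe_spans(spans):
    """Drop repeated (start, end) spans, keeping the first occurrence in order.

    Assumption: spans sharing the same (start, end) pair are exact duplicates.
    """
    seen = set()
    unique = []
    # An annotation key may be missing or None, so treat that as an empty list.
    for span in spans or []:
        key = (span["start"], span["end"])
        if key not in seen:
            seen.add(key)
            unique.append(span)
    return unique
```

In the reproduction script this could be applied right after the annotations are parsed, for example `spans = dedupe_spans(_s2orc['content']['annotations'][key])`. If the duplicated spans might carry differing attributes in your data, compare the full dicts before discarding anything.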