Annotations and Bibliography entry spans data integrity

cc Jessica Lam slack report

Describe the bug

1) Many annotation spans in the bulk download (obtained on 11 Oct 2023) appear to be repeated. It's easy enough to fix on my end; I'm just wondering whether there's something else going on.

2) For many papers, the bibliography entry spans in the bulk download do not correspond to the references reported by the Papers API. For example, PaperId ea4de8e24447b3debfbe9e9c697ab2b66f6663b6 is referenced by doi:10.24940/theijst/2021/v9/i7/st2107-011 both in the PDF and according to the Papers API, but there is no bibliography entry span for the cited paper in the bulk download. Is this because the bibliography entry could not be detected?
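For reference, the client-side workaround mentioned in 1) could look roughly like the sketch below. The helper name `dedupe_spans` is hypothetical, and it assumes each span is a dict whose `start` and `end` values are integers or numeric strings; it is not an official fix for the dataset.

```python
def dedupe_spans(spans):
    """Drop repeated (start, end) pairs, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for span in spans or []:  # an annotation value may be None
        # Assumption: start/end are ints or numeric strings
        key = (int(span["start"]), int(span["end"]))
        if key not in seen:
            seen.add(key)
            unique.append(span)
    return unique


# Toy input shaped like the 'title' spans of the first record in the example output
spans = [{"start": 1, "end": 120}, {"start": 697, "end": 816}] * 3
print(dedupe_spans(spans))
```

Any extra fields carried by a span are preserved, since only the offsets are used as the deduplication key.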

To Reproduce

from collections import Counter
import glob
import json

# Use one file from the bulk download (glob order is arbitrary)
filepath = glob.glob("data/s2orc/*")[0]
with open(filepath, 'r') as f:
    for i, line in enumerate(f):
        # Each line is one JSON record
        _s2orc = json.loads(line)
        # Annotation values may be JSON-encoded strings; decode them in place
        for key, val in _s2orc['content']['annotations'].items():
            if isinstance(val, str):
                try:
                    val = json.loads(val)
                except json.JSONDecodeError:
                    pass

                _s2orc['content']['annotations'][key] = val

        print(i)
        print(_s2orc['corpusid'])
        print(_s2orc['externalids'])
        print()

        text = _s2orc['content']['text']
        for key in ['title', 'bibentry']:
            print(key)

            # Count how often each (start, end) pair occurs
            spans = _s2orc['content']['annotations'][key]
            if spans:
                span2count = Counter([(span['start'], span['end']) for span in spans])
                print(', '.join(f"{span}: {count}" for span, count in span2count.most_common()))

            print()
        print("-" * 150)

        # Stop after the first 11 records
        if i == 10:
            break

Expected behavior

1) Annotation spans are distinct.

2) Bibliography entry spans correspond to the references reported by the Papers API.
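The two expectations can be phrased as checks on a single record, sketched below. Both function names are hypothetical. The second check is only a rough heuristic: it assumes span offsets index directly into `content.text` as Python string positions, and that the reference titles have already been fetched separately from the Papers API (no API call is made here).

```python
from collections import Counter


def repeated_spans(spans):
    """Map each (start, end) pair that occurs more than once to its count."""
    counts = Counter((s["start"], s["end"]) for s in spans or [])
    return {pair: n for pair, n in counts.items() if n > 1}


def unmatched_references(text, bibentry_spans, reference_titles):
    """Return the titles that appear in no bibentry span (case-insensitive substring match)."""
    # Assumption: offsets are valid slice positions in `text`
    entries = {text[s["start"]:s["end"]].lower() for s in bibentry_spans or []}
    return [t for t in reference_titles if not any(t.lower() in e for e in entries)]


# Toy record: one bibliography entry, annotated twice
text = "Intro. [1] A. Author. Deep Parsing. 2020."
span = {"start": text.index("[1]"), "end": len(text)}
bibentry_spans = [span, dict(span)]

print(repeated_spans(bibentry_spans))
print(unmatched_references(text, bibentry_spans, ["Deep Parsing", "Missing Paper"]))
```

Expectation 1) holds when `repeated_spans` returns an empty dict; expectation 2) holds, under this heuristic, when `unmatched_references` returns an empty list.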


Additional context

Example output from the script above:

0
262154195
{'arxiv': None, 'mag': '2909426167', 'acl': None, 'pubmed': '28989627', 'pubmedcentral': '5620991', 'dblp': None, 'doi': '10.1039/c7sc01347g'}

title
(1, 120): 3, (697, 816): 3

bibentry
(42224, 42444): 3

1
262155081
{'arxiv': None, 'mag': '2118671970', 'acl': None, 'pubmed': '12793882', 'pubmedcentral': '270690', 'dblp': None, 'doi': '10.1186/cc2196'}

title
(1, 56): 2, (867, 922): 2

bibentry

2
262157887
{'arxiv': None, 'mag': None, 'acl': None, 'pubmed': None, 'pubmedcentral': None, 'dblp': None, 'doi': '10.1177/20514158211013240'}

title
(1, 139): 12, (1060, 1198): 12

bibentry
(256119, 256428): 12, (256430, 258756): 12, (258758, 259129): 12

3
262214637
{'arxiv': None, 'mag': None, 'acl': None, 'pubmed': None, 'pubmedcentral': None, 'dblp': None, 'doi': '10.9771/rvh.v12i1.47857'}

title
(1, 175): 2, (295, 469): 2

bibentry
(26837, 27077): 2, (27079, 27343): 2, (27345, 27566): 2, (27568, 27989): 2, (27991, 28157): 2, (28159, 28376): 2, (28378, 28632): 2, (28634, 28780): 2, (28782, 29201): 2, (29203, 29389): 2, (29391, 29540): 2, (29542, 29609): 2, (29611, 29846): 2, (29848, 29963): 2, (29965, 30018): 2, (30020, 30468): 2, (30470, 30950): 2, (30952, 31123): 2, (31125, 31329): 2, (31331, 31615): 2, (31617, 31772): 2, (31774, 31940): 2, (31942, 32223): 2

4
262150823
{'arxiv': None, 'mag': None, 'acl': None, 'pubmed': None, 'pubmedcentral': None, 'dblp': None, 'doi': '10.1021/acs.orglett.1c03278.s001'}

title
(1, 96): 1, (391, 486): 1

bibentry
(15659, 16070): 1, (16072, 16405): 1, (16407, 16884): 1, (16886, 17195): 1, (17197, 17327): 1, (17329, 17431): 1, (17433, 17657): 1, (17659, 17883): 1, (17885, 18108): 1, (18110, 18130): 1, (18132, 18276): 1, (18278, 18435): 1, (18437, 18687): 1, (18689, 19010): 1, (19012, 19149): 1, (19151, 19464): 1, (19466, 19486): 1, (19488, 19745): 1, (19747, 19823): 1, (19825, 19870): 1, (19872, 19924): 1, (19926, 20130): 1, (20132, 20388): 1, (20390, 20563): 1, (20565, 20957): 1, (20959, 21188): 1, (21190, 21372): 1, (21374, 21578): 1

5
262151280
{'arxiv': None, 'mag': None, 'acl': None, 'pubmed': None, 'pubmedcentral': None, 'dblp': None, 'doi': '10.24940/theijst/2021/v9/i7/st2107-011'}

title
(1, 141): 2, (497, 637): 2

bibentry
(16806, 17225): 2, (17227, 17774): 2, (17776, 18148): 2, (18150, 18518): 2, (18520, 18856): 2, (18858, 19160): 2, (19162, 19370): 2, (19372, 19927): 2, (19929, 20246): 2, (20248, 20491): 2, (20493, 20658): 2, (20660, 21046): 2, (21048, 21301): 2, (21303, 21512): 2, (21514, 21723): 2, (21725, 22387): 2, (22389, 22627): 2
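In the sample output above, every span within a given record appears the same number of times (3, 2, 12, 2, 1 and 2 for records 0 to 5). A minimal sketch for checking whether that holds more widely is shown below; the helper name `repeat_factors` is hypothetical, and it assumes the annotations have already been decoded into lists of span dicts as in the reproduction script.

```python
from collections import Counter


def repeat_factors(annotations, keys=("title", "bibentry")):
    """Return the set of distinct repeat counts across the given annotation keys."""
    factors = set()
    for key in keys:
        # Missing or None annotation values are treated as having no spans
        counts = Counter((s["start"], s["end"]) for s in annotations.get(key) or [])
        factors.update(counts.values())
    return factors


# Toy input mirroring record 0 of the example output
annotations = {
    "title": [{"start": 1, "end": 120}, {"start": 697, "end": 816}] * 3,
    "bibentry": [{"start": 42224, "end": 42444}] * 3,
}
print(repeat_factors(annotations))
```

A single value in the returned set means every span in the record repeats equally often; more than one value would indicate that the duplication is not uniform within the record.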

allenai / s2-folks

Annotations and Bibliography entry spans data integrity #156