allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

Half of the citations are missing in recent bulk downloads #138

Closed aletar89 closed 1 year ago

aletar89 commented 1 year ago

Hi team, we've noticed that the citations bulk download has fewer citations than usual. The 2023-07-04 corpus had 2.1E9 citations, but 2023-08-15 has only 0.9E9. I think there was a similar event in the past where half the citations disappeared and then returned after a couple of releases, but this time it has been this way for a couple of releases already.

rodneykinney commented 1 year ago

Those counts don't match what we see on our side. These numbers come from querying the released .json.gz files:

release     is_matched  count
2023-07-04  true        2238231599
2023-07-04  false       337395225
2023-08-15  true        2262533935
2023-08-15  false       332121203

(is_matched means citedpaperid IS NOT NULL)
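For anyone who wants to reproduce these counts locally, here is a rough sketch of one way to do it. It is not the query used above. It assumes each released .json.gz file holds one JSON object per line and that unmatched citations carry a null (or absent) `citedpaperid`; the function name `count_citations` is made up for illustration.

```python
import gzip
import json
from collections import Counter


def count_citations(paths):
    """Count citation records, keyed by whether citedpaperid is non-null.

    Assumes JSON Lines input: one citation object per line in each .json.gz file.
    """
    counts = Counter()
    for path in paths:
        # Stream line by line so a multi-gigabyte file is never loaded whole.
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                # is_matched mirrors the definition above: citedpaperid IS NOT NULL
                is_matched = record.get("citedpaperid") is not None
                counts[is_matched] += 1
    return counts
```

Calling it on the full list of downloaded files for one release returns a `Counter` with `True` and `False` keys. If the total comes out far too low, a likely culprit is a partial download, so it is worth checking that every file in the release was fetched and decompresses cleanly.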

aletar89 commented 1 year ago

Seems to have resolved itself. Maybe the issue was on our side, though we didn't find anything to fix. We just re-ran it a couple more times and now we get the same numbers as you.