allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

Bug: the tldrs dataset contains multiple rows for the same corpusid #198

Closed romilly closed 2 weeks ago

romilly commented 1 month ago

Describe the Bug The tldrs dataset contains multiple rows for the same corpusid

To Reproduce Download the tldrs dataset.

Expected Behavior I'd expected at most one tldr for a given corpusid.

Actual Behavior Some corpusids have as many as 3 tldrs.

Screenshots Database query and partial result after importing the dataset into a Postgres table:

select corpusid, count(tldrs.corpusid) as count
from tldrs
group by corpusid
order by count desc;

corpusid,count
5088,3
25966,3
33101,3
76893,3
110803,3
96193,3
227316,3
502731,3
508505,3
490058,3
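For anyone who wants to check a downloaded file before importing it into a database, the same per-corpusid count can be sketched in Python. This is only a minimal sketch. It assumes the dataset is stored as JSON lines with a `corpusid` field on each record. The `text` field, the sample records, and the function names are hypothetical and only there for illustration.

```python
import json
from collections import Counter


def count_rows_per_corpusid(lines):
    """Count how many records share each corpusid in an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record["corpusid"]] += 1
    return counts


def duplicated_corpusids(counts):
    """Return {corpusid: count} for every corpusid that appears more than once."""
    return {cid: n for cid, n in counts.items() if n > 1}


# Hypothetical sample records; real files may carry different fields.
sample = [
    '{"corpusid": 5088, "text": "first summary"}',
    '{"corpusid": 5088, "text": "second summary"}',
    '{"corpusid": 5088, "text": "third summary"}',
    '{"corpusid": 12345, "text": "only summary"}',
]

print(duplicated_corpusids(count_rows_per_corpusid(sample)))
```

For a real file, an open file handle can be passed in place of `sample`, since the function only needs an iterable of lines.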

Environment Details Platform: Linux. Database: Postgres.

The application is at https://github.com/romilly/s2ag-corpus

cfiorelli commented 4 weeks ago

@romilly thanks for the report. Escalating internally.

cfiorelli commented 2 weeks ago

This should be resolved in the next release. Please note that the fix will not remove duplicates from previous datasets, and incremental updates will not remove them either. Users impacted by this bug will need to download a fresh copy of the upcoming dataset release.
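Until a fresh copy is available, a local stopgap could be to keep a single row per corpusid when loading an older release. The sketch below is not part of the official fix. It assumes records are dictionaries with a `corpusid` key, and it keeps the first row seen, which is an arbitrary choice because the thread does not say which of the duplicate tldrs is the correct one. The function name and the `text` field are hypothetical.

```python
def keep_first_per_corpusid(records):
    """Yield only the first record seen for each corpusid, in input order."""
    seen = set()
    for record in records:
        cid = record["corpusid"]
        if cid in seen:
            continue  # a row for this corpusid was already emitted
        seen.add(cid)
        yield record


# Hypothetical rows standing in for an older, duplicated release.
rows = [
    {"corpusid": 5088, "text": "first summary"},
    {"corpusid": 25966, "text": "another summary"},
    {"corpusid": 5088, "text": "duplicate summary"},
]

deduplicated = list(keep_first_per_corpusid(rows))
print([row["corpusid"] for row in deduplicated])
```

The set of seen ids is held in memory, so this approach suits files whose distinct corpusids fit in RAM.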