allenai / s2-folks

Public space for the user community of Semantic Scholar APIs to share scripts, report issues, and make suggestions.

Bug: the abstracts dataset appears to contain multiple identical repeated rows #197

Closed romilly closed 2 weeks ago

romilly commented 1 month ago

Describe the Bug

The abstracts dataset appears to contain multiple identical repeated rows.

To Reproduce

Download the abstracts dataset for a recent release_id. I found the problem with both the 2024-04-02 and 2024-05-14 releases.

Expected Behavior

I expected one row per corpusid for each paper where an abstract was available. If there was more than one abstract for a given paper, I'd expect to see some difference between the rows.

Actual Behavior

The majority of corpusids in the dataset have multiple rows containing identical data.
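One way to quantify this is a minimal sketch like the one below. It assumes each downloaded file is JSON Lines (optionally gzipped) with a `corpusid` field on every row; the function name `count_duplicate_rows` and the file name in the usage note are hypothetical and not part of the Semantic Scholar tooling.

```python
import gzip
import hashlib
import json
from collections import Counter


def count_duplicate_rows(path):
    """Return (total_rows, distinct_rows, corpusids_with_repeats) for one file.

    Assumes one JSON object per line with a "corpusid" field (an assumption
    about the dataset layout, not something confirmed in this thread).
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    row_counts = Counter()
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Serialise with sorted keys so that key order does not hide
            # otherwise identical rows, then hash to keep memory use small.
            canonical = json.dumps(record, sort_keys=True)
            digest = hashlib.sha256(canonical.encode("utf-8")).digest()
            row_counts[(record.get("corpusid"), digest)] += 1

    total = sum(row_counts.values())
    distinct = len(row_counts)
    repeated_ids = {cid for (cid, _), n in row_counts.items() if n > 1}
    return total, distinct, len(repeated_ids)
```

Calling `count_duplicate_rows("abstracts-part0.jsonl.gz")` on one unzipped or gzipped part gives the total row count, the number of distinct rows, and how many corpusids have at least one exact repeat. It only looks within a single file, so repeats spread across parts would need the counts to be merged.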

Screenshots

Here is a screenshot of the first few rows of an unzipped abstracts download.

[Image: duplicate_abstracts]

Environment Details

Platform: Linux Mint

I'm using your Python code to download the datasets.

The complete application is here: https://github.com/romilly/s2ag-corpus

cfiorelli commented 4 weeks ago

Thanks for the report, @romilly. I'm going to escalate this internally for triage.

cfiorelli commented 2 weeks ago

The cause of this duplication should be resolved in the next release. Thanks for reporting this!

cfiorelli commented 2 weeks ago

Please note that the fix will not remove duplication from previous datasets, and incremental updates will not resolve it either. Users impacted by this bug will need to download a fresh copy of an upcoming dataset release.
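For anyone who has to keep working with an affected release in the meantime, exact repeats could be dropped locally with a sketch along these lines. This is a stopgap under stated assumptions, not maintainer guidance and not a substitute for the fresh download: it assumes gzipped JSON Lines input, the helper name `write_deduplicated` is hypothetical, and it only removes rows that are byte-for-byte equivalent once their JSON keys are sorted.

```python
import gzip
import hashlib
import json


def write_deduplicated(src_path, dst_path):
    """Copy a gzipped JSON Lines file, keeping the first copy of each row.

    Returns the number of rows dropped as exact duplicates.
    """
    seen = set()
    dropped = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            # Sort keys so two rows with the same content but different
            # key order are treated as the same row.
            canonical = json.dumps(json.loads(line), sort_keys=True)
            digest = hashlib.sha256(canonical.encode("utf-8")).digest()
            if digest in seen:
                dropped += 1
                continue
            seen.add(digest)
            dst.write(line if line.endswith("\n") else line + "\n")
    return dropped
```

The set of digests is kept per call, so duplicates that span several part files are not caught unless the same set is shared across them.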

romilly commented 1 week ago

Thanks so much for your help in resolving this.