romilly closed this issue 2 weeks ago
Thanks for the report, @romilly. I'm going to escalate this internally for triage.
The cause of this duplication should be resolved in the next release. Thanks for reporting this!
Please note that the fix will not remove duplicates from previous datasets, and incremental updates will not resolve this either. Users impacted by this bug will need to download a fresh copy of the upcoming dataset release.
Thanks so much for your help in resolving this.
Describe the Bug
The abstracts dataset appears to contain multiple identical repeated rows.
To Reproduce
Download the abstracts dataset for a recent release_id. I found the problem with both the 2024-04-02 and 2024-05-14 releases.
Expected Behavior
I expected one row per corpusid for each paper where an abstract was available. If there were more than one abstract for a given paper, I'd expect to see some difference between the rows.
Actual Behavior
The majority of corpusids in the dataset have multiple rows containing identical data.
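To quantify the duplication, a check along these lines could be run against an unzipped abstracts file. This is only a minimal sketch: it assumes the file is JSON Lines (one JSON object per line) and that each record has a `corpusid` field. The function name `count_duplicate_rows` and the sample records are made up for illustration and are not part of the dataset tooling.

```python
import hashlib
import json
from collections import Counter


def count_duplicate_rows(lines):
    """Return (total_rows, distinct_rows, corpusids_with_identical_repeats).

    `lines` is any iterable of JSON Lines strings (assumed format).
    """
    seen = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Canonicalise so that key order does not hide identical records.
        canonical = json.dumps(record, sort_keys=True)
        digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
        seen[(record.get("corpusid"), digest)] += 1
    total = sum(seen.values())
    distinct = len(seen)
    # corpusids for which at least one identical record appears more than once
    repeated_ids = {cid for (cid, _), n in seen.items() if n > 1}
    return total, distinct, len(repeated_ids)


# Hypothetical sample: corpusid 1 repeats identically, corpusid 3 has two
# genuinely different abstracts and so is not counted as a duplicate.
sample = [
    '{"corpusid": 1, "abstract": "A"}',
    '{"corpusid": 1, "abstract": "A"}',
    '{"abstract": "A", "corpusid": 1}',
    '{"corpusid": 2, "abstract": "B"}',
    '{"corpusid": 3, "abstract": "C"}',
    '{"corpusid": 3, "abstract": "C, revised"}',
]
print(count_duplicate_rows(sample))
```

For a real download, the same function can be given an open file object, for example one returned by `gzip.open(path, "rt", encoding="utf-8")` for a still-compressed file, where `path` is wherever the file was saved locally.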
Screenshots
Here is a screenshot of the first few rows of an unzipped abstracts download.
![duplicate_abstracts](https://github.com/allenai/s2-folks/assets/149310/59b6e33c-c562-4253-8f75-2d91b1d2c96c)
Environment Details
Platform: Linux Mint
I'm using your Python code to download the datasets.
The complete application is here: https://github.com/romilly/s2ag-corpus