Closed romilly closed 2 weeks ago
@romilly thanks for the report. Escalating internally.
This should be resolved in the next release. Please note that the fix will not remove duplicates from previous datasets, and incremental updates will not resolve this either. Users impacted by this bug will need to download a fresh copy of the upcoming dataset release.
Describe the Bug The tldrs dataset contains multiple rows for the same corpusid
To Reproduce Download the tldrs dataset
Expected Behavior I'd expected at most one tldr for a given corpusid.
Actual Behavior Some corpusids have as many as 3 tldrs.
Screenshots Database query and partial result after importing the dataset into a Postgres table:
select corpusid, count(tldrs.corpusid) as count
from tldrs
group by corpusid
order by count desc;

corpusid,count
5088,3
25966,3
33101,3
76893,3
110803,3
96193,3
227316,3
502731,3
508505,3
490058,3
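For anyone who wants to run the same check before importing into a database, here is a minimal Python sketch. It assumes the dataset files are JSON Lines with a `corpusid` field on each record. The function name `duplicate_corpusids` and the sample rows are made up for illustration and are not part of the dataset or the s2ag-corpus application.

```python
import json
from collections import Counter


def duplicate_corpusids(lines):
    """Return {corpusid: row_count} for every corpusid seen more than once.

    `lines` is any iterable of JSON strings, one record per line
    (assumed layout; adjust the key if the real files differ).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[json.loads(line)["corpusid"]] += 1
    return {cid: n for cid, n in counts.items() if n > 1}


# Hypothetical sample rows, not taken from the real dataset.
sample = [
    '{"corpusid": 1, "text": "first summary"}',
    '{"corpusid": 1, "text": "second summary"}',
    '{"corpusid": 2, "text": "only summary"}',
]

print(duplicate_corpusids(sample))
```

Since the function accepts any iterable of lines, an open file handle can be passed in directly. If the downloaded files are compressed, `gzip.open(path, "rt")` from the standard library should work the same way.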
Environment Details Platform: Linux; Database: Postgres
The application is at https://github.com/romilly/s2ag-corpus