PSLmodels / tax-microdata-benchmarking

A project to develop a benchmarked general-purpose dataset for tax reform impact analysis.
https://pslmodels.github.io/tax-microdata-benchmarking/

Determine whether tmd.csv has duplicate records #219

Open donboyd5 opened 2 hours ago

donboyd5 commented 2 hours ago

tmd data files had duplicate records at one point, per issue #107.

Determine whether there are duplicate records now. If not, close this issue.

If there are, eliminate duplicate records and close this issue.

martinholmer commented 2 hours ago

@donboyd5, when I look at issue #107, there is nothing about the contents of the tmd.csv file. So I don't understand why you say:

> tmd data files had duplicate records at one point, per issue https://github.com/PSLmodels/tax-microdata-benchmarking/issues/107

donboyd5 commented 1 minute ago

@martinholmer, tmd_2021.csv, now defunct, is the data discussed in #107. And tmd_2021.csv is a superset of tmd.csv - it had the same records, but more variables. If tmd_2021.csv (examined) had duplicate records then tmd.csv (not examined) must have had duplicate records. Hence the general mention above of "tmd data files" having duplicate records.

I believe @nikhilwoodruff probably solved the duplicate records issue, but he did not weigh in on it and I don't remember the last status - there were a lot of issues to resolve and this could have slipped through the cracks.

Thus the right course of action is for us to determine whether there are duplicate records now. If not, we can close this issue. If there are, we'll probably want to eliminate duplicate records, even if only by collapsing them, and close this issue.
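
For what it's worth, here is a minimal sketch of how the check could be done. It is not code from the repo. The function name `count_duplicate_records` is hypothetical, and it assumes tmd.csv has a header row and a unique `RECID` identifier column that should be left out of the comparison (otherwise no two rows could ever match). Whether the weight column should also be ignored is a judgment call I have not made here.

```python
import csv
from collections import Counter


def count_duplicate_records(path, ignore=("RECID",)):
    """Return (n_records, n_extra) for the CSV file at `path`.

    n_extra is the number of rows that repeat an earlier row once the
    columns named in `ignore` are dropped. The default of ignoring
    "RECID" is an assumption about the file layout, not a documented fact.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # Columns that take part in the comparison.
        keep = [c for c in (reader.fieldnames or []) if c not in ignore]
        # Values are compared as raw strings, so "1" and "1.0" count as different.
        counts = Counter(tuple(row[c] for c in keep) for row in reader)
    n_records = sum(counts.values())
    n_extra = n_records - len(counts)  # rows beyond the first copy of each record
    return n_records, n_extra
```

Calling `count_duplicate_records("tmd.csv")` from the folder that holds the file would return the total row count and the number of repeated rows. A second value of zero would mean we can close this issue. If it is nonzero, collapsing would mean keeping one copy of each repeated record and, presumably, summing the weights of the copies.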