fivethirtyeight / data

Data and code behind the articles and graphics at FiveThirtyEight
https://data.fivethirtyeight.com/
Creative Commons Attribution 4.0 International

Data Cleaning for Riddler Wars #261

Closed aakashsur closed 1 year ago

aakashsur commented 4 years ago

Looks like there are some invalid rows in the data:

https://github.com/fivethirtyeight/data/blob/master/riddler-castles/castle-solutions-3.csv#L238
https://github.com/fivethirtyeight/data/blob/master/riddler-castles/castle-solutions-3.csv#L818
https://github.com/fivethirtyeight/data/blob/master/riddler-castles/castle-solutions-3.csv#L1030

https://github.com/fivethirtyeight/data/blob/master/riddler-castles/castle-solutions-4.csv#L182
https://github.com/fivethirtyeight/data/blob/master/riddler-castles/castle-solutions-4.csv#L278 (O instead of 0)
https://github.com/fivethirtyeight/data/blob/master/riddler-castles/castle-solutions-4.csv#L498
https://github.com/fivethirtyeight/data/blob/master/riddler-castles/castle-solutions-4.csv#L853
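A cell like the `O` typed in place of `0` can be caught by checking that every castle value is a plain run of digits. Here is a minimal Ruby sketch, assuming the castle columns are named `Castle 1`, `Castle 2`, and so on. The `bad_cells` helper and the sample data are made up for illustration and are not taken from the real files:

```ruby
require 'csv'

# Returns [row_index, column, value] for every castle cell that is not
# a plain run of digits. row_index is 0-based and excludes the header.
def bad_cells(csv_text)
  found = []
  CSV.parse(csv_text, headers: true).each_with_index do |row, i|
    row.each do |col, val|
      # Only inspect castle columns; other fields may hold free text.
      next unless col.to_s.start_with?('Castle ')
      # Empty cells parse as nil, so nil.to_s ("") is flagged as well.
      found << [i, col, val] unless val.to_s.match?(/\A\d+\z/)
    end
  end
  found
end

# Hypothetical sample: row 1 has a letter O, row 2 has an empty cell.
sample = <<~CSV
  Castle 1,Castle 2,Castle 3
  50,25,25
  5O,25,25
  40,,60
CSV

bad_cells(sample).each { |cell| p cell }
```

This check only looks at the shape of each cell. It says nothing about whether the row's total is correct.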

There are also invalid rows where the number of soldiers does not add up to 100. Here are my counts:

38 invalid rows from the first war.
30 invalid rows from the second war.
142 invalid rows from the third war.
72 invalid rows from the fourth war.
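For reference, a total check of this kind can be sketched in Ruby as follows. This is only an illustration: `count_bad_totals`, the `Notes` column, and the sample rows are hypothetical. Because `String#to_i` is lenient with non-numeric cells, it may not reproduce the counts quoted above exactly.

```ruby
require 'csv'

# Counts rows whose castle columns do not sum to the expected total.
# Assumes castle columns are named "Castle 1", "Castle 2", and so on.
# Non-numeric cells are treated leniently by String#to_i ("5O" => 5).
def count_bad_totals(csv_text, expected = 100)
  CSV.parse(csv_text, headers: true).count do |row|
    castles = row.headers.select { |h| h.to_s.start_with?('Castle ') }
    castles.sum { |h| row[h].to_i } != expected
  end
end

# Hypothetical sample data with one free-text column.
sample = <<~CSV
  Castle 1,Castle 2,Castle 3,Notes
  50,25,25,balanced
  60,30,20,too many
  10,10,10,too few
CSV

puts count_bad_totals(sample) # two of the three sample rows do not sum to 100
```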

jayb commented 1 year ago

Done, using the following Ruby script:

require 'csv'

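# Allowed cell values: the integer strings "0" through "100".
valid = (0..100).map(&:to_s)
# Names of the castle columns: "Castle 1" through "Castle 13".
keys = (1..13).map { |i| "Castle #{i}" }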

Dir.glob("castle*.csv").each do |fname|
  rows = CSV.read(fname, :headers => true).map(&:to_h)
  out = []
  rows.each_with_index do |row, i|
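    # Castle columns actually present in this row.
    castles = row.keys.select { |k| keys.include?(k) }
    # A cell is invalid unless it is exactly one of the allowed integer strings.
    invalid = castles.reject { |k| valid.include?(row[k]) }
    total = castles.sum { |k| row[k].to_i }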
    if invalid.size > 0 || total != 100
      p [fname, i, total, row.select{|k,v| keys.index(k)}]
    else
      out << row
    end
  end

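  # Leave the file untouched if no valid rows remain; out.first would be nil.
  next if out.empty?

  # Overwrite the file in place with the header and the valid rows only.
  CSV.open(fname, 'w') do |csv|
    headers = out.first.keys
    csv << headers
    out.each { |o| csv << headers.map { |h| o[h] } }
  end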
end
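
Since the script above overwrites each file in place, it can help to preview what would be dropped first. The following is a rough sketch, not part of the original script. `partition_rows` is a hypothetical helper that applies the same two rules (each castle cell is an integer string from "0" to "100", and the castle cells sum to 100) to CSV text and returns the kept and rejected rows without writing anything. Unlike the script, it matches any column named `Castle N` rather than only 1 through 13.

```ruby
require 'csv'

# Splits CSV text into [kept, rejected] arrays of row hashes.
def partition_rows(csv_text)
  valid = (0..100).map(&:to_s)
  CSV.parse(csv_text, headers: true).map(&:to_h).partition do |row|
    castles = row.keys.select { |k| k.to_s.match?(/\ACastle \d+\z/) }
    castles.all? { |k| valid.include?(row[k]) } &&
      castles.sum { |k| row[k].to_i } == 100
  end
end

# Hypothetical sample data, not taken from the real files.
sample = <<~CSV
  Castle 1,Castle 2,Castle 3
  50,25,25
  5O,25,25
  60,30,20
CSV

kept, rejected = partition_rows(sample)
puts "kept #{kept.size}, rejected #{rejected.size}"
rejected.each { |row| p row }
```

To preview a real file, one could pass `File.read(fname)` to the helper and inspect `rejected` before running the destructive version.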