Open alexsb opened 10 years ago
If this is active, where would the code fit? It feels like there is a range of possibilities, from internal to external:
I would lean towards the external end of the spectrum: we're already expecting a lot from our users for data prep, so it doesn't seem out of line to ask them to use a tool we provide. On the other hand, making the whole thing easier wouldn't be bad, either.
Here's a rough draft of an external script:
Given:
Green: 2
Green, Red, Tasty: 1
Red: 1
Red, Tasty: 2
run:
headers = []
rows = []

File.readlines(ARGV.shift).each do |line|
  line.strip!
  next if line.empty? # skip blank lines
  labels_joined, count = line.split(/\s*:\s*/)
  labels = labels_joined.split(/\s*,\s*/)
  # Record each label the first time it appears, preserving input order.
  labels.each do |label|
    headers << label unless headers.include?(label)
  end
  # Emit one row per counted item.
  count.to_i.times { rows << labels }
end

puts 'fake_id,' + headers.join(',')
rows.each_with_index do |row, i|
  flags = headers.map { |header| row.include?(header) ? 1 : 0 }
  puts "#{i}," + flags.join(',')
end
to produce:
fake_id,Green,Red,Tasty
0,1,0,0
1,1,0,0
2,1,1,1
3,0,1,0
4,0,1,1
5,0,1,1
(and sorry for the bother if this isn't active.)
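One way to sanity-check the conversion is to collapse the item-space CSV back into the counts format and compare it with the original input. The following is only a sketch: `collapse_to_counts` is a hypothetical helper, not part of the script above, and it assumes the first column is the fake id and that labels contain no commas.

```ruby
# Hypothetical helper (an assumption, not from the original script):
# collapse item-space CSV back into "labels: count" lines.
def collapse_to_counts(csv_text)
  lines = csv_text.lines.map(&:chomp).reject(&:empty?)
  labels = lines.shift.split(',').drop(1) # drop the fake_id column
  counts = Hash.new(0)
  lines.each do |line|
    flags = line.split(',').drop(1)
    # Keep the labels whose column is set to 1 for this row.
    present = labels.select.with_index { |_, i| flags[i] == '1' }
    counts[present] += 1
  end
  counts.map { |set, n| "#{set.join(', ')}: #{n}" }
end

csv_text = <<~CSV
  fake_id,Green,Red,Tasty
  0,1,0,0
  1,1,0,0
  2,1,1,1
  3,0,1,0
  4,0,1,1
  5,0,1,1
CSV

puts collapse_to_counts(csv_text)
```

For the sample output above, this should print the same four lines that the script was given as input.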
Green: 2
Green, Red, Tasty: 1
Red: 1
Red, Tasty: 2
This format is useful mainly for cases where we have existing Venn diagrams. We wouldn't have any data in the item space.
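For comparison, here is a rough sketch of what the more internal end of the spectrum might look like: reading the region counts directly, with no detour through item space. Both `parse_counts` and `set_total` are hypothetical names used only for illustration, and the sketch assumes each line describes one exclusive region of the diagram, as in the example input.

```ruby
# Hypothetical helper: parse "labels: count" lines into a hash keyed by
# the sorted label set of each region.
def parse_counts(text)
  text.each_line.with_object({}) do |line, counts|
    line = line.strip
    next if line.empty?
    labels_joined, count = line.split(/\s*:\s*/)
    counts[labels_joined.split(/\s*,\s*/).sort] = Integer(count)
  end
end

# Total size of one set: sum every region whose label set includes it.
def set_total(counts, label)
  counts.sum { |labels, n| labels.include?(label) ? n : 0 }
end

counts = parse_counts(<<~TEXT)
  Green: 2
  Green, Red, Tasty: 1
  Red: 1
  Red, Tasty: 2
TEXT

%w[Green Red Tasty].each do |label|
  puts "#{label}: #{set_total(counts, label)}"
end
```

With the example input, the totals come out as 3 for Green, 4 for Red, and 3 for Tasty, which matches the column sums of the generated CSV.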