Closed: Oozlum closed this issue 7 years ago.
This seems useful for only a few use cases and should probably just be done after parsing.
The reason this option (and the other one, rowTable) is useful is that it improves efficiency (in both memory and speed) and allows the data to be processed in a much more Lua-idiomatic way, using closures, etc.
Let's take a simple example: we have CSV files containing the inventories of a number of libraries, each holding a very large number of books. We are interested only in Crime books, which make up a much smaller percentage of each library's inventory.
As it stands, we have to load each CSV file into memory as a string, then parse the string into an equivalent Lua table (at which point we've essentially doubled the memory requirement, until the string is released). Finally, we have to iterate over the entire table again, finding and processing the matching rows.
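To make the cost concrete, here is a rough sketch of that current two-pass workflow (the filter loop and the `crime` table are mine, added only for illustration):

```lua
local ftcsv = require('ftcsv')

-- First pass: parse the whole file into one large table (the full inventory).
local inventory = ftcsv.parse('british_museum.csv', ',')

-- Second pass: walk the full table just to keep the Crime rows.
local crime = {}
for _, row in ipairs(inventory) do
  if row.Genre == 'Crime' then
    crime[#crime + 1] = row
  end
end
```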
With these two options, we can instead do this:
```lua
local ftcsv = require('ftcsv')

-- Return a function that will discard all rows that do not match the given
-- genre, and which will insert the location and an incrementing ID into
-- each matching row.
local function row_func_generator(genre, location)
  local id = 1
  return function(row)
    if row and row.Genre == genre then
      row.Location = location
      row.ID = id
      id = id + 1
      return row
    end
    -- Implicit return nil discards non-matching rows.
  end
end

local options = { rowTable = {} }

options.rowFunc = row_func_generator('Crime', 'BritishMuseum')
ftcsv.parse('british_museum.csv', ',', options)

options.rowFunc = row_func_generator('Crime', 'Boston')
ftcsv.parse('boston.csv', ',', options)

options.rowFunc = row_func_generator('Crime', 'Stockholm')
ftcsv.parse('stockholm.csv', ',', options)
```
We now have a single table, options.rowTable, containing only the data we are interested in, with Location and ID added to each row to identify its source. We haven't used any more memory than is required to hold the final data, and we've processed each dataset only once.
Hmmm, alright, I see where you are coming from. With the current implementation, yeah - this would be the correct approach for accomplishing what you are doing. I think it would probably be better to create an iterator (as per #4). That way, extensions (like those proposed here) could be added without having to pass in extra functions, and it would end up being far more versatile. This would allow you to merge tables with a small, fixed memory overhead (and also modify specific rows). I'll explore the iterator-based approach over the next few days.
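To sketch what that might look like (the `ftcsv.parseLine` iterator named here is hypothetical at this point - any API that yields one parsed row at a time would do), the example above could become:

```lua
local ftcsv = require('ftcsv')

-- Hypothetical iterator API: yields one parsed row per loop iteration,
-- so only a single row needs to be held in memory at a time.
local crime, id = {}, 1
for _, row in ftcsv.parseLine('british_museum.csv', ',') do
  if row.Genre == 'Crime' then
    row.Location = 'BritishMuseum'
    row.ID = id
    id = id + 1
    crime[#crime + 1] = row
  end
end
```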
rowFunc will be called after each data row is parsed, as follows: `output[line] = rowFunc(output[line], line)`. If `output[line]` is nil afterwards, `line` will not be incremented, so the slot is reused and the row is effectively discarded.
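Purely for illustration, here is a self-contained sketch of that contract (the sample data and rowFunc are made up; `output` and `line` stand in for the parser's internal result table and row index, not ftcsv's actual code):

```lua
-- Rows as the parser would produce them, one at a time.
local parsed_rows = { { v = 1 }, { v = 2 }, { v = 3 }, { v = 4 } }

-- Example rowFunc: keep only rows with an even value.
local function rowFunc(row, line)
  if row.v % 2 == 0 then return row end
  -- implicit nil: the row is discarded
end

local output, line = {}, 1
for _, row in ipairs(parsed_rows) do
  output[line] = row                          -- the freshly parsed row
  output[line] = rowFunc(output[line], line)  -- the documented call
  if output[line] ~= nil then
    line = line + 1  -- advance only when rowFunc kept the row
  end
end
-- output now holds { v = 2 } and { v = 4 }; discarded slots were overwritten.
```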