FourierTransformer / ftcsv

a fast csv library written in pure Lua
MIT License

Added rowFunc as optional parameter to ftcsv.parse #8

Closed Oozlum closed 7 years ago

Oozlum commented 7 years ago

rowFunc will be called after parsing each data row as follows: output[line] = rowFunc(output[line], line)

If output[line] is nil afterwards, line is not incremented, so that row is effectively discarded.
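A minimal sketch of that contract, using a stand-in loop rather than the real ftcsv internals (apply_row_func and the sample rows are illustrative only):

```lua
-- Stand-in for the parser's inner loop, showing the proposed rowFunc
-- contract: output[line] = rowFunc(output[line], line), and a nil result
-- means the line index is not advanced.
local function apply_row_func(rows, rowFunc)
  local output = {}
  local line = 1
  for _, row in ipairs(rows) do
    output[line] = rowFunc(row, line)
    if output[line] ~= nil then
      line = line + 1  -- nil results are overwritten by the next row
    end
  end
  return output
end

-- Example: keep only rows whose Genre is "Crime".
local filtered = apply_row_func(
  { { Genre = "Crime" }, { Genre = "History" }, { Genre = "Crime" } },
  function(row) if row.Genre == "Crime" then return row end end
)
-- filtered holds the two matching rows, densely packed at indices 1 and 2.
```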

coveralls commented 7 years ago


Coverage decreased (-0.2%) to 99.788% when pulling 7bdd4aa5bda3ddaa32303c81c90776bae3d9c1e7 on Oozlum:rowfunc_option into 4a2c22fb0a63cc89a33623ffcc44e08a685893f8 on FourierTransformer:master.

FourierTransformer commented 7 years ago

This seems useful for only a few use cases and should probably just be done after parsing.

Oozlum commented 7 years ago

The reason this (and the other option, rowTable) is useful is that it improves efficiency (both memory and speed) and allows data to be processed in a much more Lua-idiomatic way, using closures, etc.

Let's take a simple example: we have CSV files containing the inventories of a number of libraries, each holding a very large number of books. We are interested only in Crime books, which make up a much smaller percentage of each library's inventory.

As it stands, we have to load each CSV file into memory as a string. We have to parse the string into an equivalent Lua table (at which point we've essentially doubled the memory requirement, until the string is released). Finally, we have to iterate the entire table again, finding and processing the correct rows.
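The parse-then-filter pattern described above looks roughly like this. Here `rows` stands in for the table that ftcsv.parse would return, and the Title/Genre field names are hypothetical:

```lua
-- `rows` represents the fully parsed table; the entire file has already
-- been loaded and converted before any filtering can happen.
local rows = {
  { Title = "A", Genre = "Crime" },
  { Title = "B", Genre = "History" },
  { Title = "C", Genre = "Crime" },
}

-- A second full pass over the data to pick out the rows we want.
local crime = {}
for _, row in ipairs(rows) do
  if row.Genre == "Crime" then
    crime[#crime + 1] = row
  end
end
-- crime holds only the matching rows, but the full table was built first.
```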

With these two options, we can instead do this:

-- return a function that will discard all rows that do not match the given genre and which will
-- insert the location and an incrementing location ID into the row.
local function row_func_generator(genre, location)
  local id = 1
  return function(row)
    if row and row.Genre == genre then
      row.Location = location
      row.ID = id
      id = id + 1
      return row
    end
    -- implicit return nil, discards non-matching rows.
  end
end

local options = { rowTable = {} }

options.rowFunc = row_func_generator('Crime', 'BritishMuseum')
ftcsv.parse('british_museum.csv', ',', options)
options.rowFunc = row_func_generator('Crime', 'Boston')
ftcsv.parse('boston.csv', ',', options)
options.rowFunc = row_func_generator('Crime', 'Stockholm')
ftcsv.parse('stockholm.csv', ',', options)

We now have a single table, options.rowTable, containing only the data we are interested in, with a location and ID added to each row to identify its source. We haven't used any more memory than is required to hold the final data, and we've processed each dataset only once.

FourierTransformer commented 7 years ago

Hmmm, alright, I see where you are coming from. With the current implementation, yeah - this would be the correct approach for what you are trying to accomplish. I think it would probably be better to create an iterator (as per #4). That way, extensions like the ones proposed here could be added without having to pass in extra functions, and the result would be far more versatile. This would allow you to merge tables with a small fixed memory overhead (and also modify specific rows). I'll explore the iterator-based approach over the next few days.
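One possible shape for such an iterator is a coroutine-driven generic for-loop, where the caller filters and transforms rows inline and only ever holds one row plus its own results in memory. This is a sketch over a pre-split list of comma-separated lines, not ftcsv's actual implementation (the rows function and its naive field splitting are illustrative only):

```lua
-- Yield (index, row) pairs one at a time from a list of CSV lines.
-- Naive splitting: no quoting or escaping, for illustration only.
local function rows(lines, headers)
  return coroutine.wrap(function()
    for i, line in ipairs(lines) do
      local row = {}
      local col = 1
      for field in string.gmatch(line, "([^,]+)") do
        row[headers[col]] = field
        col = col + 1
      end
      coroutine.yield(i, row)
    end
  end)
end

-- The caller keeps only what it needs; no intermediate full table is built.
local kept = {}
for i, row in rows({ "X,Crime", "Y,History" }, { "Title", "Genre" }) do
  if row.Genre == "Crime" then kept[#kept + 1] = row end
end
-- kept holds the single matching row.
```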