AlexMili / torch-dataframe

Utility class to manipulate dataset from CSV file
MIT License
67 stars 8 forks source link

Added a TestSuite, bugfixes and a rockspec #1

Closed gforge closed 8 years ago

gforge commented 8 years ago

Hi Alex,

I love what you've done with the torch-dataframe. This was exactly the foundation that I was looking for. Since I want to add more functionality to it I've started with adding a TestSuite in order to make sure I don't introduce any changes. In the making I found a few bugs, most important the to_tensor didn't work with the ipairs. I've also changed started adding som dok.unpack in order to make the parameter handling smoother. I'm uncertain of how the update works so I haven't added any check there and I slightly changed the unique - it now counts the frequencies of variables.

My plan is to extend your package with:

Note that the current rockspec points to my repository and you need to change this if you want to use the luarocks installer.

Thanks! I hope you find the additions valuable. Max

AlexMili commented 8 years ago

Hey Max !

What an amazing job you did ! I am glad that my little piece of code helped you :) Your TestSuite class is wonderful, that is exactly what was missing to this class !

Your proposal are all welcomes and appreciated ;)

Concerning the unique function counting frequencies, I am not decided if it's better to separate the frequencies count in another function or keep it that way.

About your commits I have some comments :

The update function browse the entire dataset applying condition function (first argument of update) to each line. When a line meet the requirements you set in your custom condition function, the line is passed to the second function in update arguments (update_function arg) and then use _updated_single_row to update the specific line and so on.

Many thanks again for your contribution :)

gforge commented 8 years ago

Hi Alex!

Great to hear that you like my additions. I was afraid that you'd dropped the project and it will be great to have some input on my additions.

I agree about the unique function, I suggest adding a value_conts - this is intuitive and sticks with the original intention of keeping Pandas design principles. Compared to R's table() value_counts makes much more sense.

I'm not sure exactly what you mean by the add_column. As you can see the TestSuite has a learning curve, I found the assertTable a little late (first appears on row 342). Feel free to add/modify whatever you feel like.

I'll add a test for the update(). One thing that I think should be fixed is the where, it currently fetches the first match - the Pandas where returns all matching entries. One way would be to have a private _where that returns the index numbers that the update() also uses.

I will start issues for the suggested extensions so that we can have a closed discussion regarding each extension's implementation.

AlexMili commented 8 years ago

In fact, I don't use the lib everyday so I try to improve it when I have time but I am still in the run !

Anyway, thanks a lot for the new test ! :)

About add_column() I was referring to the file Dataframe.lua and the following code :

if (default_value ~= 0) then
    if (type(default_value) == 'table') then
        assert(#default_value == self.n_rows,
                   'The default values don\'t match the number of rows')
    end
    default_value = default_value
else
    default_value =  0
end

Which could be simplified to :

if (type(default_value) == 'table') then
    assert(#default_value == self.n_rows,
        'The default values don\'t match the number of rows')
end

I totally agree with your suggestions. I am going to close this PR and open an issue for each modification/bug you mentioned to keep it clean. I let you create the issues about your three proposals to give more details about your vision.

gforge commented 8 years ago

Added the suggested simplification. Will you merge the PR or do you want something more?

AlexMili commented 8 years ago

It's perfect !