SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License
1.04k stars 139 forks source link

.from_excel should support .xlsx file extension #344

Closed v0dro closed 6 years ago

v0dro commented 7 years ago

The DataFrame.from_excel method does not support loading XLSX files since the spreadsheet gem does not support it.

We should write new functionality so that XLSX can be supported using the rubyXL gem.

v0dro commented 7 years ago

@athityakumar you should incorporate this into your project if nobody submits a PR before the coding period.

jeydurai commented 7 years ago

It will be good to look at write_excel as well for xlsx .

zverok commented 7 years ago

@jeydurai Ideas on how far it should go? Like styling of output and something something?

jeydurai commented 7 years ago

1) While reading as daru dataframe and writing back as excel, it should preserve the original formatting on both ways. 2) Should be able to read the default sheet (selected one) if we call for read_excel function in daru object, in addition to which, there should be a functionality to create an Excel object, by which, any specific sheet can be read by giving the sheet name with an option to specify which column in the source file to be considered as Index of daru dataframe. 3) We should be able to save multiple daru dataframes as multiple sheet of a single excel file. 2) It should have an engine to give pivot_table functionalities. (The most important) 3) Feature should be there to write the pivot table object as excel file

zverok commented 7 years ago

@jeydurai Thanks, that is pretty informative list, we'll work our way on it.

@athityakumar I believe you should be really interested in this discussion :) @Shekharrajak for you it is also may be useful (as "export to Excel with pivot tables" COULD be part of "views" rather than exporters, but it is a point to discuss again).

athityakumar commented 7 years ago

@v0dro - Sure, I'll try to accommodate these features in my project timeline. @zverok - Yes, I'm looking forward to this. :) @Shekharrajak - Regarding (4) and (5), we can consider using the pivot-table gem.

@jeydurai - Thanks for this list. Feel free to add more, if you come across any more potential use cases - my coding period starts from May 30th. I'll try incorporating as much features as possible within the timeline.

Regarding (3), would something like this suffice as use case?

df1.to_excel "filename.xlsx", sheet: 1
df2.to_excel "filename.xlsx", sheet: 2
jeydurai commented 7 years ago

@athityakumar I will surely come back in a day or two in order to add more use cases. @athityakumar For the use cases of (3), the following syntax may be helpful. Reading: xlsx = Daru::DataFrame.ExcelFile(PATH\FileName.xlsx) df_sheet1 = xlsx.parse('Sheet1') df_sheet2 = xlsx.parse('Sheet2') OR df_sheet1 = xlsx.parse(0) df_sheet2 = xlsx.parse(1)

Writing: (suggested use case looks like saving the file as many times as the number of df/sheets we have. However, the following syntax may be good). writer = Daru::DataFrame.ExcelWriter('path/to/file_name.xlsx', engine='xlsxwriter') # to differentiate if we need XL or XLSX. df1.to_excel(writer, sheet_name='Sheet1') df2.to_excel(writer, sheet_name='Sheet2') df3.to_excel(writer, sheet_name='Sheet3') writer.save (This could act as lazy so that only when we call save method from the writer object, the functionality should start writing to disc.)

Shekharrajak commented 7 years ago

We can create pivot table using the gem pivot_table as suggested by @athityakumar and using Daru::Formatters::Table.format(data, headers: headers, row_headers: row_headers) , we can get the table. I tried one example: gist link

v0dro commented 7 years ago

pivot_table gem looks too inflexible and heavy to me. Moreover there will be the overhead of copying data from daru to the other gem. Maybe you can port some of the code from there to daru?

zverok commented 7 years ago

I also don't think we need any external gem for building pivot tables. We have everything you can need for this in Daru already.

I believe that writing several dataframes to several sheets IS a usecase, yet @athityakumar's proposal (with atomic writes, each "write this DF to that file, N sheet" is separate op) is good enough for starters. As far as I know, XLSX files opening/closing doesn't have a lot of overhead, so from "just dataframe"s point of view atomic writes is more logical.

Though, as a far future idea, support for some sort of "transactions" for writing could be useful, like

But it is "next level", anyways.

Shekharrajak commented 7 years ago

Thanks for suggestions. I think we must extend the DataFrame#pivot_table features accordingly.

athityakumar commented 7 years ago

Submitted PR https://github.com/athityakumar/daru-io/pull/28 for XLSX Importer. Please review. 😄

zverok commented 6 years ago

Implemented in daru-io