SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License
1.03k stars 139 forks source link

To optimize the result in from_excel #516

Closed weqopy closed 4 years ago

weqopy commented 4 years ago

There are some files have a merged cell in row 1, so the old codes only return the first column. I put the fixed '0' to opts to custom the row I want.

eg: image I want to get the columns from A to K but it returns only the column A like this: image

I don't know what's wrong with this file at this moment because daru returns well when I created a similar file like this: image and it return this: image

So I optimized the source code and use this way to solve my problem:

irb(main):034:0> df = nil
=> nil
irb(main):035:0> 10.times.each do |i|
irb(main):036:1*   puts i
irb(main):037:1>   df = Daru::DataFrame.from_excel(path, {:row_id => i})
irb(main):038:1>   next if df.vectors.to_a.compact.size <= 7
irb(main):039:1>   df
irb(main):040:1>   break
irb(main):041:1> end
0
1
2
3
=> nil
irb(main):042:0> puts df
#<Daru::DataFrame(47x11)>
weqopy commented 4 years ago

Anyone can help about the failed checks? The commit is well used in my project and I can't understand the details in the Travis CI. @Shekharrajak Could you please help me about this?

Shekharrajak commented 4 years ago

Thanks for the PR. Can you please paste the proper code to reproduce the error and rerun the same from this branch PR.

Shekharrajak commented 4 years ago

Please go through this guide: https://github.com/SciRuby/daru/blob/master/CONTRIBUTING.md

weqopy commented 4 years ago

@Shekharrajak Thanks for your reply. I've edited the code.

Shekharrajak commented 4 years ago

Can you please add some testcases for these changes?

weqopy commented 4 years ago

All checks have passed but I still not sure if these testcases are OK.(In fact that I have never used testcases before...) Could you please have a review about them? @Shekharrajak

Also there is a situation confused me. The source spec/fixtures/test_xls.xls return Integer nums: image

But after I copied and edited the new file spec/fixtures/test_xls_2.xls, it returns Float nums: image

What I did is add a new row in the beginning and save it image

So I have to change the Integer nums to Float manually to through the expect in new file. image

I don't know if this is the right way...

Shekharrajak commented 4 years ago

All checks have passed but I still not sure if these testcases are OK.(In fact that I have never used testcases before...) Could you please have a review about them? @Shekharrajak

Also there is a situation confused me. The source spec/fixtures/test_xls.xls return Integer nums: image

But after I copied and edited the new file spec/fixtures/test_xls_2.xls, it returns Float nums: image

What I did is add a new row in the beginning and save it image

I didn't understand why you added new row.

So I have to change the Integer nums to Float manually to through the expect in new file. image

I don't know if this is the right way...

If previous testcases are failing or you are editing it then it is not good.

weqopy commented 4 years ago

I have modified the testcases according to your opinion. Please have a new review. @Shekharrajak

Shekharrajak commented 4 years ago

To be honest I didn't understand the motive behind this PR. It is neither an enhancement nor the bug fix.

If you can paste the PR description code with output and then same code with this PR changes. Then I may get it clear idea about it. Because testcases doesn't tell anything new.

weqopy commented 4 years ago

To be honest I didn't understand the motive behind this PR. It is neither an enhancement nor the bug fix.

If you can paste the PR description code with output and then same code with this PR changes. Then I may get it clear idea about it. Because testcases doesn't tell anything new.

Alright, let me reinterpret this PR. Now I have a file to read by daru and it's simple form is: image

In the first row, only the first cell is not empty; In the second row, only the first two cell is not empty; And the third row is the real title line I want. (The largest column data *x3 is what I want in reality.)

But what if I use the from_excel method in master branch? image

And that is because it use the row 0 to get the headers regularly. image

So this PR is trying to make it more flexible. Add an option to choice the row of headers as well as the worksheet. Here is what it returns after I use the PR's code: image

I'm sorry that the new testcase is confused and hope this could make you clear. @Shekharrajak

Shekharrajak commented 4 years ago

Thanks, @weqopy ! I understand the case now.

I think excel sheet header should be present in 1st row only. If we consider your implementation then the header (3rd row) should not be added in the dataframe rows, right? Also, I don't see first-row element ('not important') in the output dataframe.

Let me see if there are a better way and concrete case to have this feature.

P.S: You can directly paste the code within quote (under ```) and don't need to use screenshot images.

weqopy commented 4 years ago

If we consider your implementation then the header (3rd row) should not be added in the dataframe rows, right? Also, I don't see first-row element ('not important') in the output dataframe.

There is a same answer for them that the new header (3rd row) was replaced the default header (first-row). For example: If I edit the first row in the file like this: image The only difference between Daru::DataFrame.from_excel(path, {row_id: 0}) and Daru::DataFrame.from_excel(path, {row_id: 2}) is the vectors of the dataframe, and it is not important in my reality usage as they both return the max large data area I want by the from_excel method. image

And it's grateful that you can spend time trying to find a better solution. Thanks! @Shekharrajak