SciRuby / daru

Data Analysis in RUby
BSD 2-Clause "Simplified" License

Adds support to from_html module #311

Closed athityakumar closed 7 years ago

athityakumar commented 7 years ago

Fixes issue #219

athityakumar commented 7 years ago

@zverok @v0dro: Please review this PR. Does any of the code need refining, or are there other options that should be accepted as arguments?

athityakumar commented 7 years ago

I was adding another test html file from W3Schools, which is successfully being parsed. However, the strings are truncated when displayed as a Daru::Vector or a Daru::DataFrame, which is not desirable. I can see that dv.data.to_a does contain the full scraped text, yet I can't find any truncation in the summary / inspect functions. Any help would be appreciated. :smile:


zverok commented 7 years ago

> I was adding another test html file from W3Schools, which is successfully being parsed.

What's the value of it? Daru is a data processing library, not a school-quality HTML processor. I'd prefer to see some examples of real data, published by actual organizations and useful for analysis.

> However, the strings are truncated

Yes, that's the logic of the dataframe's inspect, which allows reasonable output for any kind of data. It is the desired behavior. The spacing parameter here is the explanation.
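To make the truncation under discussion concrete, here is a minimal sketch in plain Ruby. It is not daru's actual formatter: the helper name `truncate_cell` and the default width of 10 are assumptions made for this illustration. It only shows the general idea that the display string is shortened to a fixed width while the underlying data stays intact.

```ruby
# Hypothetical helper (not part of daru): shorten a cell for display only.
def truncate_cell(value, spacing = 10)
  text = value.to_s
  return text if text.size <= spacing

  # Leave room for the ellipsis so the result is exactly `spacing` characters.
  text[0, spacing - 3] + '...'
end

data  = ['Alfreds Futterkiste', 'Germany']
shown = data.map { |cell| truncate_cell(cell) }

puts shown.inspect # the long string is shortened, the short one is untouched
puts data.first    # the stored value is still the full text
```

Passing a larger width, for example `truncate_cell(cell, 30)`, shows the full strings in this sketch, which is the same kind of knob the spacing parameter is described as providing.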

v0dro commented 7 years ago

@athityakumar can you rebase with the current master?

athityakumar commented 7 years ago

@v0dro - Sure, but this PR is still in progress: there are more tests to be added from here, and more options to offer the user, like tuples, thousands, etc.

On a side note, did you mean to approve the changes on PR #303?

v0dro commented 7 years ago

No, I meant that the current progress on this PR is up to the mark. Rebasing onto the current master now will save you from spending additional time on changes later.

athityakumar commented 7 years ago

@v0dro @zverok - I've rebased onto master and added an index-detection feature too. It's review time now :smile:

athityakumar commented 7 years ago

@zverok @v0dro - I've updated the code, documentation and tests for the Daru::DataFrame#from_html module. Please review the changes. 😄

Edit: I see that the fixture HTML files added in this PR have increased the repository's size from ~3.4MB to ~4MB. Should I truncate these HTML files, keeping just relevant content like tables?

athityakumar commented 7 years ago

@zverok - I've made the spec changes. One more thing before merging. I see that the fixture HTML files added in this PR have increased the repository's size from ~3.4MB to ~4MB. Should I truncate these HTML files, keeping just relevant content like table tags?
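For reference, the kind of trimming asked about above could be sketched roughly as follows. This is only an illustration, not code from the PR: `keep_tables` is a hypothetical helper, and the regular expression is used purely to keep the example self-contained. It will mishandle nested tables, so a real fixture would be safer to trim with an HTML parser.

```ruby
# Hypothetical helper: keep only the <table>...</table> blocks of an HTML string.
# Assumption: tables are not nested; a regex cannot handle that case reliably.
def keep_tables(html)
  tables = html.scan(/<table\b.*?<\/table>/mi)
  "<html><body>\n#{tables.join("\n")}\n</body></html>\n"
end

page = <<~HTML
  <html><head><script>var tracking = 1;</script></head>
  <body>
    <p>Navigation and adverts</p>
    <table id="t1"><tr><td>1</td></tr></table>
    <div>Footer</div>
  </body></html>
HTML

trimmed = keep_tables(page)
puts trimmed
puts "#{page.bytesize} bytes -> #{trimmed.bytesize} bytes"
```

The trimmed file still contains everything a table parser needs, while scripts, navigation and other page chrome are dropped.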

athityakumar commented 7 years ago

Thanks. I'll soon port all the IO modules (including from_html) to daru-io. Once a 0.1.0 version of daru-io is released, we can probably remove the existing IO modules from daru. Please let me know whether that can go ahead after the porting is done.