athityakumar closed 7 years ago
@zverok @v0dro: Please review this PR. Is there anything in the code quality to refine, or any new options that should be accepted as arguments?
I was adding another test HTML file from W3Schools, which is successfully being parsed. However, the strings are truncated when displayed as a Daru::Vector or a Daru::DataFrame, which is not desirable. I can see that dv.data.to_a does contain the full scraped text. Yet I can't find any truncation in the summary / inspect functions anywhere. Any help would be appreciated. :smile:
I was adding another test html file from W3Schools, which is successfully being parsed.
What's the value of it? Daru is a data processing library, not a school-quality HTML processor. I'd prefer to see some examples of real data, published by some organizations and useful for analysis.
However, the strings are truncated
Yes, that's the logic of the dataframe's inspect, allowing reasonable output on any kind of data. It is desirable behavior. The spacing parameter here is an explanation.
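To make the point about display-only truncation concrete, here is a minimal sketch in plain Ruby. It does not use Daru and is not Daru's actual implementation: `truncate_cell` is a hypothetical helper, and the ellipsis marker and the default width of 20 are assumptions chosen for illustration. It only shows the general idea that an inspect-style printer can cap each cell at a `spacing` width while the stored values stay intact.

```ruby
# Hypothetical helper, NOT Daru's real code: caps a displayed cell at
# `spacing` characters, the way an inspect-style printer might.
def truncate_cell(value, spacing = 20)
  text = value.to_s
  return text if text.length <= spacing

  # Keep the first (spacing - 3) characters and mark the cut with "...".
  "#{text[0, spacing - 3]}..."
end

# Stand-in for the scraped values held by a vector.
data = ['short', 'a much longer scraped string from an HTML table cell']

# Only the displayed copies are shortened.
shown = data.map { |cell| truncate_cell(cell, 20) }
puts shown

# The underlying data still holds the full text.
puts data.last
```

Under these assumptions, the truncation lives entirely in the display step, which would be consistent with dv.data.to_a still returning the full scraped text.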
@athityakumar can you rebase with the current master?
@v0dro - Sure, but this PR is still in progress. There are more tests to be added from here, and more options to provide to the user, like tuples, thousands, etc.
On a side note, did you mean to approve the changes on PR #303?
No, I meant that the current progress on this PR is up to the mark. Rebasing on the current master now will save you from spending additional time on changes later.
@v0dro @zverok - I've rebased onto master, and added an index-detection feature too. It's review time now :smile:
@zverok @v0dro - I've updated the code, documentation and tests for the Daru::DataFrame#from_html module. Please review the changes. 😄
Edit: I see that the fixture HTML files added in this PR have increased the repository's size from ~3.4MB to ~4MB. Should I truncate these HTML files, keeping just relevant content like tables?
@zverok - I've made the spec changes. One more thing before merging. I see that the fixture HTML files added in this PR have increased the repository's size from ~3.4MB to ~4MB. Should I truncate these HTML files, keeping just relevant content like table tags?
Thanks. I'll soon port all the IO modules (including from_html) to daru-io. If a 0.1.0 version of daru-io is released, then we can probably remove the existing IO modules from daru. Please let me know if this can be done once the porting is complete.
Fixes issue #219