CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0

extract_dict and extract_regex_tok should return TokenSpanArray, not DataFrame #206

Open frreiss opened 3 years ago

frreiss commented 3 years ago

For legacy reasons, the functions extract_dict() and extract_regex_tok() in spanner/extract.py return single-column DataFrames. These functions should return TokenSpanArray objects instead. Users who want a DataFrame can construct one on top of the returned array.
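As a rough illustration of the last point, here is a minimal sketch of how a caller could wrap a returned array in a DataFrame and get the array back out again. It uses a plain pandas string array as a stand-in for a `TokenSpanArray`, so it runs without this library installed. The column name `"match"` and the variable names are assumptions made for the example, not part of the proposed API.

```python
import pandas as pd

# Stand-in for the TokenSpanArray that extract_dict() or extract_regex_tok()
# would return after this change. Any pandas extension array can be wrapped
# the same way; a real TokenSpanArray is not constructed here.
matches = pd.array(["Alice", "Bob"], dtype="string")

# Callers who still want a single-column DataFrame can build one on top of
# the array. The column name "match" is an assumption for illustration.
matches_df = pd.DataFrame({"match": matches})

# Downstream code that holds an old-style single-column DataFrame can
# recover the underlying extension array from that column.
recovered = matches_df["match"].array
```

The second half of the sketch is the pattern that notebook code relying on the old return type would likely need in reverse: wherever a DataFrame column was pulled out of the result, the returned array could be used directly.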

In addition to the testing code in test_extract.py, there is some downstream code in the notebooks that will need to be modified to deal with this API change.

lvntky commented 3 years ago

Hello @frreiss, I can fork the project and take a look at this issue if you haven't started on it yet.

frreiss commented 3 years ago

Thanks for your interest, @lvntky! We'd be happy to have you work on this issue. You may want to wait for the pull request https://github.com/CODAIT/text-extensions-for-pandas/pull/207, which contains other changes to spanner/extract.py, to be merged.

frreiss commented 3 years ago

Update: PR #207 is merged now; this issue should be unblocked.

lvntky commented 3 years ago

Sorry for the delay, @frreiss. I couldn't look at GitHub for the past two days because I was very busy at work. If there is anything I can help with, please let me know. I really like the project. Best wishes!

frreiss commented 3 years ago

@lvntky we welcome contributions of all sizes. We've prepared a list of small changes that would make a good first issue for new contributors. Here's a link: https://github.com/CODAIT/text-extensions-for-pandas/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22

lvntky commented 3 years ago

@frreiss thank you for your friendly approach, I really appreciate it. I'll go hunting through the issues in the repo :smile: