chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

pandas dataframe to corpus #262

Closed gryBox closed 5 years ago

gryBox commented 5 years ago

context

I often find myself toggling back and forth between a pandas dataframe and a textacy corpus, depending on what stage of the pipeline the data is in: pandas when analyzing metadata, textacy when working with text. Although the solution is only three lines of code, I do find myself wanting to write only one line. In addition, the example for textacy.io.split_records is not in this version of the documentation (?), which can make re-finding the solution somewhere in a project time-consuming. Thanks in advance. textacy has come a long way. Nice job.

proposed solution

The solution builds upon existing textacy functionality.

textacy.io.read_df

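import textacy

# df to corpus
def read_df(df, text_clmn, lang="en"):
    # one dict per row, keyed by column name
    records = df.to_dict(orient="records")
    # pair each row's text with its remaining columns as metadata
    records_splt = textacy.io.split_records(records, text_clmn, itemwise=True)
    corpus = textacy.Corpus(lang=lang, data=records_splt)

    return corpus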
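
For anyone re-finding this later, here is a minimal, dependency-free sketch of what the split step is understood to do when itemwise=True: pull the text field out of each record and keep the remaining fields as that record's metadata. The helper name split_text_and_metadata and the sample rows are made up for illustration; this is not textacy's implementation, only an approximation of the behavior relied on above.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


def split_text_and_metadata(
    records: Iterable[Dict[str, Any]], text_field: str
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield one (text, metadata) pair per record (hypothetical helper)."""
    for record in records:
        # Copy so the caller's dicts are not mutated.
        metadata = dict(record)
        text = metadata.pop(text_field)
        yield text, metadata


# Rows shaped like the output of df.to_dict(orient="records"); sample data only.
records = [
    {"speaker": "A", "year": 2001, "text": "First speech."},
    {"speaker": "B", "year": 2005, "text": "Second speech."},
]
pairs = list(split_text_and_metadata(records, "text"))
```

Each pair keeps a document's text next to its own metadata, which is the shape passed to the Corpus as data in the proposal above.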

alternative solutions?

The alternative is to rewrite the same repetitive lines of code throughout the pipeline, or to use nlp.pipe from spaCy.

bdewilde commented 5 years ago

Hey @gryBox, I don't want a super-heavy, actively-changing dependency like pandas in textacy (the installation and maintenance headaches aren't worth the minor conveniences), but there's nothing wrong with your solution. Is there specific functionality re: metadata that you'd like to see included in textacy? I appreciate that the get/remove methods are a bit hands-on for users.

Btw, here are the docs you're looking for: https://chartbeat-labs.github.io/textacy/api_reference/io.html#textacy.io.utils.split_records

gryBox commented 5 years ago

@bdewilde I totally understand your position. What I'd like to see at the Corpus level is something similar to pandas describe, i.e. min and max for readability and basic_counts. Obviously this needs more fleshing out. Let me know what you think and I can develop it further.
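
To make the idea a bit more concrete, here is a rough sketch of a describe-like summary, assuming each document's counts are already available as a plain dict. The function name describe_stats, the stat names n_words and n_sents, and the numbers are placeholders for illustration; nothing here is an existing textacy API.

```python
from statistics import mean
from typing import Dict, Iterable, List


def describe_stats(
    doc_stats: Iterable[Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Summarize per-document stats across a corpus (hypothetical helper)."""
    columns: Dict[str, List[float]] = {}
    for stats in doc_stats:
        # Group values by stat name, one list per "column".
        for name, value in stats.items():
            columns.setdefault(name, []).append(value)
    return {
        name: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for name, values in columns.items()
    }


# Made-up per-document counts, one dict per document.
doc_stats = [
    {"n_words": 120, "n_sents": 6},
    {"n_words": 80, "n_sents": 4},
    {"n_words": 100, "n_sents": 5},
]
summary = describe_stats(doc_stats)
```

Readability scores could go through the same function, since it only needs a name and a number per document.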

here are the docs you're looking for

Yes. You used to have an example in the quick start. I found it useful:

>>> cw = textacy.datasets.CapitolWords()
>>> cw.download()
>>> records = cw.records(speaker_name={'Hillary Clinton', 'Barack Obama'})
>>> text_stream, metadata_stream = textacy.fileio.split_record_fields(
... records, 'text')
>>> corpus = textacy.Corpus('en', texts=text_stream, metadatas=metadata_stream)
>>> corpus
Corpus(1241 docs; 857058 tok

Closing. We can start another thread on corpus stats if you want to discuss it further.