Closed gryBox closed 5 years ago
Hey @gryBox, I don't want a super-heavy, actively-changing dependency like `pandas` in `textacy` (the installation and maintenance headaches aren't worth the minor conveniences), but there's nothing wrong with your solution. Is there specific functionality re: metadata that you'd like to see included in `textacy`? I appreciate that the get/remove methods are a bit hands-on for users.
Btw, here are the docs you're looking for: https://chartbeat-labs.github.io/textacy/api_reference/io.html#textacy.io.utils.split_records
@bdewilde I totally understand your position. What I'd like to see at the `Corpus` level is something similar to pandas `describe`, i.e. min, max, and so on for `readability` and `basic_counts`. Obviously this needs more fleshing out. Let me know what you think and I can develop it further.
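To make the request concrete, here is a minimal sketch of what a `describe`-style summary over per-document stats could look like, using only the standard library. The helper names `describe` and `describe_corpus` and the sample numbers are hypothetical, not part of `textacy`; the input is assumed to be one dict of numeric stats per document.

```python
from statistics import mean


def describe(values):
    """Return count, min, mean, and max for an iterable of numbers."""
    values = list(values)
    if not values:
        raise ValueError("describe() needs at least one value")
    return {
        "count": len(values),
        "min": min(values),
        "mean": mean(values),
        "max": max(values),
    }


def describe_corpus(doc_stats):
    """Summarize each stat across documents (hypothetical helper).

    `doc_stats` is assumed to be a non-empty list of dicts that all
    share the same keys, one dict per document.
    """
    keys = doc_stats[0].keys()
    return {key: describe(stats[key] for stats in doc_stats) for key in keys}


# Made-up per-document counts, standing in for whatever a corpus would report.
doc_stats = [
    {"n_words": 120, "n_sents": 8},
    {"n_words": 45, "n_sents": 3},
    {"n_words": 300, "n_sents": 21},
]

summary = describe_corpus(doc_stats)
print(summary["n_words"])
```

Something along these lines would avoid the pandas dependency while still giving a one-call overview.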
> here are the docs you're looking for

Yes. You used to have an example in the quick start. I found it useful:
>>> cw = textacy.datasets.CapitolWords()
>>> cw.download()
>>> records = cw.records(speaker_name={'Hillary Clinton', 'Barack Obama'})
>>> text_stream, metadata_stream = textacy.fileio.split_record_fields(
... records, 'text')
>>> corpus = textacy.Corpus('en', texts=text_stream, metadatas=metadata_stream)
>>> corpus
Corpus(1241 docs; 857058 tok
Closing. We can start another thread on corpus stats if you want to discuss it further.
context
I often find myself toggling back and forth between a `pandas` dataframe and a `textacy` corpus, depending on what stage the data is at in the pipeline: `pandas` when analyzing metadata, `textacy` when working with text. Although the solution is only three lines of code, I do find myself wanting to write only one line. In addition, the example for `textacy.io.split_records` is not in this version of the documentation (?), which can make re-finding the solution somewhere in a project time-consuming.

Thanks in advance. `textacy` has come a long way. Nice job.

proposed solution
The solution builds upon existing `textacy` functionality.

alternative solutions?
The alternative is to rewrite the repetitive lines of code throughout the pipeline, or to use `nlp.pipe` from `spacy`.
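For anyone landing here later, the record-splitting step discussed in this thread can be sketched in plain Python. This is only an illustration of the idea, not `textacy`'s implementation: `split_text_and_metadata` is a hypothetical helper, and the records are made-up dicts with a `text` field plus arbitrary metadata.

```python
def split_text_and_metadata(records, text_field="text"):
    """Split dict records into parallel lists of texts and metadata dicts.

    Hypothetical helper: each record is assumed to be a dict containing
    `text_field`; every other key is treated as metadata.
    """
    texts, metadatas = [], []
    for record in records:
        texts.append(record[text_field])
        metadatas.append({k: v for k, v in record.items() if k != text_field})
    return texts, metadatas


# Made-up records standing in for rows of a dataset.
records = [
    {"text": "First speech.", "speaker_name": "A", "year": 2001},
    {"text": "Second speech.", "speaker_name": "B", "year": 2002},
]

texts, metadatas = split_text_and_metadata(records)
print(texts)
print(metadatas)
```

If the data starts out in a dataframe, `df.to_dict("records")` should yield dicts of this shape, so the helper could sit between the `pandas` and `textacy` stages of a pipeline.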