
JasonKessler / scattertext

Beautiful visualizations of how language differs among document types.

Apache License 2.0

2.23k stars, 287 forks

Corpus size too large #110

Closed (yshi2016 closed 2 years ago)

yshi2016 commented 2 years ago

Hello, I used a dataset with more than 1 million rows, and the output file is 572 MB. The output looks like the screenshot below.

[screenshot of the output]

I am wondering if this is because the file is too large. Is there a method built into scattertext to accommodate the size, or should we try sampling a subset of the original data? Thank you!

JasonKessler commented 2 years ago

You can try setting the max_docs_per_category parameter of produce_scattertext_explorer to limit the number of documents stored per category. Plotting positions would still be calculated over the whole corpus, so only the size of the output file shrinks. Otherwise, your best bet is to downsample the data you use to create your corpus.
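
Both suggestions can be sketched roughly as follows. This is a minimal illustration, not code from the library: `downsample_per_category` and `explorer_html` are hypothetical helper names, the rows are assumed to be dictionaries with a category field, and the limits of 10 and 1000 are arbitrary. The only scattertext names used are `produce_scattertext_explorer` and its `max_docs_per_category` parameter mentioned above; the corpus itself is assumed to be built in the usual way beforehand.

```python
import random
from collections import defaultdict


def downsample_per_category(rows, category_key, max_per_category, seed=0):
    """Return at most `max_per_category` randomly chosen rows per category."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_category = defaultdict(list)
    for row in rows:
        by_category[row[category_key]].append(row)
    sampled = []
    for category_rows in by_category.values():
        # Small categories are kept whole; large ones are randomly thinned.
        if len(category_rows) > max_per_category:
            category_rows = rng.sample(category_rows, max_per_category)
        sampled.extend(category_rows)
    return sampled


def explorer_html(corpus, category, category_name, not_category_name, max_docs=1000):
    """Build the explorer HTML, storing at most `max_docs` documents per category."""
    import scattertext as st  # imported lazily; only this step needs the library
    return st.produce_scattertext_explorer(
        corpus,
        category=category,
        category_name=category_name,
        not_category_name=not_category_name,
        max_docs_per_category=max_docs,
    )


# Toy rows standing in for the real dataset (an assumption for illustration).
rows = [{"label": "a", "text": f"doc {i}"} for i in range(50)]
rows += [{"label": "b", "text": f"doc {i}"} for i in range(5)]
small = downsample_per_category(rows, "label", max_per_category=10)
```

The first helper reduces the data before the corpus is created, which cuts both processing time and output size. The second keeps the full corpus for the plot positions and only limits how many documents are embedded in the HTML.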