chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Paragraph Count in Textacy #301

Closed programmer-nlp closed 4 years ago

programmer-nlp commented 4 years ago

Context

I use ts.basic_counts from textacy to get readability count data for my content, but there is no feature to count paragraphs. I currently get a paragraph count in a roundabout way with NLTK, using the snippet of code below.

Current approach

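    import nltk
    from nltk.corpus.reader.util import read_line_block

    # Read i.txt from the current directory, using read_line_block to split it into blocks
    corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(
        ".", "i.txt", para_block_reader=read_line_block
    )
    para = len(corpusReader.paras())
    paragraphs = para - 1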

Is it possible to add a paragraph count under ts.basic_counts?

bdewilde commented 4 years ago

Hey @programmer-nlp, I looked into adding this a while back, but couldn't come up with a good, simple rule that would be true in most cases. The issue is that languages are pretty clear about what constitutes a sentence, but less so with paragraphs. Sometimes, a single \n will split paragraphs; other times, it's \n\n or \r\n or re.compile(r"\n{2,4}"). Depends on your text!

The most straightforward way is probably to use regex as I did above, with the pattern tailored as needed. Then you could do paragraph_count = sum(1 for match in re.finditer(pattern, text)). Does that make sense?
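For anyone landing here later, a minimal sketch of the regex approach, not part of textacy. The names `PARAGRAPH_BREAK` and `count_paragraphs` are hypothetical, and the default pattern (one or more blank lines) is only an assumption that should be tailored to the text at hand. Counting matches of a separator pattern gives the number of breaks, so this version splits on the pattern and counts the non-empty chunks instead.

```python
import re

# Hypothetical default: paragraphs are separated by at least one blank line.
# Accepts both \n and \r\n line endings; adjust for your own text.
PARAGRAPH_BREAK = re.compile(r"(?:\r?\n[ \t]*){2,}")


def count_paragraphs(text, pattern=PARAGRAPH_BREAK):
    """Count the non-empty chunks of `text` separated by `pattern`."""
    chunks = pattern.split(text)
    # Ignore empty or whitespace-only chunks, e.g. from trailing newlines.
    return sum(1 for chunk in chunks if chunk.strip())


sample = "First paragraph.\nStill first.\n\nSecond paragraph.\r\n\r\nThird.\n"

# Blank-line separated paragraphs.
print(count_paragraphs(sample))

# Same text, but treating every single line break as a paragraph break.
print(count_paragraphs(sample, re.compile(r"\r?\n")))
```

The same text gives different counts under the two patterns, which is the reason a single built-in rule is hard to pick.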

programmer-nlp commented 4 years ago

Yes. It makes sense.