benmarwick / wordcountaddin

Word counts and readability statistics in R markdown documents

Em dashes hide words from both word counts (koRpus and stringi) #11

Closed jimvine closed 5 years ago

jimvine commented 6 years ago

I was getting unexpectedly low word counts on my document. After some investigation, it seems that if there are two or more em dashes (which I enter using three short dashes "---"), any words between the first and last occurrences in the document are not counted. I suspect that this might be because anything within them is excluded as if it were YAML.

Example:

Lorem ipsum --- sit amet. Fusce scelerisque, augue eu tempus imperdiet, quam lectus bibendum sapien, vel eleifend ante neque ac metus. Nam feugiat sem velit, sed semper lorem faucibus eu. Nunc dignissim vitae lectus eu convallis.

Nunc quam lorem, cursus vel augue convallis, consequat luctus neque. Mauris eleifend ligula rutrum, varius diam a, finibus ipsum. Praesent consectetur massa quis accumsan mattis. Suspendisse pulvinar est eu luctus ultrices. Integer libero odio, dictum sed leo et, scelerisque posuere dolor. Suspendisse fringilla lectus risus, id egestas elit tincidunt quis.

Sed condimentum mollis scelerisque. Aliquam erat volutpat. Phasellus consequat ultrices diam. Suspendisse potenti. Ut ac tortor quis libero feugiat condimentum. Sed vitae libero ipsum. Phasellus lobortis lobortis vulputate. Phasellus massa nulla, consectetur eget posuere sed, sagittis sit amet neque. Nullam in lacus --- libero.

Gives this:

|Method          |koRpus    |stringi       |
|:---------------|:---------|:-------------|
|Word count      |3         |3             |
|Character count |20        |19            |
|Sentence count  |1         |Not available |
|Reading time    |0 minutes |0 minutes     |

It gets worse if I select the whole document before running the addin (i.e., including the YAML header):

---
title: "Untitled"
author: "Author"
date: "2 August 2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Lorem ipsum --- sit amet. Fusce scelerisque, augue eu tempus imperdiet, quam lectus bibendum sapien, vel eleifend ante neque ac metus. Nam feugiat sem velit, sed semper lorem faucibus eu. Nunc dignissim vitae lectus eu convallis.

Nunc quam lorem, cursus vel augue convallis, consequat luctus neque. Mauris eleifend ligula rutrum, varius diam a, finibus ipsum. Praesent consectetur massa quis accumsan mattis. Suspendisse pulvinar est eu luctus ultrices. Integer libero odio, dictum sed leo et, scelerisque posuere dolor. Suspendisse fringilla lectus risus, id egestas elit tincidunt quis.

Sed condimentum mollis scelerisque. Aliquam erat volutpat. Phasellus consequat ultrices diam. Suspendisse potenti. Ut ac tortor quis libero feugiat condimentum. Sed vitae libero ipsum. Phasellus lobortis lobortis vulputate. Phasellus massa nulla, consectetur eget posuere sed, sagittis sit amet neque. Nullam in lacus --- libero.

This gives:

|Method          |koRpus    |stringi       |
|:---------------|:---------|:-------------|
|Word count      |1         |1             |
|Character count |8         |7             |
|Sentence count  |1         |Not available |
|Reading time    |0 minutes |0 minutes     |

So it looks like they are just picking up the last "libero" after the final em dash.

benmarwick commented 6 years ago

Thanks for documenting that so thoroughly! Looks like I need to make the YAML-excluding function a bit more specific. Currently I just have gsub("---.*--- ", "", text), which is obviously a problem if we have three dashes anywhere else in the document.
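For anyone following along, the effect of that greedy pattern can be reproduced with a minimal base R sketch. The `text` string here is a made-up, shortened stand-in for the selected text, and the snippet does not reproduce how the add-in reads or preprocesses its input.

```r
# Hypothetical stand-in for selected text: body text with two em dashes
# typed as "---".
text <- "Lorem ipsum --- sit amet. Fusce scelerisque. Nullam in lacus --- libero."

# The pattern quoted above. ".*" is greedy, so the match runs from the
# first "---" to the last "--- " and removes everything in between.
stripped <- gsub("---.*--- ", "", text)

print(stripped)
# [1] "Lorem ipsum libero."
```

Only three words survive, which lines up with the word count of 3 in the first table of the report.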

I will think a bit about how to change this to be more specific to YAML so it won't notice dashes in the text. If you have any suggestions, please let me know!

jimvine commented 6 years ago

Whenever I've seen YAML blocks they always seem to have the three dashes on lines on their own, so that might be the trick. Perhaps this regex I found in a Gist might provide some hints:

(?s)^(---)$.+?^(---)$.+?(?=^---$)

https://gist.github.com/arthurattwell/4219720913b8c0066e65eb300fc31790#copy-paste-to-split-book-into-separate-chapter-files

I'm not an expert on regex, but reading the explanation of it, I think you might just need to have the first few bits of it:

(?s)^(---)$.+?^(---)$
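As a rough sketch of how that shortened pattern might be tried in R: with `perl = TRUE`, `^` and `$` only match at line boundaries when the multiline flag `(?m)` is set, so the version below adds it to the `(?s)` flag. The `doc` string and the `strip_yaml` helper name are illustrative assumptions, not part of wordcountaddin.

```r
# Illustrative document: a YAML header followed by body text that
# contains em dashes typed as "---".
doc <- paste(
  "---",
  "title: \"Untitled\"",
  "output: html_document",
  "---",
  "",
  "Lorem ipsum --- sit amet. Nullam in lacus --- libero.",
  sep = "\n"
)

# Hypothetical helper: remove the first block delimited by lines that
# consist only of three hyphens. (?s) lets "." span newlines; (?m) makes
# "^" and "$" match at the start and end of each line.
strip_yaml <- function(text) {
  sub("(?sm)^---$.+?^---$", "", text, perl = TRUE)
}

body <- trimws(strip_yaml(doc))
print(body)
```

Because the hyphens must fill a whole line, the inline "---" in the body text is left alone. This uses `sub()` rather than `gsub()` so that only the first such block is removed.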

The Pandoc manual says:

A YAML metadata block is a valid YAML object, delimited by a line of three hyphens (---) at the top and a line of three hyphens (---) or three dots (...) at the bottom. A YAML metadata block may occur anywhere in the document, but if it is not at the beginning, it must be preceded by a blank line. https://pandoc.org/MANUAL.html#extension-yaml_metadata_block

So perhaps technically your regex ought to be able to find three dots closing a YAML block as well as three dashes, though I suspect that's pretty uncommon to find in Rmarkdown documents in the wild.
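Following the Pandoc wording quoted above, a variant that also accepts three dots as the closing delimiter could look something like this. It is a sketch only: `strip_yaml_block` is a hypothetical name, and it handles only a block at the very start of the text, which is narrower than what Pandoc permits.

```r
# Hypothetical helper: drop a leading metadata block that opens with a
# "---" line and closes with either a "---" line or a "..." line.
# \\A anchors the match to the very start of the text.
strip_yaml_block <- function(text) {
  sub("(?sm)\\A---$.+?^(?:---|\\.\\.\\.)$", "", text, perl = TRUE)
}

# Made-up inputs: one header closed with dots, one closed with hyphens.
dots_doc <- "---\ntitle: Untitled\n...\n\nLorem ipsum --- sit amet --- libero."
dash_doc <- "---\ntitle: Untitled\n---\n\nLorem ipsum --- sit amet --- libero."

print(trimws(strip_yaml_block(dots_doc)))
print(trimws(strip_yaml_block(dash_doc)))
```

In both cases the body text, including its inline em dashes, should be left intact.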

benmarwick commented 5 years ago

I think I might have dealt with this in #28