crsh / papaja

papaja (Preparing APA Journal Articles) is an R package that provides document formats to produce complete APA manuscripts from RMarkdown-files (PDF and Word documents) and helper functions that facilitate reporting statistics, tables, and plots.
https://frederikaust.com/papaja_man/

Word count #63

Open m-Py opened 8 years ago

m-Py commented 8 years ago

Hi,

is it currently possible to automatically count the number of words in the manuscript (excluding abstract and references)? The example manuscript states 'too lazy to count', which I take as a hint that this is not possible. Any other suggestions about how to do it, other than manual counting?

Best regards, Martin

crsh commented 8 years ago

Hi Martin,

the answer to this question depends a little on the target document type you are working with. For MS Word documents I'd recommend simply generating the document and then entering the word count by hand using Word's word count feature. If you are going through LaTeX to produce a PDF I suggest you try TeXcount, which was probably installed along with your TeX distribution. It's a command line tool (but there is also a web interface, see the link) and has served me well. Let me know if this works for you.

Best regards, Frederik

crsh commented 8 years ago

Thinking a little more about this, I forgot to mention that, because of the way pandoc manages references, TeXcount will by default count your references against the word limit. So you may need to remove the reference section manually from the TeX source file before calling TeXcount.
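
The manual removal step could be scripted roughly as follows. This is only a sketch in base R, not something papaja or TeXcount provides: `strip_references()` and its `heading_pattern` argument are hypothetical names, and the pattern assumes the reference section starts at the first line containing `\hypertarget{references}` or `\section{References}`, which may not match your generated TeX file.

```r
# Drop everything from the reference heading onward in a vector of TeX lines.
# ASSUMPTION: the reference section begins at the first line that contains
# \hypertarget{references} or \section{References} (or \section*{References}).
# Adjust `heading_pattern` if your .tex file marks the section differently.
strip_references <- function(
    tex_lines,
    heading_pattern = "\\\\(hypertarget\\{references\\}|section\\*?\\{References\\})") {
  start <- grep(heading_pattern, tex_lines)
  if (length(start) == 0) {
    return(tex_lines)  # no reference heading found; leave the input unchanged
  }
  dropped <- tex_lines[start[1]:length(tex_lines)]
  kept <- tex_lines[seq_len(start[1] - 1)]
  # If the dropped part contained \end{document}, close the document again
  if (any(grepl("\\\\end\\{document\\}", dropped))) {
    kept <- c(kept, "\\end{document}")
  }
  kept
}

# Toy input standing in for the lines of a generated .tex file
tex <- c(
  "\\section{Discussion}",
  "Some closing text.",
  "\\section{References}",
  "Aust, F. (2016).",
  "\\end{document}"
)
body <- strip_references(tex)
# writeLines(body, "manuscript_body.tex")  # hypothetical file name; run TeXcount on it
```

On a real file you would read the lines with `readLines()` first and check by eye that the cut happened in the right place.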

m-Py commented 8 years ago

Hi Frederik,

thanks a lot for your response! I will look into TeXcount. I am converting to PDF, but I'll probably need to use MS Word as well as soon as I need to collaborate with others. Removing the reference section for the word count is not a problem.

Best regards,
Martin

crsh commented 8 years ago

There's a new kid on the block: Try wordcountaddin. This is an R solution that works on the RMarkdown file directly. I haven't played with it and I'm not sure how reliable it is; e.g., I don't know if it counts text in tables etc. If you give it a try, I'd be interested in your experience.

m-Py commented 8 years ago

Ah, that seems nice. It says it is an RStudio addin, but you can also use an internal function to process any R strings without using RStudio. In a first test this worked, for example r chunks were not counted. It should be able to read whole Rmd text files, too. Thanks for that catch! I will report as soon as I have done more testing with Rmd files.

m-Py commented 8 years ago

I played around a little and was able to write a function that reads an Rmd file and then uses the word count function of wordcountaddin to count all words in this file. Indeed, the r chunks, inline r and yaml headers are ignored. Inline LaTeX commands are not ignored, but maybe this will be implemented at some point.

The two word count methods that are offered (koRpus & stringi) differ rather strongly in my case, but koRpus yields reasonable results. The koRpus estimate is close to what I get when I import the text into Libre Office and just remove all r chunks and the yaml header (stringi is actually very far off).

I will use this function from now on :-)
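
For readers who want to see the general idea without installing anything, here is a rough base-R sketch of that kind of preprocessing. It is not the code from my function and not how wordcountaddin counts internally: `count_rmd_words()` is a hypothetical name, and the rules (YAML header between the first two `---` lines, chunks delimited by lines starting with three backticks, words split on whitespace) are simplifying assumptions. Like the behaviour described above, it does not strip inline LaTeX commands.

```r
# Count words in the lines of an Rmd file, ignoring the YAML header,
# fenced code chunks, and inline R code. Simplified illustration only.
count_rmd_words <- function(rmd_lines) {
  # 1. Drop the YAML header: everything up to the second "---" line,
  #    provided the document starts with "---".
  fences <- grep("^---[[:space:]]*$", rmd_lines)
  if (length(fences) >= 2 && fences[1] == 1) {
    rmd_lines <- rmd_lines[-seq_len(fences[2])]
  }
  # 2. Drop fenced chunks: a line starting with ``` toggles "inside a chunk".
  is_fence <- grepl("^[[:space:]]*```", rmd_lines)
  in_chunk <- cumsum(is_fence) %% 2 == 1
  rmd_lines <- rmd_lines[!(is_fence | in_chunk)]
  # 3. Blank out inline R code such as `r 2 + 2`.
  text <- gsub("`r [^`]*`", " ", rmd_lines)
  # 4. Count whitespace-separated tokens containing a letter or digit.
  tokens <- unlist(strsplit(text, "[[:space:]]+"))
  sum(grepl("[[:alnum:]]", tokens))
}

example_rmd <- c(
  "---",
  "title: \"My manuscript\"",
  "---",
  "",
  "We tested `r 2 + 2` participants.",
  "",
  "```{r}",
  "summary(cars)",
  "```",
  "",
  "Results were clear."
)
count_rmd_words(example_rmd)  # 6: three words in each of the two prose sentences
```

For an actual manuscript you would pass `readLines("my_file.Rmd")`; expect the result to differ from koRpus or stringi, since the tokenisation here is much cruder.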

crsh commented 8 years ago

That's good to know. Thanks for providing the feedback. I had noticed rather large discrepancies between the word count methods, too, but I hadn't yet cross-validated them. Thanks for that. I'd be interested in the function you wrote because I've been wanting to automate word counting in papaja. Would you be willing to share your function, e.g., in a gist?

m-Py commented 8 years ago

Sure, gladly, here is the gist:

https://gist.github.com/m-Py/faf679a0a0be3dbafa2b43b390519923

I crossvalidated the function with the RStudio addin - the results are the same (at least they were for me, you might double check that ;) ).

benmarwick commented 8 years ago

@m-Py that's good to know about the accuracy of the two methods (I'm the author of the wordcountaddin), thanks for sharing your test results. I might drop the stringi method from the addin.

The addin will count text that is present in markdown tables in the Rmd file before the file is knit, but excluding it is on my list of things to do. It won't count tables generated by R code that only appear in the rendered document.

m-Py commented 8 years ago

@benmarwick glad I could help. The difference between the two estimates was really rather large, koRpus yielded ~ 10,000 words and stringi only estimated ~ 6,000 words. Thank you for creating this nice package, it is really of great use to me.

If you intend to include a function in your package that counts the words without necessarily using RStudio, feel free to just use the code in my gist above. I tried some code to process the text of an Rmd file so that it is apparently formatted as a text selection in RStudio, so this should work.

ebergelson commented 6 years ago

would be supercool if this could be integrated into the wordcount header in a .rmd manuscript!

crsh commented 5 years ago

Note to self: I just found a nice example of a Lua filter that counts words, which may just do the trick with some adaptations.

benmarwick commented 5 years ago

Also, a new function in wordcountaddin that might be useful here: wordcountaddin::word_count("my_file.Rmd")

This returns a single integer, so it might be handy for using in headers, etc.

crsh commented 5 years ago

Thanks for the pointer, I'll also try that and report on how they compare.

Rekyt commented 5 years ago

Hi @crsh, thank you so much for building papaja, it has amazing defaults. I've found a simple workaround: putting an inline R expression in the wordcount field:

"`r wordcountaddin::word_count('estimating_richness_sdm.Rmd')`"

minimal_reprex available here: https://gist.github.com/Rekyt/9ebda737eb7d818fdfe7981b79549a7f

crsh commented 5 years ago

I just pushed a commit that adds a first draft of the Lua-filter that counts words on the intermediate AST after citations have been rendered by pandoc-citeproc (devtools::install_github("crsh/papaja@devel"); it's based on two other Lua-filters). The filter reports the word count in the console or the R Markdown tab in RStudio.

I have compared the output for the example document in this repository to several other common approaches. This document is probably a tough one, because it contains code, verbatim output, URLs in references etc.


Lua-filter

1749 words in text body
322 words in reference section

The word count for the text body does not include tables or images (or their captions) or the reference section.

wordcountaddin

> wordcountaddin::word_count('example/example.Rmd')
[1] 1407

The substantial deviation here is probably largely due to the not-yet-rendered citations, of which there are several in this document.

texcount

I pasted the LaTeX code into the texcount web interface. It reported the following counts for the text body:

Words in text: 944
Words in headers: 31
Words outside text (captions, etc.): 58
Number of headers: 8
Number of floats/tables/figures: 4
Number of math inlines: 16

and

Words in text: 400
Words in headers: 1
Words outside text (captions, etc.): 0
Number of headers: 0
Number of floats/tables/figures: 0
Number of math inlines: 0

for the reference section. The output also noted several errors related to the code and verbatim output. I think those errors may have caused texcount to ignore some bits and are probably the reason for the low word count of the text body.

wordcounter.net

Copy-pasting the text from the Word document (without tables and figures) yielded the following counts:

1713 for the text body
324 for the reference section

Pages

Similarly, the Pages count (again without tables and figures) yielded

1728 for the text body
429 for the reference section


Overall I'm fairly happy with the performance of the Lua-filter. Word counting is a tricky business and none of the above methods agree. The wordcountaddin and texcount (appear to) have technical limitations with this document; wordcounter.net and Pages are in the same ballpark as the Lua-filter. I'm sure the filter can be improved (and I'll gladly take any suggestion) but I think in its current form it is a decent solution.
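
To put the text-body counts reported above on one scale, here is a small sketch that expresses each one as a percentage difference from the Lua-filter count. The numbers are the ones quoted in this comment; the variable names are only for illustration, and for texcount I use the "Words in text" figure alone, which is a choice on my part rather than the only sensible one.

```r
# Text-body word counts as reported in this comment
body_counts <- c(
  "Lua-filter"      = 1749,
  "wordcountaddin"  = 1407,
  "texcount"        = 944,   # "Words in text" only, headers and captions excluded
  "wordcounter.net" = 1713,
  "Pages"           = 1728
)

# Percentage difference of each method relative to the Lua-filter count
reference <- body_counts[["Lua-filter"]]
pct_diff <- round(100 * (body_counts - reference) / reference, 1)
pct_diff
```

This gives about -19.6 for wordcountaddin, -46.0 for texcount, -2.1 for wordcounter.net, and -1.2 for Pages, which is what I mean by the last two being in the same ballpark as the filter.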

tdienlin commented 5 years ago

Hi Frederik, I'm sorry but I couldn't really figure out how to actually run/implement the Lua-filter -- could you maybe give a brief example? And do I understand correctly that it is not possible to include the count directly in the YAML header? But might it be possible to run it in a code chunk, save the result in the cache and load it that way in the YAML header? (Hence, something like `r knitr::load_cache(label = "count-words", object = "n_words")`?) Thanks so much!

crsh commented 5 years ago

It's currently not possible to automatically include it, but I plan to look into ways to do this. The filter cannot be called in a code chunk because it is executed after all R code has been run and pandoc-citeproc has been applied.

If you are using the current development version of papaja (devtools::install_github("crsh/papaja@devel")), the filter should be automatically applied. The word count filter reports the word counts in the console or the R Markdown tab in RStudio, respectively, e.g.,

285 words in text body
23 words in reference section

tdienlin commented 5 years ago

Ah, I see, now I understand. Works perfectly. Thanks for the quick reply!

jooyoungseo commented 4 years ago

This is really awesome, @crsh!

Would you mind adding this word_count functionality to revision_letter output as well?

I know we can manually add some pandoc_args to the YAML for that; however, it would be better if it were provided by default, as in apa6_pdf.

crsh commented 4 years ago

Sounds like a reasonable request. I'm a little swamped at the moment. If you'd like to try tackling this, I'd be more than happy to review a PR.

schneiderpy commented 11 months ago

Is this wordcount "problem" solved? I am using the template, but the wordcount doesn't work (with all default settings in the YAML):

keywords : "Public policy, Crime, Paraguay, Bayesian statistics"
wordcount : "X"
bibliography : "bibliography.bib"
floatsintext : no
linenumbers : yes
draft : no
mask : no
figurelist : no
tablelist : no
footnotelist : no
classoption : "man"
output : papaja::apa6_word
editor_options:
  markdown:
    wrap: 72