conig / revise

R package for writing revise and resubmits
Other
3 stars 2 forks source link

improved parsing function using recursion #15

Closed cjvanlissa closed 1 year ago

cjvanlissa commented 1 year ago

This closes #14 .

I am submitting a different parsing function that I believe should be robust to all current use cases. To make sure, I wrote unit tests using all the examples I could find in closed issues. This new function also accommodates multi-line sections. It does not use group matching anymore, which means you don't have to worry about weird uses of parentheses. Instead, it uses a recursive function to extract nested parenthesis groups.

My function is also faster:

 expr
 extract_md_sections2("[blabala]{#first} and also [dadada da da]{#second}")
 extract_md_sections4("[blabala]{#first} and also [dadada da da]{#second}")
    min     lq     mean  median      uq    max neval cld
 1108.8 1266.7 1374.344 1336.15 1427.20 2234.2   100   b
  282.7  334.2  474.947  376.05  426.35 8430.9   100  a 

(the bottom one is my new function).

If this new parsing function indeed makes a useful contribution, I hope you will consider accepting it, and perhaps consider adding me as a contributor. I'm happy to keep contributing in the future too, this work is very compatible with my work on reproducibility and I would love to promote it in presentations and courses. Here is some of my other reproducibility-oriented work: install.packages("worcs") and remotes::install_github("aaronpeikert/repro") All the best, Caspar

conig commented 1 year ago

Thank you very much for this contribution, I'd be thrilled to have you on board. Your work with {worcs} and {repro} looks valuable and I'm looking forward to trying them both out.

I really appreciate how careful you have been with ensuring your revised function supports closed tickets, and for writing unit tests (which I've regrettably not gotten around to doing myself).

I want to flag one possible negative side effect with supporting multi-line md tagging before I merge this in. While everything appears as desired in the revision letter, the anchor tags appear in the original manuscript. For example take this doc:

---
output            : pdf_document
---

# Span method

<span id = "spanmethod">

First line

Second line

</span>

# MD method

[

First line

Second line]{#mdmethod}

While the span method remains invisible, the md version leaves the anchor tags visible.

image

For this reason I have always used spans to handle multi-line sections rather than md anchors. Is this is a problem for your use case?

We might need to include a warning to users who might otherwise not notice the appearance of tags in their manuscripts. What do you think?

cjvanlissa commented 1 year ago

Thank you for your positive response! Interestingly, if I render your example here, the anchor tags do not appear in the PDF - but neither does the text between square brackets. And when I run my function on an actual manuscript with multi-line comments, everything works as expected (sections are in the manuscript without brackets, and they are parsed by read_manuscript).

I suspect that different rendering functions may have different behaviors for square brackets, something to look into. One very extreme solution is using a custom knit function, as I did with worcs::cite_essential. Ideally, that would be avoided though.

Regardless of the knitting of the text, I do think a recursive search is the most robust futureproof way to extract the sections. So maybe that can be combined with what is already there.

You could of course support both square brackets and span tags, and use recursive search to extract both, then merge the data.frame with tags and sections. Then you can have nested span tags, even.

cjvanlissa commented 1 year ago

I've adapted the same function so that it also parses (nested) spans. We could run this function twice, once with is_span = TRUE and once with is_span = FALSE, and then rbind the resulting data.frames

conig commented 1 year ago

Rendering differences

If I render your example here, the anchor tags do not appear in the PDF - but neither does the text between square brackets. And when I run my function on an actual manuscript with multi-line comments, everything works as expected (sections are in the manuscript without brackets, and they are parsed by read_manuscript).

I suspect that different rendering functions may have different behaviors for square brackets, something to look into.

Very odd that we would get different results for the same rendering function. What version of pandoc are you on? I'm on:

> rmarkdown::pandoc_version()
# [1] ‘2.19.2’

One very extreme solution is using a custom knit function, as I did with worcs::cite_essential. Ideally, that would be avoided though.

I would prefer to avoid custom rendering if possible as it would force users to choose to be in our ecosystem. Currently we can support any markdown document rendered by pandoc. E.g., quarto as well as rmd.

As an aside I get the same issue with quarto which makes me think our discrepancy comes down to our pandoc versions:

image

You could of course support both square brackets and span tags, and use recursive search to extract both, then merge the data.frame with tags and sections. Then you can have nested span tags, even.

You are very quick. I'm impressed how quickly you got that regex off the ground. We actually have a separate function that handles spans written by @tarensanders.

Hidden in this file (we need a reorganise) https://github.com/conig/revise/blob/main/R/read_manuscript.R

extract_html_sections

Currently read_manuscript tries to extract md and html, and then appends them.

So if we call the following file "test_manuscript.qmd" With contents

---
format: "pdf"
---

# Span method

<span id = "spanmethod">

First line

Second line

</span>

# MD method

[

First line

Second line

]{#mdmethod}

Then the following is extracted with read manuscript:

manuscript <- revise::read_manuscript("test_manuscript.qmd", PDF = FALSE)
manuscript$sections
#> $mdmethod
#> [1] "\nFirst line\nSecond line"
#> 
#> $spanmethod
#> [1] "\n\nFirst line\n\nSecond line\n\n"

So I think this functionality is already handled, but let me know if I've missed something.

Updated md function

Regardless of the knitting of the text, I do think a recursive search is the most robust futureproof way to extract the sections. So maybe that can be combined with what is already there.

Absolutely, I agree that your function is superior. I think we should merge in your commit (without the span support).

In addition, I'll create a warning when the md anchors are used on multi-lines that for some versions of pandoc md glyphs will be left within the output manuscript.

Are you happy with these courses of action?

tarensanders commented 1 year ago

Wow, this looks great. I'm a big fan of having one sentence per line for version control, so this would be a welcome addition IMO.

cjvanlissa commented 1 year ago

Dear @conig , thank you for following up! And @tarensanders thank you for the vote of confidence.

Before determining next couse of action, let me try to iron out the hurdles you mentioned.

Indeed, span functionality is already there (although I did miss it at first). Now it's fully handled by my parsing function too. The reasons I think it's worth considering to use the same parsing function for both square brackets and spans is: Consistent/predictable functionality across both types of tags, only one function that needs to be maintained going forward, and losing dependencies on (at least) rvest and xml2, thus making the package more lightweight (= easier dependency management for reproducible projects). Also, depending fully on base R for parsing is more futureproof than depending on two other packages with high-level functions that may be subject to breaking changes.

To be able to properly test whether my proposed function can meet specifications, I have updated read_manuscript() in my fork to use only my parsing functions. I'm expanding the test suite to, e.g., check on different systems whether brackets end up in the final PDF or not. I'll also run it on different systems with different versions of pandoc, until I understand what the problem is.

cjvanlissa commented 1 year ago

Brackets remaining in the rendered PDF is indeed a pandoc problem. I was on pandoc version 2.18, which is bundled with Rstudio. That version does not leave brackets in the PDF. After installing version 2.19.2, however, I do get brackets in the text.

To me, that seems like a pure pandoc issue that will need to be dealt with by the Rstudio and Rmarkdown crew. Yihui notes here that newer versions of pandoc than those bundled with Rstudio are likely to result in some problem behavior: https://bookdown.org/yihui/rmarkdown-cookbook/install-pandoc.html

Having determined that this is not a problem of the parsing function, I see two ways to resolve it:

  1. Require pandoc version == 2.18
  2. Give the warning about brackets if pandoc version > 2.18
conig commented 1 year ago

Now it's fully handled by my parsing function too. The reasons I think it's worth considering to use the same parsing function for both square brackets and spans is: Consistent/predictable functionality across both types of tags, only one function that needs to be maintained going forward, and losing dependencies on (at least) rvest and xml2, thus making the package more lightweight (= easier dependency management for reproducible projects). Also, depending fully on base R for parsing is more futureproof than depending on two other packages with high-level functions that may be subject to breaking changes.

Excellent arguments, I'm convinced. I also really appreciate the cleaning up of dependencies etc that you've been doing. If interested, please add yourself as an author in the description file.

I think crucial to figure out which anchor behaviour is intended. See discussion in issue #18. It

From the pandoc manual it looks as though []{#} was only meant to work for inline tagging. In which case future versions of pandoc may not bring back the ability for multi-line tagging with that syntax. As rstudio regularly updates their bundled version of pandoc, the days of 2.18 are probably numbered. For that reason we'd want at least a warning for >2.18 as you suggest. In the referenced issue I discuss potentially supporting fenced div blocks which is documented in the pandoc manual and likely to be more stable. Of course spans still work fine for this situation and are flexible. We actually started only supporting spans, but I added support for []{#} as an old version of rstudio forcibly replaced spans with []{#} when you toggled the visual editor.

In addition, one final issue to also iron out. Take the following qmd:

---
format: "pdf"
---

<span id = "SpanTest">

## Heading

Maecenas mollis consectetur purus. Ut ultrices metus in mauris congue ultricies. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia curae; Integer pulvinar non nisi in tristique. Nam euismod nibh et mauris bibendum pellentesque.

Suspendisse vulputate, lacus vel finibus placerat, tortor nulla fermentum ipsum, sit amet facilisis mauris velit at lectus. Nam quis libero eget eros luctus vehicula a sit amet quam. Morbi varius augue a augue posuere, vitae tempor tellus scelerisque. Fusce sed rhoncus felis. Morbi lorem odio, egestas at mollis nec, bibendum ac magna.

</span>

Previously the span tagging worked fine for extracting multi-paragraphs

image

However on the latest commit such sections are assumed to be part of a single paragraph. This makes sense when each sentence gets a separate line as you and Taren advocate for. But it would be good to support those not using this writing method. I think an argument could be used to control this behaviour. IMO the default should result in the legacy behaviour (linebreaks assumed to be paragraph breaks) as most people write that way, and to ensure that existing docs don't need to be re-written .

See below for behaviour in: revise@ac1b1012c29c656fefd7ee2cd4ada59d5cd3fd05

image

cjvanlissa commented 1 year ago

Good to hear! I agree with your points. Next week, I'll dig into the example you posted, and I'll also make the syntax for parsing <span>s a little more robust.

cjvanlissa commented 1 year ago

I can easily preserve the capturing of multi-paragraph spans, however, if the span tag is on a separate line, the section gains an extra empty line. This is not a bug, because after the span tag comes a newline. However, it is also not intuitive. Using trimws() removes extra newlines, but also all intentional newlines.

I think no further action is required here; it's just important to understand why the parsed section begins with two newlines instead of one.

conig commented 1 year ago

Agreed, and the extra newlines shouldn't cause issues as pandoc will ignore them.

This all looks great, thanks for your work on this one. Merging in.