Bioconductor / sweave2rmd

A project for converting Bioconductor Sweave documents to Rmd
Creative Commons Attribution Share Alike 4.0 International

Develop Lua filters for common LaTeX macros not handled by pandoc directly #35

Open mtmorgan opened 1 year ago

mtmorgan commented 1 year ago

Following on from https://github.com/Bioconductor/sweave2rmd/issues/34: this StackOverflow post shows how to write a Lua filter; a set of these might be developed for the BiocStyle macros as a kind of 'meta' resource for this project.

This

return {
  {
    RawInline = function (raw)
      local macro = raw.text:match '\\R{}'
      if raw.format == 'latex' and macro then
        return pandoc.RawInline('markdown', '_R_')
      end
    end
  }
}

would replace the Rnw macro \R{} with the markdown _R_ and, if saved in a file BiocStyle-Rnw-to-Rmd.lua, would be used as

pandoc -f latex+raw_tex -t markdown file.Rnw --lua-filter BiocStyle-Rnw-to-Rmd.lua -o file.Rmd

The next macros to tackle are likely \CRANpkg{<package name>} and \Biocpkg{<package name>}, which translate to the markdown links [<package name>](https://cran.r-project.org/package=<package name>) and [<package name>](https://bioconductor.org/packages/<package name>), followed by \Rcode{<inline code>}, which translates to `<inline code>`. I think Sweave code chunks <<...>>= ... @ could also be translated automatically.
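
As a rough illustration of the translations listed above, here is a minimal sketch in plain Lua that works on text directly rather than through pandoc's filter API. The function names translate_macros and translate_chunks are made up for this example, and chunk options are copied verbatim into the Rmd chunk header on the assumption that they carry over; Sweave-specific options may still need manual adjustment.

```lua
-- Sketch only: plain-text substitutions, independent of pandoc.
-- All names below are hypothetical and not part of any existing filter.

-- Three backticks, built at run time so this listing can sit inside a fence.
local FENCE = string.rep('`', 3)

-- Translate \CRANpkg{}, \Biocpkg{} and \Rcode{} wherever they occur in s.
local function translate_macros(s)
  s = s:gsub('\\CRANpkg{([^}]*)}',
             '[%1](https://cran.r-project.org/package=%1)')
  s = s:gsub('\\Biocpkg{([^}]*)}',
             '[%1](https://bioconductor.org/packages/%1)')
  s = s:gsub('\\Rcode{([^}]*)}', '`%1`')
  return s
end

-- Translate Sweave chunks (<<options>>= ... @) into Rmd chunks, line by line.
local function translate_chunks(text)
  local out, in_chunk = {}, false
  for line in (text .. '\n'):gmatch('(.-)\n') do
    local converted = line
    local opts = line:match('^<<(.*)>>=%s*$')
    if opts and not in_chunk then
      -- Chunk header: options are copied verbatim (an assumption).
      in_chunk = true
      converted = FENCE .. (opts == '' and '{r}' or ('{r ' .. opts .. '}'))
    elseif in_chunk and line:match('^@%s*$') then
      -- A lone @ closes the chunk.
      in_chunk = false
      converted = FENCE
    end
    out[#out + 1] = converted
  end
  return table.concat(out, '\n')
end

print(translate_macros('The \\CRANpkg{knitr} package and \\Rcode{x <- 1}.'))
print(translate_chunks('<<setup, echo=FALSE>>=\nx <- 1\n@'))
```

The patterns use [^}]* so that two macros on the same line are not merged into one match; arguments containing nested braces would need something more careful.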

jwokaty commented 1 year ago

Thank you again. This is exactly where we wanted to go! I think this would be an interesting task for our future Outreachy fellow.

LiNk-NY commented 1 year ago

Thanks Martin, @mtmorgan

Here is a working filter that I was able to come up with. The language is a bit unwieldy and I'm a novice :)

function RawInline (raw)
    local formula = raw.text:match '\\Rpackage{(.*)}'
    if raw.format == 'latex' and formula then
        return pandoc.RawInline('markdown', '`r Biocpkg(' .. formula .. ')`')
    end

    local formula = raw.text:match '\\Robject{(.*)}'
    if raw.format == 'latex' and formula then
        return pandoc.RawInline('markdown', '`' .. formula .. '`')
    end

    local formula = raw.text:match '\\Rfunction{(.*)}'
    if raw.format == 'latex' and formula then
        return pandoc.RawInline('markdown', '`' .. formula .. '`')
    end
end

mtmorgan commented 1 year ago

It would probably be helpful to come up with a test Rnw document and a corresponding expected Rmd document, with one line per LaTeX 'test' and its corresponding Rmd. I tweaked your code and mine a bit

return {
  {
    RawInline = function (raw)
      local macro = raw.text:match '\\R{}'
      if raw.format == 'latex' and macro then
        return pandoc.RawInline('markdown', '*R*')
      end

      local macro = raw.text:match '\\R$'
      if raw.format == 'latex' and macro then
        return pandoc.RawInline('markdown', '*R*')
      end

      local formula = raw.text:match '\\Bioconductor{}'
      if raw.format == 'latex' and formula then
         return pandoc.RawInline('markdown', '*Bioconductor*')
      end

      local formula = raw.text:match '\\CRANpkg{([^}]*)}'
      if raw.format == 'latex' and formula then
         return pandoc.RawInline('markdown', '`r CRANpkg(' .. formula .. ')`')
      end

      local formula = raw.text:match '\\Biocpkg{([^}]*)}'
      if raw.format == 'latex' and formula then
         return pandoc.RawInline('markdown', '`r Biocpkg(' .. formula .. ')`')
      end

      local formula = raw.text:match '\\Githubpkg{([^}]*)}'
      if raw.format == 'latex' and formula then
         return pandoc.RawInline('markdown', '`r Githubpkg(' .. formula .. ')`')
      end

      local formula = raw.text:match '\\Rpackage{([^}]*)}'
      if raw.format == 'latex' and formula then
         return pandoc.RawInline('markdown', '`' .. formula .. '`')
      end

      local formula = raw.text:match '\\Robject{(.*)}'
      if raw.format == 'latex' and formula then
         return pandoc.RawInline('markdown', '`' .. formula .. '`')
      end

      local formula = raw.text:match '\\Rcode{(.*)}'
      if raw.format == 'latex' and formula then
         return pandoc.RawInline('markdown', '`' .. formula .. '`')
      end

      local formula = raw.text:match '\\software{(.*)}'
      if raw.format == 'latex' and formula then
         return pandoc.RawInline('markdown', '`' .. formula .. '`')
      end

      local formula = raw.text:match '\\file{(.*)}'
      if raw.format == 'latex' and formula then
         return pandoc.RawInline('markdown', '`' .. formula .. '`')
      end

      local formula = raw.text:match '\\Rfunction{(.*)}'
      if raw.format == 'latex' and formula then
         return pandoc.RawInline('markdown', '`' .. formula .. '`')
      end
    end
  }
}

to translate

The \R{} programming language

\R\ is a programming language.

The name of one programming language is simply \R.

\Biocpkg{BiocStyle} is a \Bioconductor{} package.

The \CRANpkg{knitr} is used to create markdown vignettes.

Sometimes packages, like \Githubpkg{AnVILAz} are only found on Github.

The \R{} package \Rpackage{foo} is not found in any common repository

\software{samtools} is pretty important in Bioinformatics...

\Robject{mtcars} is a \Rcode{data.frame}.

\Rfunction{data.frame} is a function used to create a \Rcode{data.frame}.

\Rfunction{data.frame()} is a function used to create a \Rcode{data.frame}.

Sometimes inline \R{} code \Rcode{x <-
1 + 1} can span two lines.

to get something that is mostly correct(?)

The *R* programming language

*R* is a programming language.

The name of one programming language is simply *R*.

`r Biocpkg(BiocStyle)` is a *Bioconductor* package.

The `r CRANpkg(knitr)` is used to create markdown vignettes.

Sometimes packages, like `r Githubpkg(AnVILAz)` are only found on
Github.

The *R* package `foo` is not found in any common repository

`samtools` is pretty important in Bioinformatics\...

`mtcars` is a `data.frame`.

`data.frame` is a function used to create a `data.frame`.

`data.frame()` is a function used to create a `data.frame`.

Sometimes inline *R* code `x <-
1 + 1` can span two lines.

As you note, probably there are much better ways of implementing the Lua code, which is highly repetitive now! Also, maybe we could start a Lua repository that might start to follow better practices (than an issue thread!) for Lua development...
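
One way the repetition might be reduced is a lookup table keyed by macro name, sketched below under a few assumptions. The names code, r_call, with_argument, without_argument and translate are invented for this sketch. It assumes each raw inline holds exactly one macro (so the pattern is anchored at both ends, which also lets nested braces through), and it puts the package name in double quotes on the assumption that the BiocStyle R helpers expect a character string. That differs from the output shown above, and the `\R\ ` form is not handled.

```lua
-- Sketch: one lookup table instead of one `if` block per macro.
-- All names are hypothetical; behaviour differs slightly from the filter
-- above (anchored patterns, quoted package names).

-- Wrap an argument in backticks as inline code.
local function code(arg) return '`' .. arg .. '`' end

-- Build a formatter producing an inline R call with a quoted argument.
local function r_call(fn)
  return function(arg) return '`r ' .. fn .. '("' .. arg .. '")`' end
end

-- Macros that take an argument, mapped to a formatter for that argument.
local with_argument = {
  CRANpkg = r_call('CRANpkg'),
  Biocpkg = r_call('Biocpkg'),
  Githubpkg = r_call('Githubpkg'),
  Rpackage = code, Robject = code, Rcode = code,
  software = code, file = code, Rfunction = code,
}

-- Macros used without an argument, mapped to fixed markdown.
local without_argument = { R = '*R*', Bioconductor = '*Bioconductor*' }

-- Return markdown for a single LaTeX macro, or nil if it is not recognised.
local function translate(text)
  local name, arg = text:match('^\\(%a+){(.*)}$')
  if not name then
    -- Bare form such as \R with no braces.
    name, arg = text:match('^\\(%a+)$'), ''
  end
  if not name then return nil end
  if arg == '' and without_argument[name] then
    return without_argument[name]
  end
  if with_argument[name] then
    return with_argument[name](arg)
  end
  return nil
end

-- When run by pandoc, expose the filter; in plain Lua this part is skipped.
if pandoc then
  function RawInline(raw)
    if raw.format == 'latex' then
      local md = translate(raw.text)
      if md then return pandoc.RawInline('markdown', md) end
    end
  end
end
```

Supporting a new macro then means adding one table entry, and translate can be exercised in plain Lua without running pandoc.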

jwokaty commented 1 year ago

@mcarlsn @villafup @BerylKanali It might be that you've noticed things that we repeatedly have to edit manually to get them into the right format. It might be good to start documenting those here, so that we can make sure those cases are included. I agree with @mtmorgan that it would be nice to come up with a test .Rnw. Maybe @BerylKanali can help with this given some guidance?
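
To make the idea of paired test cases concrete, here is a small sketch in plain Lua. Everything in it is hypothetical: translate_line is a deliberately incomplete stand-in for the real filter, and run_cases is an invented helper. Each case pairs one line of Rnw with the Rmd expected for it, and any mismatch is reported.

```lua
-- Hypothetical stand-in for the real filter logic; only two macros handled.
local function translate_line(line)
  line = line:gsub('\\R{}', '*R*')
  line = line:gsub('\\Rcode{([^}]*)}', '`%1`')
  return line
end

-- One entry per test: a line of Rnw and the Rmd expected for it.
local cases = {
  { rnw = 'The \\R{} programming language',
    rmd = 'The *R* programming language' },
  { rnw = '\\Robject{mtcars} is a \\Rcode{data.frame}.',
    rmd = '`mtcars` is a `data.frame`.' },
}

-- Return a list of failure messages (empty when every case passes).
local function run_cases(translate, list)
  local failures = {}
  for i, case in ipairs(list) do
    local got = translate(case.rnw)
    if got ~= case.rmd then
      failures[#failures + 1] = string.format(
        'case %d: expected %q, got %q', i, case.rmd, got)
    end
  end
  return failures
end

-- The second case fails here because the stand-in ignores \Robject{}.
for _, msg in ipairs(run_cases(translate_line, cases)) do
  print(msg)
end
```

Cases that people currently fix by hand could be appended to the table as they are noticed, so an unhandled macro shows up as a reported failure.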

mtmorgan commented 1 year ago

@jwokaty perhaps it makes sense to create a lua branch and add an inst/lua directory with progress so far? I've iterated a bit on @LiNk-NY's work, and things look pretty promising. Definitely @BerylKanali could help with the test Rnw file!