elipousson / idmlr

A R package to read InDesign Markup Language (IDML) files
https://elipousson.github.io/idmlr/

Add utility functions for converting nested `xml_document` objects to lists/data frames #3

Open elipousson opened 1 year ago

elipousson commented 1 year ago

I started looking at this as an example: https://rud.is/rpubs/xml2power/

Here is the code I started working on based on that blog post:

xtrct_text <- function(doc, target) {
  xml2::xml_find_all(doc, target) |>
    xml2::xml_text() |>
    trimws()
}

xtrct_attr <- function(doc, target) {
  xml2::xml_find_all(doc, target) |>
    xml2::xml_attrs()
}

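xtrct_df <- function(doc, top, type = c("attr", "text")) {
  type <- match.arg(type)

  # Names of the direct children of the first `top` node, de-duplicated so
  # repeated sibling tags are only queried once.
  child_names <- xml2::xml_find_first(doc, sprintf(".//%s", top)) |>
    xml2::xml_children() |>
    xml2::xml_name() |>
    unique()

  # For each child tag, collect either its attributes or its trimmed text.
  content <- purrr::map(
    child_names,
    function(x) {
      xpath <- sprintf(".//%s/%s", top, x)
      switch(type,
        attr = xtrct_attr(doc, xpath),
        text = xtrct_text(doc, xpath)
      )
    }
  )

  # TODO: flatten the result into a data frame and convert column types
  # (e.g. with readr::type_convert()).
  rlang::set_names(content, tolower(child_names))
}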

I need to review the IDML specification to figure out which information is stored in attributes or tag names and which is stored in text, and what I can check to determine how deeply nested the XML structure is for a given node.
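As a starting point for that review, here is a rough sketch of how each node in a document could be summarised by attribute count, child count, leaf text, and depth. The XML snippet is a made-up, simplified stand-in rather than real IDML, and `describe_nodes()` is a hypothetical helper name, not something in the package.

```r
library(xml2)

# Made-up, simplified XML for illustration only (not real IDML).
doc <- read_xml(
  '<Spread Self="u1" PageCount="2">
     <Page Self="p1" Name="1"/>
     <Page Self="p2" Name="2"/>
     <TextFrame Self="tf1"><Content>Hello</Content></TextFrame>
   </Spread>'
)

# Hypothetical helper: one row per element, in document order.
describe_nodes <- function(doc) {
  nodes <- xml_find_all(doc, "//*")
  n_children <- xml_length(nodes)

  data.frame(
    name = xml_name(nodes),
    # Number of attributes on each node
    n_attrs = lengths(xml_attrs(nodes)),
    # Number of child elements
    n_children = n_children,
    # TRUE only for leaf nodes that carry non-empty text
    has_text = n_children == 0 & nzchar(trimws(xml_text(nodes))),
    # Depth = number of ancestor elements (the root element is 0)
    depth = vapply(
      seq_along(nodes),
      function(i) length(xml_parents(nodes[[i]])),
      integer(1)
    )
  )
}

info <- describe_nodes(doc)
info
```

A summary like this might help decide, per tag, whether `type = "attr"` or `type = "text"` is the right extraction mode and how many levels would need to be flattened.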

elipousson commented 1 year ago

Making progress on this, although I still need to decide what level of detail the data frame should include and whether nested list columns of data frames would be a better way to avoid duplicative rows. `get_idml_spreads()`, for example, returns multiple rows for each spread because it returns a data frame with the information contained in the XML document for each spread. There should probably be a `list_idml_spreads()` function and a singular `get_idml_spread()` function to go along with the plural version.
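To illustrate the list-column idea, here is a minimal sketch using `tidyr::nest()`. The input data frame and its column names (`spread_id`, `page_id`, `page_name`) are invented for the example and are only an assumption about what the output of `get_idml_spreads()` might look like.

```r
library(tidyr)

# Invented example data: one row per page, so each spread is repeated.
spreads_long <- data.frame(
  spread_id = c("u1", "u1", "u2", "u2"),
  page_id   = c("p1", "p2", "p3", "p4"),
  page_name = c("1", "2", "3", "4")
)

# Collapse the per-page columns into a `pages` list column so that
# each spread occupies exactly one row.
spreads_nested <- nest(spreads_long, pages = c(page_id, page_name))

spreads_nested
spreads_nested$pages[[1]]
```

With this shape, a singular accessor would only need to pull one row and its nested data frame, while the plural version could return the nested table as-is.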