dantonnoriega / xmltools

Tools to look at XML data. Has functions similar to the `tree` command line tool (xml_view_tree). Allows one to find paths quickly, including just terminal node paths (xml_get_paths). Also has two functions for helping convert XML code to data frames (xml_to_df and xml_dig_df).

issues with duplicated or missing names #4

Open ThomasGro opened 5 years ago

ThomasGro commented 5 years ago

I want to extract data from a large, deeply nested XML file. I followed the library(xml2) workflow from your example 2, and it successfully generated a list of terminal nodes and XPaths. I can also run the xml_dig_df value-extraction step, but converting the extracted data to a data frame fails.

purrr::map(dplyr::bind_rows) => throws error "Argument 31 must have names"

Also, if I run the library(xml) workflow, setnames throws an error:

"Can't assign 1 names to a 113576 column data.table"

traceback()
9: stop("Can't assign ", length(old), " names to a ", ncol(x), " column data.table")
8: setnames(x, value)
7: names<-.data.table(*tmp*, value = fields)
6: names<-(*tmp*, value = fields)
5: FUN(X[[i]], ...)
4: lapply(terminal_xpaths, xml_to_df, file = "cellosaurus.xml", is_xml = FALSE, dig = FALSE)
3: eval(lhs, parent, parent)
2: eval(lhs, parent, parent)
1: lapply(terminal_xpaths, xml_to_df, file = "cellosaurus.xml", is_xml = FALSE, dig = FALSE) %>% dplyr::bind_cols()

Is there a way to dig into the nested list that is passed to purrr::map(dplyr::bind_rows) to find out where or what the issue is?

Thank you,
Thomas
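One way to approach the question above is to scan the nested list for records whose names are missing or empty before handing it to dplyr::bind_rows. The following is only a sketch in base R: the object `extracted` and the helpers `has_bad_names` and `find_bad_records` are hypothetical names made up for illustration, and the sketch assumes the "must have names" error comes from an unnamed or partially named record, which has not been confirmed for the cellosaurus.xml file.

```r
# Hypothetical stand-in for the nested list produced by the value-extraction
# step: each top-level element holds the records that bind_rows() would combine.
extracted <- list(
  group_a = list(c(id = "1", name = "x"), c(id = "2", name = "y")),
  group_b = list(c(id = "3", name = "z"), c("4", "w")),  # 2nd record: no names
  group_c = list(c(id = "5", "v"))                       # 1st record: one empty name
)

# TRUE when a record has no names at all, or any NA or empty name.
has_bad_names <- function(x) {
  nms <- names(x)
  is.null(nms) || anyNA(nms) || any(nms == "")
}

# For every group, collect the positions of records with bad names,
# then keep only the groups that contain at least one such record.
find_bad_records <- function(groups) {
  bad <- lapply(groups, function(records) {
    which(vapply(records, has_bad_names, logical(1)))
  })
  bad[lengths(bad) > 0]
}

bad <- find_bad_records(extracted)
str(bad)
```

Each entry of `bad` names a group and gives the positions of the offending records, so a record can then be inspected directly, for example with `extracted[["group_b"]][[2]]`.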

dantonnoriega commented 1 year ago

Resolved by #8, I think? @lecy, did you get similar errors?

lecy commented 1 year ago

It looks like it might be. I think the pull request just needs to be merged for the updates to take effect, then you can test it out.

The default .name_repair setting in tibble() checks whether names are unique but does not repair them if they aren't. I changed it to make names unique. I suspect that would address the error, but there is no reproducible example, so I'm not 100% sure.

as_tibble( .name_repair = "unique" )

.name_repair argument: Treatment of problematic column names:

"minimal": No name repair or checks, beyond basic existence,
"unique": Make sure names are unique and not empty,
"check_unique": (default value), no name repair, but check they are unique,
"universal": Make the names unique and syntactic
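The difference between the default and "unique" can be seen on a small made-up input. This is a sketch only, assuming the tibble package is installed; the list `cols` and its column names are hypothetical and are not taken from xmltools or from the file in the original report.

```r
library(tibble)

# Hypothetical columns: one duplicated name and one empty name.
cols <- list(1:2, 3:4, 5:6)
names(cols) <- c("value", "value", "")

# With the default ("check_unique") the names are checked but not repaired,
# so the conversion fails; the error is caught here instead of stopping.
default_result <- tryCatch(
  as_tibble(cols),
  error = function(e) "error"
)

# With "unique" the names are repaired so that none is duplicated or empty.
# suppressMessages() hides the note tibble prints about the renamed columns.
repaired <- suppressMessages(as_tibble(cols, .name_repair = "unique"))

print(default_result)
print(names(repaired))
```

The repaired names differ from the originals, so any later step that looks columns up by their original name would need to account for that.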