Closed scottporter closed 4 years ago
OK, I don't have much experience with the xml2 package, but I poked around, and think I found a solution.
xml2 has another function, xml_text(), and its behavior with self-closing tags is what we would hope for. It returns a text representation of the content, but that seems appropriate for the getSSScodes function.
Here is my test case, subbing in xml_text for xml_contents:
x<-xml2::read_xml("<variable ident='412' type='single' formname='GotHere' formlabel='GotHere' formtype='single' source='p407369808' routingid='352' fieldwidth='2'>
<name>GotHere</name>
<label>GotHere</label>
<position start='73553' finish='73554'/>
<values>
<value code='1'>INTRO PAGE</value>
<value code='2'>SCREENER PAGE</value>
<value code='3'>CONTENT PAGE</value>
<value code='4'>DEMO PAGE</value>
<value code='5'>FOLLOW UP PAGE</value>
<value code='6'/>
<value code='7'/>
<value code='8'/>
<value code='9'/>
</values>
</variable>")
xx<-xml2::xml_find_all(x,"values/value")
size<-length(xx)
data.frame(ident = rep(unname(xml2::xml_attr(x, "ident")),
size), code = as.character(xml2::xml_attrs(xx)), codevalues = as.character(xml2::xml_text(xx)),
stringsAsFactors = FALSE)
and results:
> x<-xml2::read_xml("<variable ident='412' type='single' formname='GotHere' formlabel='GotHere' formtype='single' source='p407369808' routingid='352' fieldwidth='2'>
... <name>GotHere</name>
... <label>GotHere</label>
... <position start='73553' finish='73554'/>
... <values>
... <value code='1'>INTRO PAGE</value>
... <value code='2'>SCREENER PAGE</value>
... <value code='3'>CONTENT PAGE</value>
... <value code='4'>DEMO PAGE</value>
... <value code='5'>FOLLOW UP PAGE</value>
... <value code='6'/>
... <value code='7'/>
... <value code='8'/>
... <value code='9'/>
... </values>
... </variable>")
>
> xx<-xml2::xml_find_all(x,"values/value")
> size<-length(xx)
> data.frame(ident = rep(unname(xml2::xml_attr(x, "ident")),
... size), code = as.character(xml2::xml_attrs(xx)), codevalues = as.character(xml2::xml_text(xx)),
... stringsAsFactors = FALSE)
  ident code     codevalues
1   412    1     INTRO PAGE
2   412    2  SCREENER PAGE
3   412    3   CONTENT PAGE
4   412    4      DEMO PAGE
5   412    5 FOLLOW UP PAGE
6   412    6
7   412    7
8   412    8
9   412    9
I also hacked this change into the getSSScodes function of my local copy of the repo and tried importing the triple-s xml files that have been giving me errors, and they imported without a problem.
Will you consider submitting your changes as a pull request? I'm getting ready to submit a new version to CRAN (with minor changes only), so now would be a good time to contribute a fix.
I think I have accidentally fixed this as part of #9 (currently in the dev branch).
"Accidentally", since I was fixing a different problem, but that also required using xml_text().
This fix is now in the master branch on github. Please can you test on as many survey files as possible and report any problems? I'll submit a new version to CRAN early next week if I don't get any error reports.
I wasn't getting the thread notifications on comments, but I did get one on close. I will test out the revised code and let you know what I see... thanks!
It took me a long time to get to this, but you are correct: your fix solved this issue with reading the metadata files I've been testing with. In my testing, I'm realizing I'm having trouble reading the associated data files... but that's a completely different issue. After I figure out what's going on there, I'll open a new issue or feature request.
Thanks for reporting, @scottporter .
The triple-s files I get from our survey programming team sometimes have placeholder values programmed... they've reserved some values, but there is no label. Those come through in the xml as self-closing tags. This causes problems with the function getSSScodes.
I think I've convinced myself that this is actually a problem in the dependency xml2. It doesn't seem to handle the self-closing tags. I've put together an example xml snippet that, if put through the relevant lines within getSSScodes, reproduces the error I'm seeing.
I get the error:
Error in data.frame(ident = rep(unname(xml2::xml_attr(x, "ident")), size), : arguments imply differing number of rows: 9, 5
Even worse, if the xml happened to have the right number of placeholder tags, the content could be recycled without error or warning:
Produces the following dataframe with recycled labels:
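A minimal sketch of that recycling case, assuming a hypothetical variable with ten value tags of which five are self-closing. The XML, the ident and the labels are invented for illustration and are not taken from a real triple-s file; only xml2 functions are called (read_xml, xml_find_all, xml_attr, xml_contents, xml_text).

```r
library(xml2)

# Hypothetical variable: 10 <value> tags, the last 5 are self-closing placeholders
x <- read_xml("<variable ident='1'>
<values>
<value code='1'>A</value>
<value code='2'>B</value>
<value code='3'>C</value>
<value code='4'>D</value>
<value code='5'>E</value>
<value code='6'/>
<value code='7'/>
<value code='8'/>
<value code='9'/>
<value code='10'/>
</values>
</variable>")

xx <- xml_find_all(x, "values/value")
codes <- xml_attr(xx, "code")

# xml_contents() returns the child nodes of the whole node set, flattened,
# so the self-closing tags contribute nothing: only 5 labels come back
labels_contents <- as.character(xml_contents(xx))

# xml_text() returns one string per node, "" for the self-closing tags
labels_text <- xml_text(xx)

# 10 codes against 5 labels: data.frame() recycles the labels silently
recycled <- data.frame(code = codes, codevalues = labels_contents,
                       stringsAsFactors = FALSE)

# 10 codes against 10 labels: nothing is recycled
fixed <- data.frame(code = codes, codevalues = labels_text,
                    stringsAsFactors = FALSE)

print(recycled)
print(fixed)
```

In this sketch the recycled data frame assigns the labels A to E to codes 6 to 10 as well, while the xml_text() version leaves those labels empty.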