andrie / sss

R package to import files in the triple-s (Standard Survey Structure) format.
http://andrie.github.io/sss/

self closing tags on values cause issues #10

Closed scottporter closed 4 years ago

scottporter commented 4 years ago

The triple-s files I get from our survey programming team sometimes contain placeholder values: codes that have been reserved but have no label. These come through in the XML as self-closing tags, which causes problems in the function getSSScodes.

I think I've convinced myself that this is actually a problem in the dependency xml2, which doesn't seem to handle the self-closing tags. I've put together an example XML snippet that, when run through the relevant lines of getSSScodes, reproduces the error I'm seeing.

x<-xml2::read_xml("<variable ident='412' type='single' formname='GotHere' formlabel='GotHere' formtype='single' source='p407369808' routingid='352' fieldwidth='2'>
  <name>GotHere</name>
  <label>GotHere</label>
  <position start='73553' finish='73554'/>
  <values> 
    <value code='1'>INTRO PAGE</value>
    <value code='2'>SCREENER PAGE</value>
    <value code='3'>CONTENT PAGE</value>
    <value code='4'>DEMO PAGE</value>
    <value code='5'>FOLLOW UP PAGE</value>
    <value code='6'/>
    <value code='7'/>
    <value code='8'/>
    <value code='9'/>
  </values>
  </variable>")

xx <- xml2::xml_find_all(x, "values/value")
size <- length(xx)

data.frame(
  ident = rep(unname(xml2::xml_attr(x, "ident")), size),
  code = as.character(xml2::xml_attrs(xx)),
  codevalues = as.character(xml2::xml_contents(xx)),
  stringsAsFactors = FALSE
)

I get the error: Error in data.frame(ident = rep(unname(xml2::xml_attr(x, "ident")), size), : arguments imply differing number of rows: 9, 5
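The 9 versus 5 in that message can be checked directly by counting what each call returns. The following is a minimal sketch on a hypothetical, reduced document (two labelled values and one placeholder, not the real survey file); it only compares lengths and is not code from the sss package.

```r
library(xml2)

# Hypothetical reduced document: two labelled values and one placeholder.
doc <- xml2::read_xml(
  "<values><value code='1'>A</value><value code='2'>B</value><value code='3'/></values>"
)
vals <- xml2::xml_find_all(doc, "value")

# One attribute value per <value> node, including the placeholder.
n_codes <- length(xml2::xml_attr(vals, "code"))

# Child nodes pooled across the whole node set; the placeholder has none.
n_contents <- length(xml2::xml_contents(vals))

c(codes = n_codes, contents = n_contents)
```

If the counts differ, the columns handed to data.frame have different lengths, which matches the error above.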

Even worse, if the total number of value tags happens to be an exact multiple of the number of labelled ones, the labels are recycled without any error or warning:

x2<-xml2::read_xml("<variable ident='412' type='single' formname='GotHere' formlabel='GotHere' formtype='single' source='p407369808' routingid='352' fieldwidth='2'>
  <name>GotHere</name>
  <label>GotHere</label>
  <position start='73553' finish='73554'/>
  <values> 
    <value code='1'>INTRO PAGE</value>
    <value code='2'>SCREENER PAGE</value>
    <value code='3'>CONTENT PAGE</value>
    <value code='4'>DEMO PAGE</value>
    <value code='5'>FOLLOW UP PAGE</value>
    <value code='6'/>
    <value code='7'/>
    <value code='8'/>
    <value code='9'/>
    <value code='10'/>
  </values>
  </variable>")

xx2 <- xml2::xml_find_all(x2, "values/value")
size <- length(xx2)

data.frame(
  ident = rep(unname(xml2::xml_attr(x2, "ident")), size),
  code = as.character(xml2::xml_attrs(xx2)),
  codevalues = as.character(xml2::xml_contents(xx2)),
  stringsAsFactors = FALSE
)

This produces the following data frame, with the labels for codes 1 to 5 recycled onto codes 6 to 10:

   ident code     codevalues
1    412    1     INTRO PAGE
2    412    2  SCREENER PAGE
3    412    3   CONTENT PAGE
4    412    4      DEMO PAGE
5    412    5 FOLLOW UP PAGE
6    412    6     INTRO PAGE
7    412    7  SCREENER PAGE
8    412    8   CONTENT PAGE
9    412    9      DEMO PAGE
10   412   10 FOLLOW UP PAGE
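
The silent recycling is base R behaviour: data.frame repeats a shorter column when the longer length is an exact multiple of it. A minimal sketch of that behaviour and of one possible guard follows; build_codes is a hypothetical helper for illustration, not a function in sss or xml2.

```r
codes  <- as.character(1:10)
labels <- c("A", "B", "C", "D", "E")

# 10 is an exact multiple of 5, so the labels are recycled without a warning.
recycled <- data.frame(code = codes, label = labels, stringsAsFactors = FALSE)

# Hypothetical guard: refuse to build the table unless the lengths match.
build_codes <- function(code, label) {
  stopifnot(length(code) == length(label))
  data.frame(code = code, label = label, stringsAsFactors = FALSE)
}

# With mismatched lengths the guard raises an error instead of recycling.
guarded <- tryCatch(
  build_codes(codes, labels),
  error = function(e) "length mismatch"
)
```

A check like this would turn the silent mislabelling into a visible failure, whichever xml2 accessor is used to extract the labels.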

scottporter commented 4 years ago

OK, I don't have much experience with the xml2 package, but I poked around, and think I found a solution.

xml2 has another function, xml_text, whose behavior with self-closing tags is what we would hope for: it returns one string per node, and an empty string where the tag has no content. It returns text rather than the nodes themselves, but that seems appropriate for the getSSScodes function.

Here is my test case, subbing in xml_text for xml_contents:

x<-xml2::read_xml("<variable ident='412' type='single' formname='GotHere' formlabel='GotHere' formtype='single' source='p407369808' routingid='352' fieldwidth='2'>
  <name>GotHere</name>
  <label>GotHere</label>
  <position start='73553' finish='73554'/>
  <values> 
    <value code='1'>INTRO PAGE</value>
    <value code='2'>SCREENER PAGE</value>
    <value code='3'>CONTENT PAGE</value>
    <value code='4'>DEMO PAGE</value>
    <value code='5'>FOLLOW UP PAGE</value>
    <value code='6'/>
    <value code='7'/>
    <value code='8'/>
    <value code='9'/>
  </values>
  </variable>")

xx <- xml2::xml_find_all(x, "values/value")
size <- length(xx)

data.frame(
  ident = rep(unname(xml2::xml_attr(x, "ident")), size),
  code = as.character(xml2::xml_attrs(xx)),
  codevalues = as.character(xml2::xml_text(xx)),
  stringsAsFactors = FALSE
)

and results:

  ident code     codevalues
1   412    1     INTRO PAGE
2   412    2  SCREENER PAGE
3   412    3   CONTENT PAGE
4   412    4      DEMO PAGE
5   412    5 FOLLOW UP PAGE
6   412    6
7   412    7
8   412    8
9   412    9

I also hacked this change into the getSSScodes function of my local copy of the repo and tried importing the triple-s xml files that have been giving me errors, and they imported without a problem.
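
For reference, the change can be sketched as a small standalone function. This is an illustration only, assuming the same values/value layout as the snippet above: codes_from_variable is a hypothetical name, not the actual getSSScodes implementation, and it reads the code attribute by name with xml_attr rather than converting the result of xml_attrs.

```r
library(xml2)

# Hypothetical helper, not the package's getSSScodes: one row per <value> node.
codes_from_variable <- function(variable) {
  values <- xml2::xml_find_all(variable, "values/value")
  data.frame(
    ident = rep(xml2::xml_attr(variable, "ident"), length(values)),
    code = xml2::xml_attr(values, "code"),
    # xml_text() yields one string per node, empty for self-closing tags.
    codevalues = xml2::xml_text(values),
    stringsAsFactors = FALSE
  )
}

# Reduced example document (an assumption, not a real survey file).
v <- xml2::read_xml(
  "<variable ident='412'><values><value code='1'>INTRO PAGE</value><value code='2'/></values></variable>"
)
out <- codes_from_variable(v)
```

Because every column is derived per node, the columns have equal length by construction and nothing is recycled.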

andrie commented 4 years ago

Will you consider submitting your changes as a pull request? I'm getting ready to submit a new version to CRAN (with minor changes only), so now would be a good time to contribute a fix.

andrie commented 4 years ago

I think I have accidentally fixed this as part of #9 (currently in the dev branch).

"Accidentally", since I was fixing a different problem, but that also required using xml_text().

andrie commented 4 years ago

This fix is now in the master branch on GitHub. Could you please test it on as many survey files as possible and report any problems? I'll submit a new version to CRAN early next week if I don't get any error reports.

scottporter commented 4 years ago

I wasn't getting the thread notifications on comments, but I did get one on close. I will test out the revised code and let you know what I see... thanks!

scottporter commented 3 years ago

It took me a long time to get to this, but you are correct: your fix solved this issue with reading the metadata files I've been testing with. In my testing, I've realized I'm having trouble reading the associated data files, but that's a completely different issue. After I figure out what's going on there, I'll open a new issue or feature request.

andrie commented 3 years ago

Thanks for reporting, @scottporter.