Consolidating metadata for a collection of files

fBedecarrats commented 6 months ago

Congratulations and many thanks for this great tool @dusadrian ! I understand how to use the convert function to produce DDI files with a one-to-one correspondence between source files (in Stata for instance) and XML files. But I can't figure out how to consolidate them. Here is a reproducible example using the model survey datasets provided by DHS:


# Install the latest version of DDIwR
remotes::install_github("https://github.com/dusadrian/DDIwR")
library(DDIwR)
library(tidyverse)

# Set our variables to acquire the data
dhs_dir <- "test/"
dhs_models <- c("zzbr62dt.zip", # Births Recode
                "zzcr61dt.zip", # Couples' Recode
                "zzhr62dt.zip", # Household Recode
                "zzir62.zip") # Individual Recode
dhs_base_url <- "https://www.dhsprogram.com/data/model_data/dhs/"

# Acquire the data
dir.create(dhs_dir)
dhs_urls <- paste0(dhs_base_url, dhs_models)
dhs_dest <- paste0(dhs_dir, dhs_models)
map2(dhs_urls, dhs_dest, download.file) # Download
map(dhs_dest, unzip, exdir = dhs_dir) # Unzip
stata_files <- list.files(dhs_dir, pattern = "\\.DTA$") # List data files

Here I have 4 Stata files that correspond to different questionnaire sections or different formatting of the same data. I would like to make a consolidated DDI file out of them. Here are two questions: How can I use DDIwR to convert them to children of a common parent object? How can I use DDIwR to add general metadata to document the overview, scope & coverage, sampling... and other attributes common to all the files? Thanks in advance for your feedback.

dusadrian commented 6 months ago

Hello Florent,

I only got three .DTA files using your script, but the question is still the same. In principle, it is difficult to say how to integrate them without knowing more about your datasets, and in any case this is somewhat outside the immediate scope of the DDIwR package. The so-called "Codebook" variant of the DDI is intended to document individual datasets (one at a time). Now, if all of these datasets are part of the same study, there are two options, depending on the particular situation of your research:

1. If the datasets can be combined (merged) into a single one (for instance I see hhid, which I assume is the household ID), then I would try to merge them in R, then convert the resulting R data frame into a consolidated XML file. The command is still the same, something like convert(finalRdata, to = "path/to/ddi.xml")
2. If the datasets cannot be merged because they really are supposed to be separate (despite being from the same study), which I believe is the case, then:
   - you can generate separate XML files and merge them manually (the DDI elements are repeatable) in a suitable text or XML editor
   - you could also read the metadata from each individual Stata file into R, create the DDI elements (also in R) and save the final R version of the DDI Codebook into an XML file on disk
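The first option above (merge first, convert afterwards) could look roughly like the sketch below. The two toy data frames and every column except hhid are invented for illustration. With real DHS recodes the data would first be read from the .DTA files, and whether hhid is the right key for a given pair of files is an assumption to check against the recode documentation.

```r
# Toy stand-ins for two recode files; all values are invented.
households <- data.frame(
  hhid   = c("H01", "H02", "H03"),
  region = c("north", "south", "south"),
  stringsAsFactors = FALSE
)
members <- data.frame(
  hhid = c("H01", "H01", "H02", "H03"),
  age  = c(34, 6, 51, 28),
  stringsAsFactors = FALSE
)

# Keep one row per member and repeat the household columns for each of them.
finalRdata <- merge(members, households, by = "hhid", all.x = TRUE)

# The merged data frame is then converted with the call quoted above
# (left commented out here because it needs DDIwR and writes to disk):
# convert(finalRdata, to = "path/to/ddi.xml")
```

A merge like this only makes sense when the files share a unit of analysis or a clean key; otherwise the second option is the safer route.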

This version of the DDIwR package contains all elements of DDI Codebook 2.6. You can browse them (see for instance ?showDetails) to learn about their structure, create them (see ?makeElement) and add them to parent elements (see ?addChildren). There are more such useful commands in the manual.

I tried to play with your files and for the moment I am getting errors (I don't yet know why, but I will investigate). Hope this helps to get you going, at least for the moment, Adrian

fBedecarrats commented 6 months ago

Hello Adrian, thank you for your reply. I was referring to the Demographic and Health Surveys in my example for three reasons: to my knowledge it is the most widely used standardized household survey in the world (run in more than 90 countries); the DHS program provides "mock-up" survey datasets for tests (downloaded in the reproducible example above); and the DDIs produced from these surveys are used by many online catalogues (NADA or others), such as the International Household Survey Network. See for instance a recent DHS survey entry on the IHSN catalogue that was created with a DDI Codebook 2.5 (hundreds of others can be found by searching "DHS" on this catalogue).

The DDI codebook can be downloaded here, but it seems to have a different structure than what I get from DDIwR: there is a docDscr, a stdyDscr, one fileDscr per Stata file, and a dataDscr that includes one entry per variable, each with a files attribute that refers to the ID of one of the fileDscr elements. I can try to figure it out myself, but I think that it would serve common use cases to provide some guidance on how to prepare a multi-file DDI with your package. I think that it would also be useful to have some handy functions to populate the docDscr and stdyDscr sections.
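To make the layout described above concrete, here is a minimal base R sketch that prints only the nesting of such a multi-file codebook: one fileDscr per data file, and each var pointing back to its file through files. The file names, variable names and IDs are illustrative assumptions, and the output is not a schema-valid DDI document (docDscr and stdyDscr are left empty, and there is no namespace declaration).

```r
# Hypothetical inventory of data files and variables (names and IDs are made up).
files <- data.frame(
  id   = c("F1", "F2"),
  name = c("ZZBR62FL.DTA", "ZZHR62FL.DTA"),
  stringsAsFactors = FALSE
)
vars <- data.frame(
  id    = c("V1", "V2"),
  name  = c("caseid", "hhid"),
  files = c("F1", "F2"),
  stringsAsFactors = FALSE
)

# One <fileDscr> line per data file.
file_xml <- sprintf(
  '  <fileDscr ID="%s"><fileTxt><fileName>%s</fileName></fileTxt></fileDscr>',
  files$id, files$name
)

# One <var> line per variable; files= holds the ID of the owning fileDscr.
var_xml <- sprintf(
  '    <var ID="%s" name="%s" files="%s"/>',
  vars$id, vars$name, vars$files
)

skeleton <- c(
  "<codeBook>",
  "  <docDscr/>",
  "  <stdyDscr/>",
  file_xml,
  "  <dataDscr>",
  var_xml,
  "  </dataDscr>",
  "</codeBook>"
)
cat(skeleton, sep = "\n")
```

The point of the sketch is the cross-reference: every value in vars$files must match an ID of some fileDscr, which is what ties the single dataDscr to several data files.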

dusadrian commented 16 hours ago

Returning a bit to this issue: it is still open and will likely stay open for a "little" while. It requires me to write a guide (either an intro, as in getting started, or probably that plus more advanced topics).

But there actually are handy functions to populate docDscr and stdyDscr. In fact, the latest functions allow one to write the entire codeBook, see for instance: ?makeElement, ?addChildren, ?addAttributes, ?addContent, etc.

The DDI Codebook elements are standard, so the structure of the XML file produced by DDIwR has to be compatible with the IHSN files (it cannot be otherwise). The reason why they seem different must be that the IHSN codebook files are completely documented, while the ones (automatically) produced by DDIwR thoroughly document the variables in the dataDscr element but contain no other information about the study. The other elements of the Codebook have to be created manually (using the above commands), or using a script that makes use of these commands to populate the Codebook from a database.

dusadrian / DDIwR

Consolidating metadata for a collection of files #4