DesiQuintans / tsv2label

tsv2label: Label, describe, rename, and recode datasets using a data dictionary

Add interoperability with `sift` dictionaries? #1

Closed DesiQuintans closed 1 year ago

DesiQuintans commented 1 year ago

One of my other packages, sift, helps me find the right variable in a dataframe of hundreds of columns by letting me fuzzily search through the dataframe's column names, variable labels, factor labels, and unique values. It generates a dictionary for each dataset that it uses to do these searches, and it looks like this:

> library(sift)
> dict <- sift(iris)
ℹ Building dictionary for 'iris'. This only happens when it changes.
✔ Dictionary was built in 0.01 secs.

ℹ Dictionary has 5 columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species.

> dplyr::glimpse(dict)
Rows: 5
Columns: 13
$ colnum      <int> 1, 2, 3, 4, 5
$ varname     <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"
$ var_lab     <chr> "", "", "", "", ""
$ rand_unique <chr> "5.1, 6.8, 4.7, 7.9, 7.1, 6.3, 4.5, 5.2, 7.6, 5.6, 4.6, 7.3, 7.4, 6.2, 6, 4.4, …
$ pct_miss    <dbl> 0, 0, 0, 0, 0
$ pct_nonmiss <dbl> 100, 100, 100, 100, 100
$ type_str    <chr> "double", "double", "double", "double", "factor ×3"
$ all_same    <chr> "No", "No", "No", "No", "No"
$ val_lab     <chr> "NULL", "NULL", "NULL", "NULL", "NULL"
$ fct_lvl     <chr> "NULL", "NULL", "NULL", "NULL", "c(\"setosa\", \"versicolor\", \"virginica\")"
$ class       <chr> "\"numeric\"", "\"numeric\"", "\"numeric\"", "\"numeric\"", "\"factor\""
$ type        <chr> "\"double\"", "\"double\"", "\"double\"", "\"double\"", "\"integer\""
$ haystack    <chr> "Sepal.Length 5.1, 6.8, 4.7, 7.9, 7.1, 6.3, 4.5, 5.2, 7.6, 5.6, 4.6, 7.3, 7.4, …
> 

This is close to being usable by tsv2label; if varname was changed to name and tsv2label was told to look for a column called fct_lvl and use it as deparsed code, then it would work.
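
A minimal sketch of that conversion, using base R only. The data frame below is a hand-built stand-in for two columns of a sift dictionary (it is not produced by calling sift), and evaluating fct_lvl as deparsed code is the idea floated in this comment, not an existing tsv2label feature.

```r
# Hand-built stand-in for the relevant columns of a sift dictionary.
dict <- data.frame(
  varname = c("Sepal.Length", "Species"),
  fct_lvl = c("NULL", "c(\"setosa\", \"versicolor\", \"virginica\")"),
  stringsAsFactors = FALSE
)

# Rename 'varname' to 'name', the column name this comment says tsv2label needs.
names(dict)[names(dict) == "varname"] <- "name"

# Turn each deparsed string back into an R object.
# Only do this with dictionaries you trust, because eval() runs arbitrary code.
lvls <- lapply(dict$fct_lvl, function(code) eval(parse(text = code)))
names(lvls) <- dict$name

lvls$Species            # character vector of factor levels
lvls[["Sepal.Length"]]  # NULL, because this column is not a factor
```

The eval(parse()) step is the part that makes this approach awkward: the factor information lives inside a string of code rather than in something a user can open and edit.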

I don't know how useful this actually is, though. tsv2label is useful because datasets from foreign packages need type conversion and relabelling to get them into R. Once the dataset is in R it can be labelled, manipulated, and then the exact object can be saved with save() or saveRDS(), and that can be distributed to other R users.

I suppose it's useful if you want to redistribute the dataset as a .CSV or some other accessible format? That would be a good enough reason. Alternatively, it would be useful if you had to keep the data in such a format, like if you were particularly dependent on vroom reading a very large CSV.
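
To illustrate the difference between the two distribution routes, here is a small base R sketch. The data frame and file paths are made up for the example.

```r
# A small data frame with a factor whose level order is not alphabetical.
df <- data.frame(
  id   = 1:3,
  size = factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"))
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

# saveRDS() stores the R object itself, including the factor and its level order.
saveRDS(df, rds_path)
from_rds <- readRDS(rds_path)

# write.csv() writes the factor's labels as plain text, so the column type
# and the level order are not recorded in the file.
write.csv(df, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

levels(from_rds$size)  # level order survives the round trip
class(from_csv$size)   # plain character column after re-import
```

The CSV route is where a dictionary earns its keep, because the factor information has to travel separately from the data.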

DesiQuintans commented 1 year ago

Assumptions of this interoperability

  1. The dataset being labelled has been exported from R.
    • If the dataset was a foreign dataset, then you wouldn't have been able to build a sift dictionary for it in the first place: You would have built a dictionary yourself from external resources and used tsv2label's normal workflow.
    • The sift dictionary reflects the state of the data as it exists in R. For example, if you build a dictionary for poker without modifying it, then its dictionary will show that columns C1:CLASS are numeric. But if you run factorise_with_dictionary() to convert those to factors, then the updated sift dictionary will show that they are factors and store their levels/labels.

Implications of this interoperability

  1. R discards the original column contents when it converts to Factor. It stores Factor internally as integers 1:n, with labels for each.
    a. If using a sift dictionary to label a foreign dataset, the levels will be sort(unique(df$col)), and the labels in the dictionary will be applied in that order (this is default R behaviour).
    b. If using a sift dictionary on a dataset that has been exported from R, then its factors will be exported as labels, and I can feed fct_lvl into both levels = and labels = args to return the factor in its expected order.
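
Both cases can be sketched in base R. The vectors codes, dict_labels, exported, and fct_lvl below are invented for the example; only factor() and its levels = and labels = arguments come from R itself.

```r
# (a) Foreign data: the column arrives as raw codes.
codes <- c(3, 1, 2, 1)
dict_labels <- c("low", "mid", "high")  # hypothetical labels from a dictionary

# With no levels = argument, factor() uses sort(unique(codes)) as the levels,
# so the labels are matched to 1, 2, 3 in that order.
f_foreign <- factor(codes, labels = dict_labels)
as.character(f_foreign)

# (b) Data exported from R: the column already holds the labels as text.
exported <- c("high", "low", "mid", "low")
fct_lvl  <- c("low", "mid", "high")     # level order recorded in the dictionary

# The default would sort the labels alphabetically (high, low, mid).
f_default <- factor(exported)

# Passing fct_lvl as both levels and labels restores the intended order.
f_restored <- factor(exported, levels = fct_lvl, labels = fct_lvl)
levels(f_restored)
```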

DesiQuintans commented 1 year ago

Changes to be made on the tsv2label side:

DesiQuintans commented 1 year ago

I did none of the above. Instead, I made sift::save_dictionary() output an index file with factor files. This is:

  1. Better for me because it keeps tsv2label's logic unchanged: Factor info is always stored in TSVs, not in TSVs OR deparsed code across two columns of the index.
  2. Far better for users because it generates TSVs that they can easily view and edit.
  3. Better for the package's stated goal of making dictionaries that are easy to assemble, clean, edit, and track.