DesiQuintans commented 1 year ago

One of my other packages, sift, helps me find the right variable in a dataframe of hundreds of columns by letting me fuzzily search through the dataframe's column names, variable labels, factor labels, and unique values. It generates a dictionary for each dataset and uses it to do these searches. The dictionary looks like this:

> library(sift)
> dict <- sift(iris)
ℹ Building dictionary for 'iris'. This only happens when it changes.
✔ Dictionary was built in 0.01 secs.

ℹ Dictionary has 5 columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species.

> dplyr::glimpse(dict)
Rows: 5
Columns: 13
$ colnum      <int> 1, 2, 3, 4, 5
$ varname     <chr> "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"
$ var_lab     <chr> "", "", "", "", ""
$ rand_unique <chr> "5.1, 6.8, 4.7, 7.9, 7.1, 6.3, 4.5, 5.2, 7.6, 5.6, 4.6, 7.3, 7.4, 6.2, 6, 4.4, …
$ pct_miss    <dbl> 0, 0, 0, 0, 0
$ pct_nonmiss <dbl> 100, 100, 100, 100, 100
$ type_str    <chr> "double", "double", "double", "double", "factor ×3"
$ all_same    <chr> "No", "No", "No", "No", "No"
$ val_lab     <chr> "NULL", "NULL", "NULL", "NULL", "NULL"
$ fct_lvl     <chr> "NULL", "NULL", "NULL", "NULL", "c(\"setosa\", \"versicolor\", \"virginica\")"
$ class       <chr> "\"numeric\"", "\"numeric\"", "\"numeric\"", "\"numeric\"", "\"factor\""
$ type        <chr> "\"double\"", "\"double\"", "\"double\"", "\"double\"", "\"integer\""
$ haystack    <chr> "Sepal.Length 5.1, 6.8, 4.7, 7.9, 7.1, 6.3, 4.5, 5.2, 7.6, 5.6, 4.6, 7.3, 7.4, …
>

This is close to being usable by tsv2label; if varname was changed to name and tsv2label was told to look for a column called fct_lvl and use it as deparsed code, then it would work.
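
As a rough sketch of those two changes (base R only; `parse_levels()` is a hypothetical helper, not something sift or tsv2label provides, and the two-row `dict` is a hand-made stand-in built from the glimpse() output above):

```r
# Stand-in for two rows of a sift dictionary, with values copied from the
# glimpse() output above.
dict <- data.frame(
  varname = c("Petal.Width", "Species"),
  fct_lvl = c("NULL", "c(\"setosa\", \"versicolor\", \"virginica\")"),
  stringsAsFactors = FALSE
)

# Rename `varname` to `name`, the column name mentioned above.
names(dict)[names(dict) == "varname"] <- "name"

# Hypothetical helper: turn deparsed code back into an R object.
# eval(parse()) runs arbitrary code, so only use it on dictionaries you trust.
parse_levels <- function(code) eval(parse(text = code))

levels_list <- lapply(dict$fct_lvl, parse_levels)
names(levels_list) <- dict$name

levels_list$Species  # "setosa" "versicolor" "virginica"
```

Non-factor columns come back as NULL, which is how the dictionary marks them in fct_lvl.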

I don't know how useful this actually is, though. tsv2label is useful because datasets from foreign packages need type conversion and relabelling to get them into R. Once the dataset is in R, it can be labelled and manipulated, the exact object can be saved with save() or saveRDS(), and that file can be distributed to other R users.

I suppose it's useful if you want to redistribute the dataset as a .CSV or some other accessible format? That would be a good enough reason. Alternatively, it would be useful if you had to keep the data in such a format, like if you were particularly dependent on vroom reading a very large CSV.
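
To illustrate the difference, here is a small base R sketch (the `df` and its `size` column are made up for the example). An RDS round trip keeps the factor and its level order, while a CSV round trip returns plain text that would need relabelling:

```r
# Made-up data: a factor whose level order is not alphabetical.
df <- data.frame(size = factor(c("high", "low"),
                               levels = c("low", "medium", "high")))

rds <- tempfile(fileext = ".rds")
csv <- tempfile(fileext = ".csv")

saveRDS(df, rds)
write.csv(df, csv, row.names = FALSE)

from_rds <- readRDS(rds)
from_csv <- read.csv(csv, stringsAsFactors = FALSE)

levels(from_rds$size)  # "low" "medium" "high"
class(from_csv$size)   # "character"
```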

DesiQuintans commented 1 year ago

Assumptions of this interoperability

The dataset being labelled has been exported from R.
- If the dataset was a foreign dataset, then you wouldn't have been able to build a sift dictionary for it in the first place: You would have built a dictionary yourself from external resources and used tsv2label's normal workflow.
- The sift dictionary reflects the state of the data as it exists in R. For example, if you build a dictionary for poker without modifying it, then its dictionary will show that columns C1:CLASS are numeric. But if you run factorise_with_dictionary() to convert those to factors, then the updated sift dictionary will show that they are factors and store their levels/labels.

Implications of this interoperability

R discards the original column contents when it converts to Factor. It stores Factor internally as integers 1:n, with labels for each.

a. If using a sift dictionary to label a foreign dataset, the levels will be sort(unique(df$col)), and the labels in the dictionary will be applied in that order (this is default R behaviour).

b. If using a sift dictionary on a dataset that has been exported from R, then its factors will be exported as labels, and I can feed fct_lvl into both levels = and labels = args to return the factor in its expected order.
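
A small base R sketch of both cases, using made-up values (`codes`, `exported`, and a three-level `fct_lvl` chosen so that its order differs from alphabetical order):

```r
fct_lvl <- c("low", "medium", "high")

# Case a: a foreign dataset stores codes. With only labels =, R takes
# sort(unique(x)) as the levels and applies the labels in that order.
codes <- c(3, 1, 2, 1)
a <- factor(codes, labels = fct_lvl)
as.character(a)  # "high" "low" "medium" "low"

# Case b: data exported from R stores the labels themselves. Passing fct_lvl
# to both levels = and labels = restores the expected order.
exported <- c("high", "low", "medium", "low")
b <- factor(exported, levels = fct_lvl, labels = fct_lvl)
levels(b)        # "low" "medium" "high"

# Using labels = alone on exported labels silently relabels the values,
# because the sorted levels ("high", "low", "medium") no longer line up.
wrong <- factor(exported, labels = fct_lvl)
as.character(wrong)  # "low" "medium" "high" "medium"
```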

DesiQuintans commented 1 year ago

Changes to be made on the `tsv2label` side:

[ ] Changes to factorise_with_dictionary()
- [ ] Look for a column in index called ordered_fct (or something) which will be fed into factor(ordered = ).
- [ ] Look for a column in index called fct_lvl (or some other name), which will contain the deparsed code for the factor's levels. If this column exists, ignore the factor_file column.
- [ ] Check whether a sample of unique values from the column exists as levels in fct_lvl.
  - [ ] If they do, then feed fct_lvl into both levels = and labels = args of factor().
  - [ ] If they don't, then feed fct_lvl into labels = only, and let R work out the levels.
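
The checklist above could be sketched roughly like this. `factorise_from_fct_lvl()` and its `n_sample` argument are hypothetical names for illustration, not tsv2label functions, and taking the first few unique values is only one possible way to "sample":

```r
# Hypothetical helper sketching the proposed logic; not part of tsv2label.
factorise_from_fct_lvl <- function(x, fct_lvl, ordered = FALSE, n_sample = 20) {
  # Take a small sample of the column's unique, non-missing values.
  uniq    <- unique(x[!is.na(x)])
  sampled <- utils::head(uniq, n_sample)

  if (all(as.character(sampled) %in% fct_lvl)) {
    # Values are already labels: use fct_lvl for both levels and labels.
    factor(x, levels = fct_lvl, labels = fct_lvl, ordered = ordered)
  } else {
    # Values are codes: let R work out the levels and apply labels in order.
    # factor() errors here if the number of unique values differs from
    # length(fct_lvl).
    factor(x, labels = fct_lvl, ordered = ordered)
  }
}

fct_lvl <- c("low", "medium", "high")
from_labels <- factorise_from_fct_lvl(c("high", "low"), fct_lvl)
from_codes  <- factorise_from_fct_lvl(c(2, 1, 3), fct_lvl, ordered = TRUE)
```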

DesiQuintans commented 1 year ago

I did none of the above; I instead made sift::save_dictionary() output an index file with factor files. This is:

- Better for me because it keeps tsv2label's logic unchanged: Factor info is always stored in TSVs, not in TSVs OR deparsed code across two columns of the index.
- Far better for users because it generates TSVs that they can easily view and edit.
- Better for the package's stated goal of making dictionaries that are easy to assemble, clean, edit, and track.

DesiQuintans / tsv2label

Add interoperability with `sift` dictionaries? #1
