Closed by DesiQuintans 1 year ago
`sift` dictionary for it in the first place: You would have built a dictionary yourself from external resources and used `tsv2label`'s normal workflow.

A `sift` dictionary reflects the state of the data as it exists in R. For example, if you build a dictionary for `poker` without modifying it, then its dictionary will show that columns `C1`:`CLASS` are numeric. But if you run `factorise_with_dictionary()` to convert those to factors, then the updated `sift` dictionary will show that they are factors and store their levels/labels.

a. If using a `sift` dictionary to label a foreign dataset, the levels will be `sort(unique(df$col))`, and the labels in the dictionary will be applied in that order (this is default R behaviour).
b. If using a `sift` dictionary on a dataset that has been exported from R, then its factors will be exported as labels, and I can feed `fct_lvl` into both `levels =` and `labels =` args to return the factor in its expected order.

`tsv2label` side:

`factorise_with_dictionary()`

`index` called `ordered_fct` (or something) which will be fed into `factor(ordered = )`.

`index` called `fct_lvl` (or some other name), which will contain the deparsed code for the factor's levels. If this column exists, ignore the `factor_file` column.

`fct_lvl`.

`fct_lvl` into both `levels =` and `labels =` args of `factor()`.

`fct_lvl` into `labels =` only, and let R work out the levels.

I did none of the above; I instead made `sift::save_dictionary()` output an index file with factor files. This is `tsv2label`'s logic unchanged: Factor info is always stored in TSVs, not in TSVs OR deparsed code across two columns of the `index`.
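For reference, the two cases (a and b) above can be sketched with base R's `factor()` alone. This is only an illustration of the default behaviour being described, not code from `sift` or `tsv2label`; the vectors `codes`, `exported`, and the contents of `fct_lvl` are made-up values.

```r
# Hypothetical labels, as they might be stored in a dictionary (assumed values)
fct_lvl <- c("Low", "Medium", "High")

# Case a: a foreign dataset that stores integer codes.
# With no levels supplied, factor() uses sort(unique(x)) as the levels,
# and the labels are applied in that order: 1 = Low, 2 = Medium, 3 = High.
codes <- c(3, 1, 2, 1)
f_a <- factor(codes, labels = fct_lvl)

# Case b: a dataset exported from R, where factors were written out as labels.
# Feeding fct_lvl into both levels and labels restores the intended order.
exported <- c("High", "Low", "Medium", "Low")
f_b <- factor(exported, levels = fct_lvl, labels = fct_lvl)

# Without the levels argument, the order falls back to alphabetical.
f_default <- factor(exported)

levels(f_b)        # "Low" "Medium" "High"
levels(f_default)  # "High" "Low" "Medium"
```

In case b, passing only `labels =` would relabel the alphabetically sorted levels, which would silently mislabel the data; that is why both arguments receive the same vector.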
One of my other packages, `sift`, helps me find the right variable in a dataframe of hundreds of columns by letting me fuzzily search through the dataframe's column names, variable labels, factor labels, and unique values. It generates a dictionary for each dataset that it uses to do these searches, and it looks like this:

This is close to being usable by `tsv2label`; if `varname` was changed to `name` and `tsv2label` was told to look for a column called `fct_lvl` and use it as deparsed code, then it would work.

I don't know how useful this actually is, though. `tsv2label` is useful because datasets from foreign packages need type conversion and relabelling to get them into R. Once the dataset is in R it can be labelled, manipulated, and then the exact object can be saved with `save()` or `saveRDS()`, and that can be distributed to other R users.

I suppose it's useful if you want to redistribute the dataset as a .CSV or some other accessible format? That would be a good enough reason. Alternatively, it would be useful if you had to keep the data in such a format, like if you were particularly dependent on `vroom` reading a very large CSV.
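As a rough sketch of what "deparsed code" in a `fct_lvl` column could mean, the base R round trip below stores a factor's levels as a string and rebuilds the factor from it. It uses only `deparse()`, `parse()`, and `eval()`; the example factor is invented, and nothing here is taken from how `sift` or `tsv2label` actually read or write that column.

```r
# Hypothetical factor standing in for a labelled column held in R
f <- factor(c("Low", "High", "Medium", "Low"),
            levels = c("Low", "Medium", "High"))

# Deparse the levels into one string that fits in a single dictionary cell.
# deparse() may split long output over several strings, so collapse them.
fct_lvl <- paste(deparse(levels(f)), collapse = "")

# Later, turn the stored string back into a character vector.
# eval(parse()) runs arbitrary code, so only do this with trusted dictionaries.
restored <- eval(parse(text = fct_lvl))

# Rebuild the factor from values that were exported as plain labels
exported <- as.character(f)
rebuilt <- factor(exported, levels = restored, labels = restored)

identical(rebuilt, f)
```

Storing the levels as code keeps everything in one index file, at the cost of needing `eval()` on read; keeping factor info in separate TSVs avoids evaluating anything.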