SamsungSAILMontreal / ForestDiffusion

Generating and Imputing Tabular Data via Diffusion and Flow XGBoost Models

Error ("missing values are not allowed in subscripted assignments of data frames") #6

Closed ashwanijha1 closed 9 months ago

ashwanijha1 commented 10 months ago

Hi,

Thanks for providing this tool. I am using the R version and have managed to get it working on the Iris dataset (and another personal dataset) just fine.

With one particularly large dataset, the model finished training after a few days, but when I try to generate or impute samples from it I get the following error:

Error in `[<-.data.frame`(`*tmp*`, , names_with_prefix[1], value = c(1L, : missing values are not allowed in subscripted assignments of data frames

Any ideas please?

Thanks, Ash

Code run:


forest_model = ForestDiffusion(X=Traindata, n_cores=4, n_t=50, duplicate_K=100, flow=FALSE, seed=123)
X_fake = ForestDiffusion.generate(forest_model, batch_size=1, seed=113)

System details:


platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 4
minor 3.2
year 2023
month 10
day 31
svn rev 85441
language R
version.string R version 4.3.2 (2023-10-31)
nickname Eye Holes

AlexiaJM commented 10 months ago

Hi Ash,

This bug seems to be related to the names of the variables. Can you verify that you are giving it a data frame with clean names? It shouldn't cause problems, but I'm just making sure.

It seems that there is also a weird bug with batch_size=1 and generate. I get a bug when doing Iris with batch_size <= 3. I recommend you use a large enough batch_size value in the generate function to prevent this bug.

Otherwise, I'm not sure what this error could be.

I assume you cannot use a larger n_cores due to memory, but if you can, bumping n_cores to 8 (if you have 8+ cpus) would likely double your training speed. Btw, if your data is huge, you may not need as much duplicate_K. Of course high duplicate_K is better, but you would probably get good results with duplicate_K=10 if the data is big enough and it would train much faster.

ashwanijha1 commented 10 months ago

Hi Alexia,

Thanks for the quick response!

I did have some 'unclean' variable names, but the same error occurs when I use clean column names (e.g. "DischargeType1").

I think the error is in ForestDiffusion.clean_dummy_data and will occur when there are both categorical and non-categorical data types in the dataset.

I would be grateful if you could have a look?

thanks, Ash

My partial debugging attempt below:

function (object, X) {
  if (length(object$cat_indexes) > 0) {
    prefixes_suffixes = strsplit(names(X), "\\.")
    prefixes = vector("character", NCOL(X))
    for (i in 1:NCOL(X)) {
      prefixes = prefixes_suffixes[[i]][1] # should this be prefixes[i] = prefixes_suffixes[[i]][1]
    }
    unique_prefixes = unique(prefixes)
    for (i in 1:length(unique_prefixes)) {
      names_with_prefix = names(X)[grepl(paste0(unique_prefixes[i], "."), names(X))] # this returns an empty vector for binary categories and continuous data, OK for multi-category data
      cat_vars = cbind(rep(0.5, NROW(X)), X[, names_with_prefix]) # this returns the error in question if names_with_prefix is empty
      max_index = max.col(cat_vars)
      X[, names_with_prefix[1]] = max_index
      name_no_suffix = strsplit(names(X)[names(X) == names_with_prefix[1]], "\\.")[[1]][1]
      names(X)[names(X) == names_with_prefix[1]] = name_no_suffix
      X = X[, !grepl(paste0(unique_prefixes[i], "."), names(X))]
    }
  }
  return(X)
}
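
To make the two annotations above concrete, here is a minimal, self-contained sketch. The column names and the tiny data frame are hypothetical and are not taken from the package or from my dataset. It only illustrates how a scalar assignment in the loop loses all but the last prefix, and how indexing a data frame with the first element of an empty name vector (which is NA) fails.

```r
# Hypothetical dummy-coded column names: one continuous column and one
# three-level categorical column expanded to cat.1, cat.2, cat.3.
col_names <- c("age", "cat.1", "cat.2", "cat.3")
parts <- strsplit(col_names, "\\.")

# Scalar assignment: each iteration overwrites the whole variable,
# so only the prefix of the last column survives.
prefixes_scalar <- vector("character", length(col_names))
for (i in seq_along(col_names)) {
  prefixes_scalar <- parts[[i]][1]
}

# Indexed assignment: one prefix is stored per column.
prefixes_indexed <- vector("character", length(col_names))
for (i in seq_along(col_names)) {
  prefixes_indexed[i] <- parts[[i]][1]
}

print(prefixes_scalar)           # a single string
print(unique(prefixes_indexed))  # one entry per original variable

# If no column matches, the name vector is empty and its first element is NA.
# Assigning to a data frame column indexed by NA raises an error.
X <- data.frame(age = c(30, 40), cat.1 = c(1, 0))
empty_names <- character(0)
msg <- tryCatch({
  X[, empty_names[1]] <- c(1L, 2L)
  "ok"
}, error = function(e) conditionMessage(e))
print(msg)
```

Guarding the inner loop with a check such as length(names_with_prefix) > 0 would be one possible way to skip variables that were never dummy-coded, though I don't know whether that is the right fix for the package.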

AlexiaJM commented 10 months ago

Hi Ash,

I made the fixes to the clean_dummy_data function in the R package. You were correct in your debugging attempt; thanks, that was very helpful in making the bug fix! It should work properly now. Let me know if you run into other issues.

Btw, I just made some major updates to the Python library, making it memory-efficient. If you are dealing with large datasets (N > 20K), I recommend trying it out.

library(ForestDiffusion)

# Load iris
data(iris)
# variables 1 to 4 are the input X
# variable 5 (iris$Species) is the outcome (class with 3 labels)

# Add NAs (but not to label) to emulate having a dataset with missing values
iris[,1:4] = missForest::prodNA(iris[,1:4], noNA = 0.2)

# Setup data
X = data.frame(iris[,1:4])
y = iris$Species
Xy = iris
plot(Xy)

# Add new categorical variable
Xy$caty = 1
Xy$caty[1:25] = 0
Xy$caty = factor(Xy$caty) 

forest_model = ForestDiffusion(X=Xy, n_cores=2, n_t=2, duplicate_K=1, flow=TRUE, seed=666)
Xy_fake = ForestDiffusion.generate(forest_model, batch_size=NROW(Xy), seed=666) # breaks in the old code, but works now
plot(Xy_fake)

ashwanijha1 commented 10 months ago

Hi Alexia,

Thanks for sorting this. I still get another error, though, this time in ForestDiffusion.clip_extremes_clean:

Error in factor(X[, i], labels = object$cat_levels[[j]]) : invalid 'labels'; length 4 should be 1 or 8

What happens is that this assigns more categories to a categorical variable (in my case an ordinal variable) than are present in the original data.

I'm not sure why this is, but I found the following issue upstream in ForestDiffusion.clean_dummy_data, which seems to output the wider dummy-coded X for ordinal categorical variables (i.e. with var, var.2, var.3, etc.). So when ForestDiffusion.clip_extremes_clean performs the following, object$int_indexes is applied to the wider dummy-coded X (and so some categorical variables are rounded to integers).

for (i in object$int_indexes) { X[, i] = round(X[, i], 0) }

But that doesn't explain why your unit test still works... please could you have another look?

There is also a warning about saving XGBoost models, which I will raise separately.

Thanks, Ash

AlexiaJM commented 10 months ago

Hi Ash,

I found the bugs and fixed them; everything should work properly now with categorical variables. The re-conversion from numeric to factor variables was bugged, and I made sure that it works properly now.

Oh, and the bug only appeared when a class was rare and not generated. In that case we got the error. Now, even if a class is rare and is not generated, it will not raise an error.
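
As a rough illustration of how this kind of error can arise in base R (a sketch with hypothetical codes and class names, not the package's actual code): when factor() is given labels but no levels, it infers the levels from the values present, so a class missing from the batch makes the label count mismatch.

```r
# Hypothetical integer codes produced for a three-class variable;
# class 3 happens not to appear in this generated batch.
codes <- c(1, 2, 2, 1, 1)
class_names <- c("A", "B", "C")

# Without `levels`, factor() infers two levels from the data, so three
# labels do not fit and R raises an "invalid 'labels'" error.
res <- tryCatch(factor(codes, labels = class_names),
                error = function(e) conditionMessage(e))
print(res)

# Passing `levels` explicitly keeps the unseen class as an empty level.
f <- factor(codes, levels = 1:3, labels = class_names)
print(levels(f))
print(table(f))
```

Supplying the levels explicitly means the result no longer depends on which classes happen to be sampled.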

Here is an example that had your bug before, but doesn't anymore (even though the rare class 'C' from Xy$catc is never generated in this specific generated batch of fake samples):

library(ForestDiffusion)

set.seed(1) 

# Load iris
data(iris)
# variables 1 to 4 are the input X
# variable 5 (iris$Species) is the outcome (class with 3 labels)

# Setup data
X = data.frame(iris[,1:4])
y = iris$Species
Xy = iris
plot(Xy)

# Add new categorical variable
Xy$cata = 1
Xy$cata[1:25] = 0
Xy$cata = factor(Xy$cata)

Xy$catb = "A"
Xy$catb[1:45] = "B"
Xy$catb[48:70] = "C"
Xy$catb = factor(Xy$catb)

Xy$catc = "X"
Xy$catc[1:15] = "B"
Xy$catc[48:52] = "C"
Xy$catc = factor(Xy$catc)

# Add NAs (but not to label) to emulate having a dataset with missing values
Xy = missForest::prodNA(Xy, noNA = 0.1)

forest_model = ForestDiffusion(X=Xy, n_cores=2, n_t=2, duplicate_K=1, flow=TRUE, seed=666)
Xy_fake = ForestDiffusion.generate(forest_model, batch_size=NROW(Xy), seed=3) # breaks in the old code, but works now
plot(Xy_fake)

ashwanijha1 commented 9 months ago

Hi Alexia,

Thanks for this - sorry for the late reply. The ForestDiffusion.generate code now runs without flagging an error.

But I have noticed some issues with the samples generated (on my own dataset, not using Iris, using a very low n_t=5 and duplicate_K=5 for testing):

Could you have a look again please?

thanks, Ash

AlexiaJM commented 9 months ago

I mistakenly deleted my response. I was saying that I couldn't find a problem with the code. One possible issue could be the ordinal variable if the number of categories is massive, or the fact that ForestDiffusion automatically removes rows that are filled with only NA. Otherwise, I don't know what the problem could be. If you manage to make a toy case where it breaks, I would be able to diagnose the problem, but for now I cannot. I recommend treating the ordinal variables as continuous (the software will automatically round 'integer variables' (continuous variables without decimals)).
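
A minimal sketch of that last suggestion, using a hypothetical ordered variable named severity. The conversion is plain base R and does not call ForestDiffusion itself: the ordinal column is turned into whole-number ranks before fitting and mapped back to labels afterwards.

```r
# Hypothetical ordinal variable stored as an ordered factor.
severity <- factor(c("mild", "severe", "moderate", "mild"),
                   levels = c("mild", "moderate", "severe"), ordered = TRUE)
dat <- data.frame(age = c(31.5, 47.2, 52.0, 28.9), severity = severity)

# Replace the factor by its numeric rank (1 = lowest level), keeping the labels.
severity_levels <- levels(dat$severity)
dat$severity <- as.numeric(dat$severity)
print(dat)

# ... fit and generate with the numeric column here ...

# Map whole-number values back to the original ordered labels afterwards.
restored <- factor(severity_levels[dat$severity],
                   levels = severity_levels, ordered = TRUE)
print(restored)
```

Generated values would need to be whole numbers within 1..length(severity_levels) before the mapping back is valid.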

ashwanijha1 commented 9 months ago

Thanks for looking into this Alexia,

I've generated a dummy dataset where I get a (possibly related) error in ForestDiffusion.generate, I think during ForestDiffusion.clean_dummy_data:

Error in `[.data.frame`(X, , i) : undefined columns selected

Can I email you the dataset pls?

Thanks, Ash

AlexiaJM commented 9 months ago

sure

AlexiaJM commented 9 months ago

Hi Ash,

Thanks to your data, I was able to find the bug and fix a few other things!

1) grepl (a string search function) treats its pattern as a regular expression by default, so the "." in "Var_1." matched any character. It was therefore effectively searching for "Var_1" followed by anything, and the variables "Var_11", "Var_12", etc. were picked when they shouldn't have been. I have fixed this bug by telling grepl to match symbols literally in its search.

2) The code assumed that factor levels were in the order in which values first appear in the data, i.e. that a factor built from c(3,2,2,3,1) would have levels c(3,2,1). But your data had factor variables such as c(3,2,2,3,1) with levels c(1,2,3). I fixed the issue so that factor variables use the correct levels.

3) The code did not handle ordered variables and automatically converted everything to regular unordered factors. This is fixed now.
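
For reference, the behaviours behind points 1 to 3 can be reproduced in base R with a few lines. This is only a sketch with made-up column names and values, not the code from the package.

```r
# Hypothetical column names, chosen to mirror the "Var_1" vs "Var_11" clash.
col_names <- c("Var_1.1", "Var_1.2", "Var_11", "Var_12")

# Default: the pattern is a regular expression, so "." matches any character.
regex_hits <- col_names[grepl("Var_1.", col_names)]

# fixed = TRUE treats the pattern as a literal string, so "." must be a dot.
literal_hits <- col_names[grepl("Var_1.", col_names, fixed = TRUE)]

print(regex_hits)
print(literal_hits)

# The level order of a factor need not match the order of first appearance.
f <- factor(c(3, 2, 2, 3, 1), levels = c(1, 2, 3))
print(levels(f))      # the declared level order
print(as.integer(f))  # integer codes index into levels(f)

# An ordered factor carries its ordering in its class.
o <- factor(c("low", "high", "mid"),
            levels = c("low", "mid", "high"), ordered = TRUE)
print(is.ordered(o))
print(o[1] < o[2])
```

Reading the levels with levels() and checking is.ordered() before converting avoids relying on how the values happen to be arranged in the data.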

Everything should work great now!

Btw, it's not important to ForestDiffusion, but I would recommend that you trim some of the variables you have. If you run polycor::hetcor(data), which calculates the correlations between your variables, you will see some NA in Var_15 and Var_17. This is because there are too few values without missing data to calculate the correlations. I recommend removing extremely rare categories like this and also possibly removing highly collinear variables (those with > .90 correlation), as they could cause problems in your analyses. This is just advice from my previous years as a biostatistician; you don't have to follow it.
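
If it helps, here is a small sketch of the collinearity check on simulated numeric data. It uses base R's cor() rather than polycor::hetcor to stay dependency-free, so it only applies to numeric columns; the variable names and the data are hypothetical.

```r
set.seed(1)
# Hypothetical numeric data: x2 is nearly a copy of x1, x3 is independent.
x1 <- rnorm(100)
dat <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(100, sd = 0.05),
                  x3 = rnorm(100))

# Pairwise correlations, using whatever complete pairs are available.
cm <- cor(dat, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) and flag |r| > 0.90.
high <- which(abs(cm) > 0.90 & upper.tri(cm), arr.ind = TRUE)
flagged <- data.frame(var1 = rownames(cm)[high[, "row"]],
                      var2 = colnames(cm)[high[, "col"]],
                      r = cm[high])
print(flagged)
```

For each flagged pair, you would then decide which of the two variables to keep.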

Alexia

ashwanijha1 commented 9 months ago

This all works great now, thanks for your fantastic help Alexia!

Just to let you know that if there is a particular issue in the data as you describe (a factor level is specified but doesn't appear in the data), then ForestDiffusion.generate throws an error. But this is a problem with the input data.

Thanks for the other advice about the data - this wasn't a real dataset I gave you but a shuffled subsample so any correlations are random etc

Ash