cistrome / MIRA

Python package for analysis of multiomic single-cell RNA-seq and ATAC-seq data.

The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2. #36

Closed christophechu closed 5 months ago

christophechu commented 6 months ago

```
tuner.fit(adata)

File ~/miniforge-pypy3/envs/mira-env/lib/python3.11/site-packages/sklearn/model_selection/_split.py:1746, in BaseShuffleSplit.split(self, X, y, groups)
   1716 """Generate indices to split data into training and test set.
   1717
   1718 Parameters
   (...)
   1743     to an integer.
   1744 """
   1745 X, y, groups = indexable(X, y, groups)
-> 1746 for train, test in self._iter_indices(X, y, groups):
   1747     yield train, test

File ~/miniforge-pypy3/envs/mira-env/lib/python3.11/site-packages/sklearn/model_selection/_split.py:2147, in StratifiedShuffleSplit._iter_indices(self, X, y, groups)
   2145 class_counts = np.bincount(y_indices)
   2146 if np.min(class_counts) < 2:
-> 2147     raise ValueError(
   2148         "The least populated class in y has only 1"
   2149         " member, which is too few. The minimum"
   2150         " number of groups for any class cannot"
   2151         " be less than 2."
   2152     )
   2154 if n_train < n_classes:
   2155     raise ValueError(
   2156         "The train_size = %d should be greater or "
   2157         "equal to the number of classes = %d" % (n_train, n_classes)
   2158     )
```

christophechu commented 6 months ago

I've found that this problem may be occurring because of the continuous covariates set in the model (continuous_covariates, such as % of mitochondrial counts). When I don't set them, it works fine.

AllenWLynch commented 6 months ago

Ah I know what may be causing this issue.

When you provide continuous covariates, those covariates are split into discrete quantiles so that the train and test sets can be stratified across them. In your case, one of those quantiles only contained one data point (possibly an extreme outlier), which caused the error - I had not thought through this edge case.
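For illustration, the sklearn side of this is easy to reproduce in isolation (the labels below are made up, standing in for the tuples of binned covariates):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stratification labels where one class ('c') has a single member --
# the same situation a singleton quantile bin produces.
stratify = ['a'] * 10 + ['b'] * 10 + ['c']

try:
    train_test_split(
        np.arange(len(stratify)),
        train_size=0.8, shuffle=True, stratify=stratify,
    )
except ValueError as err:
    msg = str(err)
    print(msg)  # the "least populated class in y has only 1 member" error
```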

I will work on this, but in the interim, you could pre-discretize your continuous covariates to make sure each group contains more than one sample.
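A minimal sketch of that interim workaround (the covariate name and values here are hypothetical): binning into a few coarse quantiles with pandas leaves no singleton groups, and the binned column can then be passed in place of the raw continuous covariate (exact model arguments depend on your setup).

```python
import numpy as np
import pandas as pd

# Hypothetical continuous covariate, e.g. percent mitochondrial counts,
# with one extreme outlier that would otherwise sit alone in its bin.
rng = np.random.default_rng(0)
pct_mito = pd.Series(rng.exponential(scale=5.0, size=1000))
pct_mito.iloc[0] = 95.0

# Coarse quantile bins; duplicates='drop' guards against ties
# collapsing bin edges.
bins = pd.qcut(pct_mito, q=4, labels=False, duplicates='drop')

# Every bin now holds far more than one cell, so stratification can proceed.
print(bins.value_counts().sort_index())
```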

AL

craniolab commented 3 months ago

Hi,

I'm on the latest version (2.1.1).

I tried setting up the model with continuous covariates to correct for the cell cycle (S and G2M scores) and I still run into this issue, even if I limit it to one of the covariates instead of both. When I remove the continuous covariates, everything runs fine.

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[42], line 1
----> 1 tuner.fit(adata)

File ~/micromamba/envs/mira_env/lib/python3.11/site-packages/mira/topic_model/hyperparameter_optim/trainer.py:769, in BayesianTuner.fit(self, train, test)
    761 cache = DataCache(
    762     self.model, train,
    763     self.study_name.replace('/','-'),
    (...)
    766     train_size = self.train_size
    767 )
    768 logger.info('Writing data cache to: {}'.format(cache.get_cache_path()))
--> 769 train, test = cache.cache_data(overwrite=True)
    770 # automatically write cache
    772 try:

File ~/micromamba/envs/mira_env/lib/python3.11/site-packages/mira/topic_model/base.py:250, in DataCache.cache_data(self, overwrite)
    247 else:
    248     os.mkdir(self.get_cache_path())
--> 250 train, test = self.model.train_test_split(
    251     self.adata,
    252     seed = self.seed,
    253     train_size = self.train_size
    254 )
    256 self.model.write_ondisk_dataset(train, dirname = self.train_cache)
    257 self.model.write_ondisk_dataset(test, dirname = self.test_cache)

File ~/micromamba/envs/mira_env/lib/python3.11/site-packages/mira/topic_model/base.py:494, in BaseModel.train_test_split(self, adata, train_size, seed, stratify)
    486 covariates_bins.append(
    487     self.digitize_continuous_covariate(
    488         adata.obs_vector(covar)
    489     )
    490 )
    492 stratify = list(map(tuple, list(zip(*covariates_bins)) ))
--> 494 train_idx, test_idx = train_test_split(
    495     np.arange(len(adata)),
    496     train_size = train_size,
    497     random_state = seed,
    498     shuffle = True,
    499     stratify = stratify
    500 )
    502 return adata[train_idx], adata[test_idx]

File ~/micromamba/envs/mira_env/lib/python3.11/site-packages/sklearn/utils/_param_validation.py:214, in validate_params..decorator..wrapper(*args, **kwargs)
    208 try:
    209     with config_context(
    210         skip_parameter_validation=(
    211             prefer_skip_nested_validation or global_skip_validation
    212         )
    213     ):
--> 214         return func(*args, **kwargs)
    215 except InvalidParameterError as e:
    216     # When the function is just a wrapper around an estimator, we allow
    217     # the function to delegate validation to the estimator, but we replace
    218     # the name of the estimator by the name of the function in the error
    219     # message to avoid confusion.
    220     msg = re.sub(
    221         r"parameter of \w+ must be",
    222         f"parameter of {func.__qualname__} must be",
    223         str(e),
    224     )

File ~/micromamba/envs/mira_env/lib/python3.11/site-packages/sklearn/model_selection/_split.py:2670, in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
   2666     CVClass = ShuffleSplit
   2668 cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)
-> 2670 train, test = next(cv.split(X=arrays[0], y=stratify))
   2672 return list(
   2673     chain.from_iterable(
   2674         (_safe_indexing(a, train), _safe_indexing(a, test)) for a in arrays
   2675     )
   2676 )

File ~/micromamba/envs/mira_env/lib/python3.11/site-packages/sklearn/model_selection/_split.py:1746, in BaseShuffleSplit.split(self, X, y, groups)
   1716 """Generate indices to split data into training and test set.
   1717
   1718 Parameters
   (...)
   1743     to an integer.
   1744 """
   1745 X, y, groups = indexable(X, y, groups)
-> 1746 for train, test in self._iter_indices(X, y, groups):
   1747     yield train, test

File ~/micromamba/envs/mira_env/lib/python3.11/site-packages/sklearn/model_selection/_split.py:2147, in StratifiedShuffleSplit._iter_indices(self, X, y, groups)
   2145 class_counts = np.bincount(y_indices)
   2146 if np.min(class_counts) < 2:
-> 2147     raise ValueError(
   2148         "The least populated class in y has only 1"
   2149         " member, which is too few. The minimum"
   2150         " number of groups for any class cannot"
   2151         " be less than 2."
   2152     )
   2154 if n_train < n_classes:
   2155     raise ValueError(
   2156         "The train_size = %d should be greater or "
   2157         "equal to the number of classes = %d" % (n_train, n_classes)
   2158     )

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
```

AllenWLynch commented 3 months ago

Hi Tian,

I have seen this issue before: the continuous variables are broken into quantiles for stratification, and one of the quantiles ends up with only one member. I will look into why.

You can circumvent this by broadly discretizing the continuous covariates yourself if you like; this shouldn't change things too much unless the classes are imbalanced.

Let me know if this helps, AL



craniolab commented 3 months ago

I checked the stratification and indeed there are a bunch of strata with only 1 cell. This is mostly because the batch (categorical) strata are combined with the two discretized continuous variables (S and G2M scores).
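For reference, this is roughly how I checked it (column names are made up; the binned scores stand in for whatever digitization the model applies):

```python
import numpy as np
import pandas as pd

# Hypothetical obs table: a categorical batch plus binned S and G2M scores.
rng = np.random.default_rng(0)
obs = pd.DataFrame({
    'batch': np.repeat(['b1', 'b2', 'b3'], 200),
    's_bin': rng.integers(0, 4, 600),
    'g2m_bin': rng.integers(0, 4, 600),
})

# Size of each combined stratum, i.e. what the model stratifies over.
strata = obs.groupby(['batch', 's_bin', 'g2m_bin']).size()

# Strata of size 1 are the ones that break StratifiedShuffleSplit.
print(strata[strata == 1])
```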

The dataset I'm working with is a developmental series, so the proportion of cells in a given cell-cycle phase changes per batch, which makes the stratification difficult; this is complicated further by different sequencing methods between some of the batches. So far, MIRA is producing by far the best integration results.

I tried changing the stratification, and while it worked better, I still got a couple of edge cases that occurred only once. In the end I created the train/test splits myself, moved the edge cases to the training set, and fed that to the tuner, which seemed to work.
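A sketch of that manual split (labels and counts are made up): cells whose stratum has fewer than two members are pulled out before stratifying and then appended to the training set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical strata labels: two balanced groups plus one singleton edge case.
labels = pd.Series(['a'] * 50 + ['b'] * 49 + ['rare'])
idx = np.arange(len(labels))

counts = labels.value_counts()
singleton = labels.isin(counts[counts < 2].index)

# Stratify only the cells whose stratum has >= 2 members...
train_idx, test_idx = train_test_split(
    idx[~singleton], train_size=0.8, random_state=0,
    stratify=labels[~singleton],
)

# ...and force the edge cases into the training set.
train_idx = np.concatenate([train_idx, idx[singleton]])

print(len(train_idx), len(test_idx))
```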