Kaaiian / CBFV

Tool to quickly create a composition-based feature vector

`generate_features(..., extend_features=...)` InvalidIndexError: Reindexing only valid with uniquely valued Index objects #10

Open sgbaird opened 2 years ago

sgbaird commented 2 years ago
Processing Input Data: 100%|██████████| 1794/1794 [00:00<00:00, 7378.49it/s]
    Featurizing Compositions...
Assigning Features...: 100%|██████████| 1778/1778 [00:00<00:00, 3426.03it/s]
NOTE: Your data contains formula with exotic elements. These were skipped.
    Creating Pandas Objects...

InvalidIndexError                         Traceback (most recent call last)
<ipython-input-45-22826a03d387> in <module>()
      1 from CBFV import composition
----> 2 X, y, formulae, skipped = composition.generate_features(df, extend_features="R")

4 frames
/usr/local/lib/python3.7/dist-packages/CBFV/composition.py in generate_features(df, elem_prop, drop_duplicates, extend_features, sum_feat, mini)
    307         extended = pd.DataFrame(extra_features, columns=features)
    308         extended = extended.set_index('formula', drop=True)
--> 309         X = pd.concat([X, extended], axis=1)
    311     # reset dataframe indices

/usr/local/lib/python3.7/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    313         return wrapper

/usr/local/lib/python3.7/dist-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    305     )
--> 307     return op.get_result()

/usr/local/lib/python3.7/dist-packages/pandas/core/reshape/concat.py in get_result(self)
    526                     obj_labels = obj.axes[1 - ax]
    527                     if not new_labels.equals(obj_labels):
--> 528                         indexers[ax] = obj_labels.get_indexer(new_labels)
    530                 mgrs_indexers.append((obj._mgr, indexers))

/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
   3441         if not self._index_as_unique:
-> 3442             raise InvalidIndexError(self._requires_unique_msg)
   3444         if not self._should_compare(target) and not is_interval_dtype(self.dtype):

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

sgbaird commented 2 years ago

This seems to be caused by repeated chemical formulas in the DataFrame: the `pd.concat` call in the traceback aligns on a formula index, and that index is not unique.
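
A minimal sketch of the suspected cause, using pandas only. The two frames below are made-up stand-ins for `X` and `extended` from the traceback, not CBFV's actual data, and it is an assumption that both are indexed by formula. The point is only that a column-wise `pd.concat` fails when the indexes differ and contain repeated labels, and succeeds once the labels are unique.

```python
import pandas as pd

# Hypothetical stand-in for the feature matrix, indexed by formula.
# "NaCl" appears twice, mimicking repeated formulas in the input data.
X = pd.DataFrame(
    {"feat": [0.1, 0.2, 0.3]},
    index=pd.Index(["NaCl", "NaCl", "KBr"], name="formula"),
)

# Hypothetical stand-in for the extended features; it has one extra row,
# so its index is not identical to X's and pandas has to realign.
extended = pd.DataFrame(
    {"formula": ["NaCl", "NaCl", "KBr", "XeF2"], "R": [1.0, 2.0, 3.0, 4.0]}
).set_index("formula", drop=True)


def try_concat(left, right):
    """Return the exception class name raised by a column-wise concat, or None."""
    try:
        pd.concat([left, right], axis=1)
    except Exception as exc:  # the exact class may differ between pandas versions
        return type(exc).__name__
    return None


print(extended.index.is_unique)  # False
print(try_concat(X, extended))   # name of the error raised by the concat

# Keeping only the first row per formula makes both indexes unique,
# after which the same concat goes through.
X_unique = X[~X.index.duplicated()]
extended_unique = extended[~extended.index.duplicated()]
print(try_concat(X_unique, extended_unique))  # None
```

Dropping duplicates here is only to show the contrast; whether discarding repeated formulas is acceptable depends on the dataset.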

sgbaird commented 2 years ago

A workaround is to skip `extend_features` and instead call `generate_features` in a for loop, once per property of interest, renaming that property's column to "target" each time.

For example:

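from CBFV.composition import generate_features

ys = []
for name in ["property1", "property2", "property3"]:
    # rename the property of interest to "target" for this pass
    X, y, formulae, skipped = generate_features(df.rename(columns={name: "target"}))
    ys.append(y)  # collect the targets returned for each property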