SINGROUP / dscribe

DScribe is a python package for creating machine learning descriptors for atomistic systems.
Apache License 2.0

Feature vector - "feature names" #68

Open materialsguy opened 2 years ago

materialsguy commented 2 years ago


I'm currently analysing somebody else's machine learning model, which was trained on SOAP feature vectors. The code that generates the feature vectors looks something like this:

soap = SOAP(species=species, periodic=True, rcut=2.5, nmax=8, lmax=8, average="inner", sparse=False)
feature_vectors = soap.create(atoms, n_jobs=1)

Here species is a set holding the different element names, and atoms is a list of Atoms objects such as: Atoms(symbols='O18Al12', pbc=True, cell=[[4.76, 0.0, 0.0], [-2.379999999999999, 4.122280922013928, 0.0], [0.0, 0.0, 12.993]], spacegroup_kinds=...). The feature_vectors are then converted into a rather large pandas DataFrame with 1109304 columns.
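As a sanity check on that column count, here is a minimal sketch of where a number like 1109304 can come from. It assumes the usual SOAP power spectrum layout with all cross-species terms kept: one value per unordered pair of (species, n) channels and per l. The helper name soap_feature_count is made up for illustration and is not part of dscribe.

```python
def soap_feature_count(n_species, nmax, lmax):
    """Assumed SOAP power spectrum size with cross-species terms kept."""
    n_channels = n_species * nmax                 # one radial channel per (species, n)
    n_pairs = n_channels * (n_channels + 1) // 2  # unordered channel pairs
    return n_pairs * (lmax + 1)                   # one value per l = 0..lmax


# Which number of species reproduces 1109304 columns for nmax=8, lmax=8?
n_species = next(
    s for s in range(1, 120) if soap_feature_count(s, nmax=8, lmax=8) == 1109304
)
print(n_species)
```

Under this assumption the column count corresponds to 62 species in the species set, so each column would stand for one combination of two (species, n) channels and one l value.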

Is there a way to find out the feature names (the physical meaning) of the individual values in a feature vector? For me, each vector is currently "just" a row in a DataFrame without any column descriptions, and the model is built on top of it. My analysis yields a kind of feature importance per column, so it would be useful to know what each column represents physically.

Thank you very much.

Best regards,


lauri-codes commented 2 years ago

Hi @materialsguy!

This is an excellent topic. Some time ago I saw something similar in matminer, where you can call feature_labels() to get some information about the features. I do have this as one of the TODOs in our kanban, but as of now it is not directly possible.

In practice, implementing it should be fairly straightforward, but I cannot give any timeline for this. It is possible to reverse-engineer some of the label information with the get_location() method, which gives the slice for a given species pair. However, it does not currently support getting the location of specific (l, n) values.
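A minimal sketch of that reverse-engineering idea: loop over all species pairs, ask for each pair's slice, and label every column with its pair and its offset inside the pair's block. The helper label_columns is hypothetical, and the demo slices below are invented stand-ins so the sketch runs without dscribe. It also assumes get_location accepts a tuple of two element symbols.

```python
from itertools import combinations_with_replacement


def label_columns(species, get_location, n_features):
    """Return one label per column, e.g. 'Al-O[1]' for offset 1 in the Al-O block."""
    labels = [None] * n_features
    for a, b in combinations_with_replacement(sorted(species), 2):
        loc = get_location((a, b))  # slice covering this species pair
        for offset, col in enumerate(range(*loc.indices(n_features))):
            labels[col] = f"{a}-{b}[{offset}]"
    return labels


# Invented stand-in for soap.get_location; real slices come from the descriptor.
demo_slices = {
    ("Al", "Al"): slice(0, 3),
    ("Al", "O"): slice(3, 6),
    ("O", "O"): slice(6, 9),
}
demo_labels = label_columns({"Al", "O"}, demo_slices.__getitem__, 9)
print(demo_labels)
```

With a real descriptor the call would look roughly like label_columns(species, soap.get_location, soap.get_number_of_features()), and the resulting list could be assigned to the DataFrame's columns. The offset inside each block stays an opaque index, since the (l, n) position is not exposed.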

materialsguy commented 2 years ago

Thank you for the quick reply. I also think such an implementation would really help from a feature engineering and feature analysis perspective, especially when the analysis is done by somebody who does not have full knowledge of the physical meaning of the feature vectors. Please let me know once you have implemented it.

I will have a look at the get_location() method.
