split input entry in training and test dataset

FAIRmat-NFDI / nomad-analysis

This repo contains the standard NOMAD plugin for analysis.

https://fairmat-nfdi.github.io/nomad-analysis/

Apache License 2.0


split input entry in training and test dataset #25

Closed aalbino2 closed 3 months ago

aalbino2 commented 3 months ago

As a next step, we need to randomly split our features and labels into training and test sets. I think this is possible after generating the DataFrame that is ingested by the ML code.

JosePizarro3 commented 3 months ago

I think we should define some meaningful string(s) and refs to fully understand the provenance of any data coming from an analysis. We had the idea of trying source=Quantity(type=MEnum('simulation', 'measurement', 'analysis')), so maybe this could be used or repurposed for this? What do you have in mind?

ka-sarthak commented 3 months ago

@JosePizarro3 I think this is related to the Jupyter notebook that @aalbino2 and I are developing for Ta-Shun's ML use case in Physical Vapor Deposition. This particular issue should not bring any changes to the code quality of this repo.

On the other hand, I am happy to explore the idea of source quantity and have opened a separate issue for this.

ka-sarthak commented 3 months ago

@aalbino2 The generated DataFrame is already split into train and test DataFrames by the sklearn.model_selection.train_test_split function that we use in the notebook. Do you mean something different from this?
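
For reference, a minimal sketch of such a split with sklearn.model_selection.train_test_split. The DataFrame contents and the column names (substrate_temperature, chamber_pressure, film_thickness) are hypothetical placeholders, not taken from the notebook; the test_size and random_state values are likewise only assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the DataFrame generated from the input entries.
# Column names and values are illustrative only.
df = pd.DataFrame(
    {
        "substrate_temperature": [300.0, 350.0, 400.0, 450.0, 500.0,
                                  550.0, 600.0, 650.0, 700.0, 750.0],
        "chamber_pressure": [1.0, 1.2, 0.9, 1.1, 1.3,
                             0.8, 1.0, 1.4, 0.7, 1.2],
        "film_thickness": [10.0, 12.5, 11.0, 14.0, 15.5,
                           13.0, 16.0, 18.5, 15.0, 19.0],
    }
)

# Separate the features (inputs) from the labels (target).
features = df[["substrate_temperature", "chamber_pressure"]]
labels = df["film_thickness"]

# Randomly split into training and test sets.
# test_size=0.2 holds out 20% of the rows; random_state makes the
# shuffle reproducible between runs.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

print(f"train rows: {len(X_train)}, test rows: {len(X_test)}")
```

Features and labels are passed together in one call, so the rows of X_train and y_train (and of X_test and y_test) stay aligned by index after shuffling.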