compneurobilbao / ageml

AgeML is a Python package for Age Modeling with Machine Learning made easy.

Apache License 2.0

Data handling #34

Open JGarciaCondado opened 7 months ago

JGarciaCondado commented 7 months ago

The software package currently deals with tabular data only. However, one important aspect has not been dealt with: categorical variables.

To improve this:

We need to add detection of categorical variables in the features, covariates, and factors files.
We need to handle these variables correctly. A commonly used strategy is conversion to one-hot encoding.
In terms of age modelling, we should ensure that these variables are treated appropriately in the scaler.
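
As a starting point for discussion, the three steps above could be sketched as follows. This is a minimal sketch, not the package's implementation: the helper name `encode_and_scale`, the `max_unique` threshold for treating a numeric column as categorical, and the example column names are all assumptions. Categorical columns are one-hot encoded and left out of the scaling step, so only continuous columns are standardized.

```python
import pandas as pd


def encode_and_scale(df: pd.DataFrame, max_unique: int = 5) -> pd.DataFrame:
    """Hypothetical helper: one-hot encode categorical columns, z-score the rest."""
    # Assumption: a column is categorical if it is non-numeric, or if it is
    # numeric but has at most `max_unique` distinct values (e.g. a site code).
    categorical = [
        col for col in df.columns
        if not pd.api.types.is_numeric_dtype(df[col])
        or df[col].nunique() <= max_unique
    ]
    numeric = [col for col in df.columns if col not in categorical]

    # Standardize only the continuous columns (population standard deviation).
    scaled = (df[numeric] - df[numeric].mean()) / df[numeric].std(ddof=0)
    if not categorical:
        return scaled

    # One-hot encode the categorical columns; they bypass the scaler.
    encoded = pd.get_dummies(df[categorical].astype("category"), dtype=float)
    return pd.concat([scaled, encoded], axis=1)


# Made-up example data, for illustration only.
features = pd.DataFrame(
    {
        "volume": [1.0, 2.0, 3.0, 4.0],
        "sex": ["M", "F", "F", "M"],
        "site": [1, 2, 1, 2],
    },
    index=["sub001", "sub002", "sub003", "sub004"],
)
prepared = encode_and_scale(features, max_unique=2)
print(prepared.columns.tolist())
```

The unique-value threshold is a heuristic and can misclassify a continuous variable in a small sample, so letting the user declare categorical columns explicitly may be the safer option.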

Another aspect of data handling is data imputation. Currently, any subject with missing data in any of the files submitted is discarded. However, some basic imputation strategies could be implemented.
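
One basic imputation strategy could look like the sketch below: numeric columns are filled with the column mean and non-numeric columns with the most frequent value. The helper name `impute_basic` and the example data are hypothetical, and the choice of mean and mode is only one of several simple options (median would be another).

```python
import pandas as pd


def impute_basic(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: mean-fill numeric columns, mode-fill the others."""
    filled = df.copy()
    for col in filled.columns:
        if not filled[col].isna().any():
            continue  # nothing to impute in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].mean())
        else:
            modes = filled[col].mode(dropna=True)
            if not modes.empty:  # leave the column untouched if it is all missing
                filled[col] = filled[col].fillna(modes.iloc[0])
    return filled


# Made-up example data, for illustration only.
raw = pd.DataFrame(
    {"volume": [1.0, None, 3.0], "sex": ["F", None, "F"]},
    index=["sub001", "sub002", "sub003"],
)
print(impute_basic(raw))
```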

JGarciaCondado commented 4 months ago

When multiple systems are named and a subject has missing data for one system but not for another, we should remove that subject only when calculating the age model of the system with missing data.
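
A minimal sketch of this per-system filtering, assuming each system is simply a named list of feature columns. The helper name `split_by_system`, the system names, and the column names are hypothetical.

```python
from typing import Dict, List

import pandas as pd


def split_by_system(
    features: pd.DataFrame, systems: Dict[str, List[str]]
) -> Dict[str, pd.DataFrame]:
    """Hypothetical helper: drop subjects with missing data per system only."""
    # Each system keeps every subject that is complete for its own columns,
    # regardless of missing values in the columns of other systems.
    return {name: features[cols].dropna() for name, cols in systems.items()}


# Made-up example: sub002 lacks cardiac data but has brain data.
features = pd.DataFrame(
    {"brain_vol": [1.1, 1.2, 1.3], "heart_rate": [60.0, None, 72.0]},
    index=["sub001", "sub002", "sub003"],
)
per_system = split_by_system(
    features, {"brain": ["brain_vol"], "cardiac": ["heart_rate"]}
)
for name, subset in per_system.items():
    print(name, len(subset))
```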

JGarciaCondado commented 1 month ago

We have also found a new bug. Uploading a .csv whose index is not numeric throws an error. We should test and fix this so that files with a first column named subject and values sub001, sub002, sub003, ... work. Otherwise, we should specify that files must have a column called ID (this would avoid problems, and the ID column should be made the index when loading the .csv). Either way, we should still ensure that the indices can be arbitrary numbers or alphanumeric values.
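
The ID-column option could be sketched roughly as below. The helper name `load_csv` and its `id_column` parameter are hypothetical and this is not the package's current loader. Reading the ID column as a string keeps alphanumeric IDs and leading zeros intact.

```python
import io

import pandas as pd


def load_csv(source, id_column: str = "ID") -> pd.DataFrame:
    """Hypothetical loader: use `id_column` as a string-typed index."""
    # Read the ID column as text so that "sub001" and "007" are both preserved.
    df = pd.read_csv(source, dtype={id_column: str})
    if id_column not in df.columns:
        raise ValueError(f"File must contain a column called '{id_column}'.")
    return df.set_index(id_column)


# Made-up file contents mixing alphanumeric and numeric-looking IDs.
csv_text = "ID,volume\nsub001,1.5\nsub002,2.5\n007,3.0\n"
table = load_csv(io.StringIO(csv_text))
print(table.index.tolist())
```

Passing `id_column="subject"` would cover files that use the other naming mentioned above.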

JGarciaCondado commented 2 weeks ago

When looking at clinical factors, we should not remove every subject that has a NaN in any factor. In many studies, some subjects have some tests and other subjects have others, so we are drastically reducing the number of subjects. I would go for an approach where we report the number of subjects used for each factor but keep as many as possible. Imputation would not be a good strategy here.
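
A rough sketch of that approach, assuming the quantity compared against each factor is a per-subject value such as an age delta held in a pandas Series. The helper name `correlate_per_factor`, the use of a plain Pearson correlation, and the example data are assumptions for illustration.

```python
import pandas as pd


def correlate_per_factor(deltas: pd.Series, factors: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-factor correlation using all available subjects."""
    rows = []
    for name in factors.columns:
        # Drop subjects only if they are missing THIS factor (or the delta).
        paired = pd.concat(
            [deltas.rename("delta"), factors[name].rename("value")],
            axis=1,
            join="inner",
        ).dropna()
        rows.append(
            {
                "factor": name,
                "n_subjects": len(paired),  # reported alongside the result
                "correlation": paired["delta"].corr(paired["value"]),
            }
        )
    return pd.DataFrame(rows).set_index("factor")


# Made-up example: each subject took only some of the tests.
ids = ["sub001", "sub002", "sub003", "sub004"]
deltas = pd.Series([1.0, 2.0, 3.0, 4.0], index=ids)
factors = pd.DataFrame(
    {"test_a": [1.0, 2.0, None, 4.0], "test_b": [None, None, 3.0, 5.0]},
    index=ids,
)
print(correlate_per_factor(deltas, factors))
```

With listwise deletion, this example would keep only sub004. Here test_a uses three subjects and test_b uses two.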