GSTT-CSC / MLOps

Framework for building ML apps
GNU General Public License v3.0

improvements to XNAT data fetching #153

Open laurencejackson opened 9 months ago

laurencejackson commented 9 months ago

Currently the XNAT data is fetched using a set of functions that are specific to csc-mlops.

It would be good to investigate whether we can create a class that inherits from the base torch dataset type to facilitate integration with other torch tools. This would remove a lot of boilerplate and let us roll validation functions into the dataset object.

For example, something like this would let us use multiple XNAT projects easily. We can inherit most of the dataset functionality from CacheDataset (the cache can be disabled by setting cache_rate to 0.0).

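from monai.data import CacheDataset

class XNATDataset(CacheDataset):
    def __init__(self, xnat_configuration, **kwargs):
        # Keep the XNAT connection details; how the data list is built from them is still to be worked out
        self.xnat_configuration = xnat_configuration
        # Remaining arguments (data, transform, cache_rate, ...) pass through to CacheDataset
        super().__init__(**kwargs)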
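
To make the multi-project idea a bit more concrete, here is a minimal, dependency-free sketch of the map-style protocol (`__len__` and `__getitem__`) that a torch dataset exposes. It is only an illustration under assumptions: `MultiProjectDataset`, the `fetch_project` callable and the `subject_id`/`project` keys are hypothetical stand-ins and not part of csc-mlops, MONAI or XNAT. A real version would subclass `CacheDataset` as proposed above.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence

Sample = Dict[str, Any]


class MultiProjectDataset:
    """Hypothetical stand-in that pools samples from several XNAT projects."""

    def __init__(
        self,
        project_names: Sequence[str],
        fetch_project: Callable[[str], List[Sample]],  # assumed helper, supplied by the caller
        transform: Optional[Callable[[Sample], Sample]] = None,
    ) -> None:
        self.transform = transform
        self.samples: List[Sample] = []
        for name in project_names:
            # Tag each sample with its source project so it can be traced later.
            for sample in fetch_project(name):
                self.samples.append({**sample, "project": name})

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int) -> Sample:
        sample = self.samples[index]
        return self.transform(sample) if self.transform else sample


def fake_fetch(name: str) -> List[Sample]:
    # Placeholder for a real XNAT query; returns two dummy subjects per project.
    return [{"subject_id": f"{name}-{i}"} for i in range(2)]


dataset = MultiProjectDataset(["project_a", "project_b"], fake_fetch)
```

Passing the fetch function in as an argument keeps the XNAT-specific code separate from the dataset logic, which should also make the dataset easier to test without a live XNAT server.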

The dataset could include functions for validating data (e.g. checking that all subjects return appropriate data objects). It could then be used like this:

from torch.utils.data import DataLoader

from mlops.data import XNATDataset  # proposed class, does not exist yet

training_data = XNATDataset(project_name, actions, xnat_configuration, transforms, workers, etc)

test_data = XNATDataset(holdout_data_project_name, actions, xnat_configuration, transforms, workers, etc)

train_dl = DataLoader(training_data)
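
The validation idea mentioned above could look roughly like the sketch below. It assumes each sample is a dict and that the keys of interest are called `image` and `label`; both the key names and the `validate_samples` helper are hypothetical and only meant to show the shape of such a check.

```python
from typing import Any, Dict, List, Sequence, Tuple


def validate_samples(
    dataset: Sequence[Dict[str, Any]],
    required_keys: Tuple[str, ...] = ("image", "label"),  # assumed key names
) -> List[str]:
    """Return a list of readable problems; an empty list means every sample passed."""
    problems: List[str] = []
    for index in range(len(dataset)):
        try:
            sample = dataset[index]
        except Exception as exc:  # e.g. a fetch or transform failure for this subject
            problems.append(f"sample {index}: failed to load ({exc})")
            continue
        for key in required_keys:
            if sample.get(key) is None:
                problems.append(f"sample {index}: missing '{key}'")
    return problems


example = [
    {"image": "scan_0.nii", "label": "mask_0.nii"},
    {"image": "scan_1.nii"},  # label left out on purpose
]
report = validate_samples(example)
```

Collecting all problems into a list, rather than raising on the first one, would let us report every bad subject in a project in a single pass.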

This would require some exploratory work to check that it all looks good and works with PyTorch Lightning, MONAI etc., but it would be really useful in simplifying the DataModule structure.