Filter out unneeded series from studies

stefpiatek commented 5 months ago

Definition of Done / Acceptance Criteria

As a researcher, I'd like to receive only the series I'm interested in, rather than every series in a study indiscriminately. One example is excluding positioning series (also known as localisers).

[ ] Implement feature to robustly filter out series based on series description
[ ] Configure filtering out localiser scans for the MS-PINPOINT project

Testing

We will need to expand our set of test DICOM files, but they should be generated on the fly: they're large and we don't want loads of them in the git repo!

[ ] Generate test DICOM files with variations of known positioning series descriptions
[ ] Test that the developed series filtering successfully excludes these scans
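
As a rough sketch of how the "variations of known positioning series descriptions" could be enumerated for parametrised tests, something like the following might work. The prefixes, suffixes and the name `description_variants` are made up for illustration; only the base terms come from this thread, and real scanner output will likely be messier.

```python
import itertools

# Base terms mentioned in this issue; spellings beyond these are not covered.
BASE_TERMS = ["localizer", "localiser", "scout", "positioning"]

# Hypothetical decorations imitating inconsistent naming and numbering.
PREFIXES = ["", "3-plane ", "t2_"]
SUFFIXES = ["", " 2", "_01"]


def description_variants(term):
    """Yield casing, prefix and numbering variations of a single term."""
    casings = [term.lower(), term.upper(), term.capitalize()]
    for prefix, cased, suffix in itertools.product(PREFIXES, casings, SUFFIXES):
        yield f"{prefix}{cased}{suffix}"


# De-duplicated, sorted list that a test could iterate over.
ALL_VARIANTS = sorted({v for t in BASE_TERMS for v in description_variants(t)})
```

Each entry in `ALL_VARIANTS` could then be written into the Series Description of a generated mock DICOM file, with the test asserting that the filter excludes it.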

Documentation

I suggest that as part of this we create a template project config file (or perhaps a markdown doc) that explains what all the parameters do, as I don't see this documented anywhere. Perhaps the comments in project_config.py could cover some of this too (without repeating ourselves too much).

Dependencies

No response

Details and Comments

May have to filter on receipt, probably with a regex because of inconsistent naming and numbering?

jeremyestein commented 5 months ago

Summary of chat with @HChughtai

A study is a whole imaging "session" (but that word means something else so don't use it).

A series is a set of instances (images) within a study taken with the same physics parameters, and with a certain purpose (eg. to search for areas to scan in more detail)

Every MRI study will have a T1 weighted scan, you would typically want this included (it's a bit like a reference or baseline for other types of scan).

But we will want to filter out some types of series that are unlikely to be useful for our research users, such as "positioning" series (synonyms: "localizer", "scout")

The DICOM tag "Series Description" is likely the most useful one here. Unfortunately, it's based - at least in part - on manually entered data, so will vary from machine to machine, operator to operator, and could have typos etc.

So the person specifying the research project will not be able to give a simple list of series description strings to include or exclude.

We assume the purpose of this filter is to reduce disk space and network transfers and reduce the amount of unwanted data that the researchers have to deal with, but that there are no anonymisation consequences here.

Therefore, it is better to accidentally include rather than to accidentally exclude something, and therefore, at least as a first pass, we should focus on what things we can safely exclude, rather than drawing up a list of safe strings for inclusion or trying to classify every single series description.

A starting point might be to exclude anything containing a string suggesting it's a positioning series (see synonyms above, and consider alternate spellings). It's surely unlikely someone would include that string in a series description if it were not such a thing, given the technical meaning.
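
A minimal sketch of that starting point, assuming a case-insensitive substring match on the Series Description is good enough for a first pass. The pattern and the function name `is_positioning_series` are illustrative only; abbreviations and typos are not handled here.

```python
import re

# Matches "localiser"/"localizer", "scout" and "positioning" anywhere in the
# description, ignoring case. Illustrative only: typos are not covered.
POSITIONING_PATTERN = re.compile(r"locali[sz]er|scout|positioning", re.IGNORECASE)


def is_positioning_series(series_description):
    """Return True if the description looks like a positioning series."""
    # A missing or empty description is treated as "not positioning", so the
    # series is kept (better to include by accident than exclude by accident).
    if not series_description:
        return False
    return POSITIONING_PATTERN.search(series_description) is not None
```

For example, `is_positioning_series("3-Plane LOCALIZER 2")` is true, while `is_positioning_series("T1 MPRAGE sag")` is false, so the T1 weighted series would be kept.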

However, it seems clear that we need to implement a framework for including and excluding things via the project config file, rather than eg. hardcoding strings.

You can also query the VNA metadata, and then only request the series you want, to reduce load on the VNA. Maybe not in the first pass though.

We should record somewhere which series got rejected (in DB or logs)
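
To make the config-driven idea and the rejection record concrete, here is one possible shape, assuming exclusion patterns are listed per project and rejections go to the logs. `SeriesFilterConfig` and `split_series` are hypothetical names, not PIXL's actual config schema or API.

```python
import logging
import re
from dataclasses import dataclass, field

logger = logging.getLogger("series_filter")


@dataclass
class SeriesFilterConfig:
    """Hypothetical per-project filter settings (not the real PIXL schema)."""

    project: str
    exclude_patterns: list = field(default_factory=list)

    def compiled(self):
        # Compile once per call; patterns are matched case-insensitively.
        return [re.compile(p, re.IGNORECASE) for p in self.exclude_patterns]


def split_series(descriptions, config):
    """Split series descriptions into (kept, rejected) and log each rejection."""
    patterns = config.compiled()
    kept, rejected = [], []
    for description in descriptions:
        # First exclusion pattern that matches, or None if nothing matches.
        matched = next((p.pattern for p in patterns if p.search(description)), None)
        if matched is None:
            kept.append(description)
        else:
            logger.info(
                "project=%s rejected series %r (matched %r)",
                config.project, description, matched,
            )
            rejected.append(description)
    return kept, rejected
```

With `SeriesFilterConfig("ms-pinpoint", [r"locali[sz]er", "scout"])`, calling `split_series(["T1 MPRAGE", "Localiser 1", "SCOUT"], config)` keeps only the T1 series and logs the other two. The same split could later be applied to metadata queried from the VNA before requesting any series.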

DICOM files: one file per series is the modern standard; the older standard (which the hospital likely uses) is one file per instance.

Jon Stutters wrote a script to generate mock DICOM data. There's apparently also a pre-existing library for this.

HChughtai commented 5 months ago

Thanks for the notes @jeremyestein - looks like we're agreed on the way forward, but to summarise:

This issue will focus on filtering out series that are unneeded for research
Filtering can occur either when pulling from PACS/VNA or after the series are received in PIXL (likely the latter in this pass)
We should record what we reject to aid debugging
We should keep an eye on future environments and design an approach that can be customised further later on

@milanmlft was also looking into modifying the script that Jon Stutters wrote to generate mock DICOM data, so it may be worth having a chat with him too. The library that I mentioned was https://github.com/sjoerdk/dicomgenerator but it's not actively maintained and doesn't seem to allow full configuration of DICOM tag content.

milanmlft commented 5 months ago

See #348 for an updated version of the mock DICOM generator

UCLH-Foundry / PIXL