Sydney-Informatics-Hub / xnat-uploader

Command-line tool for batch uploading of DICOMs to XNAT

xnatuploader

A command-line tool for uploading batches of DICOMs to an XNAT server, building on the xnatutils library.

Usage

xnatuploader uses a 'two pass' approach to uploading files to an XNAT server. The first pass scans a directory for files to upload, builds a list of the files and their associated metadata and saves the list as a spreadsheet. The second pass reads the spreadsheet and uploads the files to XNAT.

On the second pass, the status of each file upload - whether it was successful, and any error messages if the upload failed - is written back to the spreadsheet.

The second pass can be re-run using the updated spreadsheet - files which have been successfully uploaded already will be skipped, and files which were not uploaded on earlier runs will be re-attempted.
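
The re-run behaviour can be pictured with a short Python sketch. This is not xnatuploader's own code: the row layout, the "status" and "error" keys, the status values and the send callback are all hypothetical stand-ins for the spreadsheet rows and the real upload step.

```python
def upload_pass(rows, send):
    """Attempt every row that has not already succeeded and record the outcome.

    rows: list of dicts with hypothetical keys "file", "status" and "error".
    send: hypothetical callable that uploads one file and raises on failure.
    """
    for row in rows:
        if row.get("status") == "success":
            continue  # uploaded on an earlier run, so skip it
        try:
            send(row["file"])
            row["status"] = "success"
            row["error"] = ""
        except Exception as exc:
            # Record the failure and carry on with the next file
            row["status"] = "failed"
            row["error"] = str(exc)
    return rows
```

Because the outcome is stored per row, calling upload_pass again with the same rows only retries the files that are not yet marked as successful.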

The details of how xnatuploader gets metadata for each file are configured using the spreadsheet: xnatuploader can write out a pre-initialised spreadsheet before you run the first pass.

Windows

From the Start menu, select the Anaconda prompt. Once the Anaconda prompt is opened, you'll need to activate the conda environment in which xnatuploader was installed:

conda activate xnatuploader

Initialisation

xnatuploader is run by typing commands at the Anaconda prompt or terminal:

Initialising the spreadsheet

xnatuploader init --spreadsheet spreadsheet.xlsx

This initialises a spreadsheet with a single worksheet with default configuration values. Details of the configuration are in the section on finding files.

Scanning for files

xnatuploader scan --spreadsheet spreadsheet.xlsx --dir data_files

This scans the directory data_files for DICOMs and builds a list of files and metadata, which is stored in the spreadsheet as a new worksheet named 'Files'.

Uploading files

xnatuploader upload --spreadsheet spreadsheet.xlsx --dir data_files --project Test001 --server https://xnat.institution.edu/

This command will prompt you for your username and password on the XNAT server, read all the files from the spreadsheet and attempt to upload them.

The files will be uploaded to the specified project, using XNAT's hierarchy of Subject, Session and Dataset.

A Session can have multiple datasets, and a dataset will typically have many individual DICOM files.

The subject, session and dataset are based on the metadata values which are extracted in the scanning pass. See the Finding files section below for more details and configuration instructions.

The upload command needs to know both the URL of the XNAT server and a project ID on the server to which the scans will be added. The project must already exist, and you must have the right access level to add scans to it.

You can also configure the XNAT URL and project ID in the spreadsheet - see the screenshot below for an example. If you specify an XNAT server or project ID as options on the command line, these values will be used in preference to the values in the spreadsheet.
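
As a minimal illustration of that precedence rule, the following Python sketch picks the command-line value when one was given and falls back to the spreadsheet value otherwise. The function name and the "OtherProject" value are hypothetical; this is not how xnatuploader is necessarily implemented.

```python
def resolve_setting(cli_value, spreadsheet_value):
    """Return the command-line value if one was given, else the spreadsheet value."""
    if cli_value is not None:
        return cli_value
    return spreadsheet_value


# No --server option given, so the spreadsheet value is used
server = resolve_setting(None, "https://xnat.institution.edu/")
# --project Test001 given, so it wins over the spreadsheet value
project = resolve_setting("Test001", "OtherProject")
```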

Interrupting and restarting

When a file can't be uploaded due to a network error, or the integrity check for the upload fails, this error will be recorded in the spreadsheet and a warning will be printed to the command line. If you re-run the upload, the program will try to re-upload those files which didn't successfully upload on earlier runs.

While uploading is in progress, two progress bars will be shown on the command line. One shows the progress at a high level, with a step for every dataset in the upload (note that a single patient may have more than one dataset). The second progress bar shows a step for every file in the current dataset.

You can interrupt the upload by pressing Ctrl-C. If you do this, the program will prompt you to confirm whether you really want to stop. If you confirm, the current progress will be written back to the spreadsheet, and all files which haven't yet been uploaded will have a status written to indicate that the upload was interrupted. If you re-run the upload with the same spreadsheet, it will continue from where it was interrupted.
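
A rough Python sketch of this interrupt-and-resume idea is shown below. It leaves out the confirmation prompt and the spreadsheet itself, and the row keys, status values and send callback are hypothetical names chosen for the example.

```python
def upload_all(rows, send):
    """Upload rows in order; on Ctrl-C, mark everything not yet uploaded as interrupted."""
    try:
        for row in rows:
            if row.get("status") == "success":
                continue  # already uploaded on an earlier run
            send(row["file"])
            row["status"] = "success"
    except KeyboardInterrupt:
        # Ctrl-C raises KeyboardInterrupt: record which files are still outstanding
        for row in rows:
            if row.get("status") != "success":
                row["status"] = "interrupted"
    return rows
```

Running upload_all again on the same rows skips the files already marked as successful and picks up from the first interrupted one.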

Finding files

When xnatuploader scan is run, it scans the specified directory for files to upload. The scan looks for files at every level of the subdirectories within the directory, and tries to match values in the filepaths which can be used to determine how the files should be uploaded to XNAT.
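
To make the recursive scan concrete, here is a small Python sketch that walks a directory and yields each file's path split into its components, relative to the starting directory. The function name is hypothetical and the real scan may work differently; the sketch only shows the kind of input the path matching works on.

```python
from pathlib import Path


def find_files(base_dir):
    """Yield every file under base_dir as a tuple of path parts relative to base_dir."""
    base = Path(base_dir)
    for path in sorted(base.rglob("*")):
        if path.is_file():
            yield path.relative_to(base).parts
```

For a file stored at data_files/Doe^John-0001/20200312/Chest CT/scan0000.dcm, scanning data_files would yield the four parts "Doe^John-0001", "20200312", "Chest CT" and "scan0000.dcm".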

Here's an example of a directory structure with scans for two patients. The top-level directories have the naming convention SURNAME^GIVENNAME-ID. Inside each patient's directory is a directory named for the date of their visit in YYYYMMDD format, and inside those are one or more directories, one for each type of scan they received on the visit. These directories contain the actual DICOM files for the scan.

Doe^John-0001/20200312/Chest CT/scan0000.dcm
Doe^John-0001/20200312/Chest CT/scan0001.dcm
Doe^John-0001/20200312/Chest CT/scan0002.dcm
Roe^Jane-0342/20190115/Head CT/scan0000.dcm
Roe^Jane-0342/20190115/Head CT/scan0001.dcm
Roe^Jane-0342/20190115/Head CT/scan0002.dcm
Roe^Jane-0342/20190115/Head CT/scan0003.dcm
Roe^Jane-0342/20200623/Head CT/scan0000.dcm
Roe^Jane-0342/20200623/Head CT/scan0001.dcm
Roe^Jane-0342/20200623/Head CT/scan0002.dcm
Roe^Jane-0342/20200623/Neck CT/scan0000.dcm
Roe^Jane-0342/20200623/Neck CT/scan0001.dcm
Roe^Jane-0342/20200623/Neck CT/scan0002.dcm
Roe^Jane-0342/20200623/Neck CT/scan0003.dcm

To transform these filepaths into XNAT hierarchy values, we need to tell xnatuploader to get the ID from the top-level directory and the scan type from the name of the directories containing the DICOM files ("Chest CT", "Head CT" and so on).

We also need to get the session date. This could be done using the second level of directories, but it's safer to get it from the DICOM metadata, which will have a value StudyDate with the date of the scan.

Here is a configuration worksheet which will get the correct XNAT values from this directory layout.

A screenshot of a spreadsheet

Configuration for inferring the XNAT values is in two sections, "Paths" and "Mappings", which correspond to two steps:

  1. "Paths" tells the script how to match patterns against the filepath to create values
  2. "Mappings" assigns those values, or DICOM metadata values, to the XNAT hierarchy

Path matching

The "Paths" section of the config is one or more lists of patterns to be matched against paths. Each set of patterns starts with a label - in the example above, the label is "Nested". Each cell to the right of the label will be matched against one or more directory names in each of the filepaths, from left to right.

If all matches are successful, that path will be marked as a row to be uploaded in the spreadsheet, and the values captured from the path will be assigned to the XNAT hierarchy values according to the "Mappings" section of the config worksheet.

The parts of the patterns in curly brackets like {SubjectName} are used to capture values. Patterns which are in all caps, such as {YYYY} or {II}, will only match numbers with the specified number of digits. All other patterns, for example {SubjectName}, will match any sequence of characters.

There are two special patterns, * and **. * matches a single directory with any name, and ** matches one or more directories with any name. ** lets you construct a pattern which will match files which might be nested at different levels for different patients.
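
The matching rules described above can be sketched in Python as follows. This is an illustration written for this README, not the matcher xnatuploader actually uses: the function names are made up, the capture names in the usage lines ({Number}, {MM}, {DD}) are hypothetical, and the sketch assumes each capture name appears only once in a set of patterns.

```python
import re


def cell_to_regex(cell):
    """Turn one pattern cell such as "{YYYY}{MM}{DD}" into a compiled regex."""
    out = []
    for token in re.split(r"(\{[A-Za-z_]\w*\})", cell):
        if token.startswith("{") and token.endswith("}"):
            name = token[1:-1]
            if name.isupper():
                # All-caps names match exactly as many digits as the name has letters
                out.append(f"(?P<{name}>\\d{{{len(name)}}})")
            else:
                # Any other name matches any sequence of characters
                out.append(f"(?P<{name}>.+?)")
        else:
            out.append(re.escape(token))  # literal text between captures
    return re.compile("".join(out))


def match_path(cells, parts):
    """Match pattern cells against path parts; return the captured values or None."""
    if not cells:
        return {} if not parts else None
    if not parts:
        return None
    head, rest = cells[0], cells[1:]
    if head == "**":
        # "**" consumes one or more directories, so try every possible split
        for i in range(1, len(parts) + 1):
            found = match_path(rest, parts[i:])
            if found is not None:
                return found
        return None
    if head == "*":
        return match_path(rest, parts[1:])  # skip exactly one directory
    m = cell_to_regex(head).fullmatch(parts[0])
    if m is None:
        return None
    found = match_path(rest, parts[1:])
    if found is None:
        return None
    return {**m.groupdict(), **found}


cells = ["{SubjectName}-{Number}", "{YYYY}{MM}{DD}", "{Directory}", "{filename}"]
parts = ["Doe^John-0001", "20200312", "Chest CT", "scan0000.dcm"]
values = match_path(cells, parts)
```

Here values holds one entry per capture, for example "Doe^John" for SubjectName and "Chest CT" for Directory, or None if any cell fails to match.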

Note: if one set of patterns isn't flexible enough to match all the ways in which scans are stored in the directory, you can add extra patterns as new rows in the spreadsheet. Each set of patterns needs a unique label. Sets of patterns will be matched in order from the top, and the first one which succeeds will be used.

XNAT hierarchy mapping

The "Mappings" section tells the script how to build the three XNAT hierarchy values, Subject, Session and Dataset, based on values captured from the paths and/or metadata fields read from the DICOM files. In the example, we're setting the Subject to the ID value, the Session to the StudyDate value extracted from the DICOMs, and the Dataset to the Directory value.
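
As a simplified picture of this step, the Python sketch below builds the three hierarchy values from a dictionary of path captures and a dictionary of DICOM metadata. It assumes one source value per hierarchy level and gives path captures priority if a name appears in both places; both of these are assumptions made for the example, and the function name is hypothetical.

```python
def build_hierarchy(mappings, path_values, dicom_values):
    """Build Subject/Session/Dataset from captured path values and DICOM metadata.

    mappings maps each XNAT level to the name of the value to use for it.
    """
    # Assumption: a path capture wins if the same name is also a DICOM field
    available = {**dicom_values, **path_values}
    return {level: available[source] for level, source in mappings.items()}


mappings = {"Subject": "ID", "Session": "StudyDate", "Dataset": "Directory"}
path_values = {
    "SubjectName": "Doe^John",
    "ID": "0001",
    "Directory": "Chest CT",
    "filename": "scan0000.dcm",
}
dicom_values = {"StudyDate": "20200312"}
hierarchy = build_hierarchy(mappings, path_values, dicom_values)
```

With these inputs, hierarchy has Subject "0001", Session "20200312" and Dataset "Chest CT".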

Matching example

Here's a step-by-step illustration of how the set of patterns in the example is matched against a filepath:

Doe^John-0001/20200312/Chest CT/scan0000.dcm

The XNAT hierarchy values are then built according to the rules in the "Mappings" section:

Note that not every value captured from the path needs to be used in "Mappings": in this example, SubjectName and filename are ignored.

The filename for the purposes of uploading will be automatically generated from the path itself.

Checking the spreadsheet

By default, all files which match a pattern are marked as selected for upload in the spreadsheet. Before uploading, you can edit the spreadsheet to deselect files or groups of files which shouldn't be uploaded.

By default, files for which no match succeeds won't be written to the spreadsheet. You can run a scan with the --unmatched flag, which will write a row for every file whether or not the match succeeds:

xnatuploader scan --spreadsheet spreadsheet.xlsx --dir data_files --unmatched

This can be useful for figuring out why files aren't matching patterns.

Installation

If you're on Windows, you'll need to install Anaconda, which will install the Python programming language and environment manager on your PC.

Open the Anaconda prompt via the Start menu.

If you're on a Mac or Linux, you can also install Conda using the relevant installer, and just use a terminal for the rest of the installation instructions.

conda create -n xnatuploader

This will create a separate Python environment in which we'll install xnatuploader. Answer 'Y' when it prompts you to proceed.

conda activate xnatuploader

This activates the Python environment you've just created.

pip install xnatuploader

This will download and install the latest version of xnatuploader. To check that everything has worked after the download and installation have finished, type the following:

xnatuploader --help

You should get a message like the following:

usage: XNAT batch uploader [-h] [--dir DIR] [--spreadsheet SPREADSHEET]
                           [--server SERVER] [--project PROJECT]
                           [--loglevel LOGLEVEL] [--debug] [--logdir LOGDIR]
                           [--test] [--unmatched] [--overwrite]
                           {init,scan,upload,help}

positional arguments:
  {init,scan,upload,help}
                        Operation

optional arguments:
  -h, --help            show this help message and exit
  --dir DIR             Base directory to scan for files
  --spreadsheet SPREADSHEET
                        File list spreadsheet
  --server SERVER       XNAT server
  --project PROJECT     XNAT project ID
  --loglevel LOGLEVEL   Logging level
  --debug               Debug mode: only attempt to match 10 patterns and
                        generates a lot of debug messages
  --logdir LOGDIR       Directory to write logs to
  --test                Test mode: don't upload, just log what will be
                        uploaded
  --unmatched           Whether to include unmatched files in list
  --overwrite           Whether to overwrite files which have already been
                        uploaded

Task scheduler

xnatuploader can be run by the Windows Task Scheduler if it's wrapped in the following batch file (replace USER_NAME with your own username):

call C:\Users\USER_NAME\Anaconda3\Scripts\activate.bat
call conda activate xnatuploader
call xnatuploader upload --spreadsheet spreadsheet.xlsx --dir dir

Note that this will only work if the underlying xnat library can authenticate using credentials stored in a _netrc or .netrc file. See the xnatpy documentation for instructions on how to set this up.
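
Before scheduling the job, it can help to check that the netrc file is readable and has an entry for your server. A netrc entry is a line of the form "machine HOST login USERNAME password PASSWORD", where HOST is the server's hostname without the https:// prefix. The sketch below uses Python's standard netrc module to do the check; the function name is hypothetical, and it says nothing about how the underlying xnat library itself looks up credentials.

```python
import netrc


def has_credentials(netrc_path, host):
    """Return True if the netrc file at netrc_path has a login and password for host."""
    try:
        auth = netrc.netrc(netrc_path).authenticators(host)
    except (OSError, netrc.NetrcParseError):
        return False  # file missing, unreadable or malformed
    # authenticators() returns a (login, account, password) tuple, or None
    return auth is not None and bool(auth[0]) and bool(auth[2])
```

For example, has_credentials(r"C:\Users\USER_NAME\_netrc", "xnat.institution.edu") should return True once an entry for that host has been added.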

Alternatively, if you run xnatuploader manually and authenticate, it will keep an authentication token on your machine for a time period (controlled by the policy of your XNAT server, typically 72 hours). Once you've authenticated manually, a task scheduler job should work until the token expires.

Upgrading

To get the latest version of xnatuploader, type the following (at the Anaconda prompt or terminal):

pip install --upgrade xnatuploader