MIT-LCP / physionet-build

The new PhysioNet platform.
https://physionet.org/
BSD 3-Clause "New" or "Revised" License

Develop API for integrating with annotation platforms #1047

Open tompollard opened 4 years ago

tompollard commented 4 years ago

There are several existing platforms that could be used to gather useful annotations for PhysioNet datasets. This needs a lot more thought, but as a rough idea it would be good to develop a general API that:

  1. allows an external platform to request a data file for annotation, perhaps along with associated metadata (details to be determined, but this might include existing annotations).
  2. allows the external platform to submit a structured annotation back to PhysioNet.
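
Purely as an illustration of this request/submit round trip (every endpoint path, field name, and credential below is hypothetical rather than part of an existing PhysioNet API), the flow from the external platform's side might look something like:

import requests

BASE = 'https://physionet.org/api'                         # hypothetical API root
HEADERS = {'Authorization': 'Token <platform-api-token>'}  # placeholder credentials

# 1. Request a data file (and any associated metadata) for annotation.
resp = requests.get(f'{BASE}/projects/<project-slug>/records/<record-name>',
                    headers=HEADERS)
record = resp.json()  # e.g. record name, file URL, existing annotations

# 2. Submit a structured annotation back to PhysioNet.
annotation = {
    'record_name': record['record_name'],
    'platform_name': 'label-studio',
    'annotator': '<username>',
    'labels': [{'start': 10.2, 'end': 11.6, 'label': 'AFIB'}],
}
requests.post(f'{BASE}/projects/<project-slug>/annotations/',
              json=annotation, headers=HEADERS)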

Metadata profile for an annotation

The structure of the annotation will need to be developed. At minimum, the metadata should probably include:

Metadata profile for an annotation task

One of the major challenges is understanding how the API can be made generalizable across PhysioNet, ideally to support multiple data types and modalities (images, waveforms, notes, etc). It feels like the annotation task will require a formal definition that would state things like:

Providing an interface for the annotation functionality

Annotation tasks may be driven by the research question, and there may be multiple annotation tasks for a single dataset. We need to come up with a simple way of allowing PhysioNet users to propose and implement an annotation task. My suggestion is that we do this by introducing a new "annotation" project type (see https://github.com/MIT-LCP/physionet-build/issues/1032).

Summary of tasks

So in summary, some good first steps might be to:

  1. Design a metadata profile for a generalizable annotation. We can probably reuse or build on the format used by existing annotation platforms that we have looked at.
  2. Design a metadata profile for an annotation task.
  3. Review whether the existing project functionality (https://github.com/MIT-LCP/physionet-build/issues/1032) could be modified to allow a new project type to be used for defining tasks and storing annotations.
  4. Design the API!
Lucas-Mc commented 4 years ago

@tompollard suggested GraphQL as a possible API and it looks great! It also seems like Graphene will be the best way for us to provide an easy interface with Django.

Lucas-Mc commented 4 years ago

Hey @tompollard, I have begun to write out the annotation model here:

from django.db import models


class AnnotationLabel(models.Model):
    """
    A way to save and edit annotation labels for signals.
    """
    # The published project that the annotated record belongs to.
    project = models.OneToOneField('project.PublishedProject', related_name='ann',
        on_delete=models.CASCADE)
    # The user who created or last edited the annotation.
    edited_by = models.ForeignKey('user.User', related_name='ann_editor',
        on_delete=models.CASCADE)
    # Set automatically when the annotation is first saved.
    creation_datetime = models.DateTimeField(auto_now_add=True)
    # The external platform (e.g. Label Studio) that produced the annotation.
    platform_name = models.CharField(max_length=150, null=True)
    # The record within the project that the annotation refers to.
    record_name = models.CharField(max_length=150, null=True)

My thoughts are that:

I think this may be about as general as we can get when it comes to sharing annotation models. For example, finding similarities between the actual annotation structure of signals and images may be difficult.
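
Tying this model back to the GraphQL/Graphene suggestion above, a minimal schema sketch might look something like the following (the import path, exposed fields, and query name are placeholders rather than settled choices):

import graphene
from graphene_django import DjangoObjectType

# Hypothetical import path for the AnnotationLabel model sketched above.
from project.models import AnnotationLabel


class AnnotationLabelType(DjangoObjectType):
    """Expose AnnotationLabel to the GraphQL schema."""
    class Meta:
        model = AnnotationLabel
        fields = ('project', 'edited_by', 'creation_datetime',
                  'platform_name', 'record_name')


class Query(graphene.ObjectType):
    # e.g. { annotations(recordName: "100") { platformName creationDatetime } }
    annotations = graphene.List(AnnotationLabelType,
                                record_name=graphene.String())

    def resolve_annotations(root, info, record_name=None):
        queryset = AnnotationLabel.objects.all()
        if record_name is not None:
            queryset = queryset.filter(record_name=record_name)
        return queryset


schema = graphene.Schema(query=Query)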

Lucas-Mc commented 4 years ago

As for potential annotation structures, one that is particularly appealing for signals is the format used by Label Studio. It can represent both region annotations [PR interval, QRS complex, etc.], which set a start and stop time, and beat annotations [Normal, AFIB, etc.], which set the stop time to null or to the same time as the start. Of course, we can edit and modify this however we like, but it may be a good start.

See an example of input annotations and output labeled annotation JSON here:

[Screenshot: annotated regions in the Label Studio interface]

[
    {
        "id": "gyV6XOeyCz",
        "from_name": "label",
        "to_name": "audio",
        "source": "$url",
        "type": "labels",
        "original_length": 3.774376392364502,
        "value": {
            "start": -0.004971698554622573,
            "end": 0.20349676497713773,
            "labels": [
                "Politics"
            ]
        }
    },
    {
        "id": "PJqb8mmmsC",
        "from_name": "label",
        "to_name": "audio",
        "source": "$url",
        "type": "labels",
        "original_length": 3.774376392364502,
        "value": {
            "start": 0.39002117971608113,
            "end": 0.6698078018244963,
            "labels": [
                "Business"
            ]
        }
    },
    {
        "id": "xcHF2NJUcs",
        "from_name": "label",
        "to_name": "audio",
        "source": "$url",
        "type": "labels",
        "original_length": 3.774376392364502,
        "value": {
            "start": 0.867304240959848,
            "end": 3.127541266619986,
            "labels": [
                "Education"
            ]
        }
    }
]
Lucas-Mc commented 4 years ago

Currently the attributes of the WFDB Annotation class used for writing the WFDB-format annotation files are:

[ 'ann_len', 'aux_note', 'chan', 'contained_labels', 'custom_labels', 'description', 'extension', 'fs',
'label_store', 'num', 'record_name', 'sample', 'subtype', 'symbol']

ann_len : int
    The number of samples in the annotation.
aux_note : list, optional
    A list containing the auxiliary information string (or None for
    annotations without notes) for each annotation.
chan : ndarray, optional
    A numpy array containing the signal channel associated with each
    annotation.
contained_labels : pandas dataframe, optional
    The unique labels contained in this annotation. Same structure as
    `custom_labels`.
custom_labels : pandas dataframe, optional
    The custom annotation labels defined in the annotation file. Maps
    the relationship between the three label fields. The data type is a
    pandas DataFrame with three columns:
    ['label_store', 'symbol', 'description'].
description : list, optional
    A list containing the descriptive string of each annotation label.
extension : str
    The file extension of the file the annotation is stored in.
fs : int, float, optional
    The sampling frequency of the record.
label_store : ndarray, optional
    The integer value used to store/encode each annotation label.
num : ndarray, optional
    A numpy array containing the labelled annotation number for each
    annotation.
record_name : str
    The base file name (without extension) of the record that the
    annotation is associated with.
sample : ndarray
    A numpy array containing the annotation locations in samples relative to
    the beginning of the record.
subtype : ndarray, optional
    A numpy array containing the marked class/category of each annotation.
symbol : list, numpy array, optional
    The symbols used to display the annotation labels. List or numpy array.
    If this field is present, `label_store` must not be present.

These are some of the things we should consider when building the new annotation model, especially if we decide to incorporate some of the functionality of Label Studio. I think some of these fields could be cut, but should we keep them for compatibility in case we decide to write a conversion method in the future?

*Some background on the conversion issue: @tompollard suggested, and I agreed, that it would be best to store these labels in XML (or possibly JSON) format, since it is easier to access and much more flexible. If someone wanted these annotations in WFDB format, we could then provide a conversion method for that.
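
As a rough sketch of what such a conversion method could look like for instant (beat-style) annotations, assuming the stored labels follow the Label Studio structure shown above, that times are in seconds from the start of the record, and that the sampling frequency is known (the symbol mapping is only an example):

import numpy as np
import wfdb


def labels_to_wfdb(record_name, labels, fs, symbol_map, extension='atr'):
    """Write Label Studio style instant annotations to a WFDB annotation file.

    labels is a list of dicts shaped like the JSON above, and symbol_map maps
    each label name to a WFDB annotation symbol, e.g. {'Normal': 'N'}.
    """
    # Sort by time and convert annotation times (seconds) to sample indices.
    labels = sorted(labels, key=lambda l: l['value']['start'])
    sample = np.array([round(l['value']['start'] * fs) for l in labels], dtype='int64')
    symbol = [symbol_map[l['value']['labels'][0]] for l in labels]
    # Writes <record_name>.<extension> in WFDB annotation format.
    wfdb.wrann(record_name, extension, sample=sample, symbol=symbol, fs=fs)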

Lucas-Mc commented 4 years ago

Label Studio is releasing a dedicated time-series annotation platform which allows the user to make annotations for both time ranges and single time points. Here is what the demo looks like:

[Screenshot: Label Studio time-series annotation demo]

You'll note that the user can specify the event they wish to annotate and then perform the desired annotation: a double-click for a single time-point annotation and a click-and-drag for a time-range annotation. You can also see the previous completions, which we could use to track multiple users annotating a single project. Additionally, we have the ability to set a ground-truth set of annotations if we ever want that functionality. Here is the resulting JSON (note that single-time annotations are saved with the same start and end time):

Result

[
    {
        "id": "QKaimQjoTQ",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592250821941.2595,
            "end": 1592250831927.112,
            "instant": false,
            "timeserieslabels": [
                "Event 1"
            ]
        }
    },
    {
        "id": "RSj46Dzkhe",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592250921955.7407,
            "end": 1592250921955.7407,
            "instant": true,
            "timeserieslabels": [
                "Event 1"
            ]
        }
    },
    {
        "id": "RKODZiMgsp",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592251211907.621,
            "end": 1592251211907.621,
            "instant": true,
            "timeserieslabels": [
                "Event 1"
            ]
        }
    },
    {
        "id": "nkRg1P9L5L",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592251461993.5276,
            "end": 1592251711941.2742,
            "instant": false,
            "timeserieslabels": [
                "Event 2"
            ]
        }
    },
    {
        "id": "NE7unB1-J1",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592252101985.5444,
            "end": 1592252101985.5444,
            "instant": true,
            "timeserieslabels": [
                "Event 3"
            ]
        }
    },
    {
        "id": "oHQC4dE7-u",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592252011979.126,
            "end": 1592252441979.4265,
            "instant": false,
            "timeserieslabels": [
                "Event 1"
            ]
        }
    },
    {
        "id": "M-dMRAbRxu",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592251341969.1328,
            "end": 1592251341969.1328,
            "instant": true,
            "timeserieslabels": [
                "Event 1"
            ]
        }
    },
    {
        "id": "agpadQD5i_",
        "from_name": "label",
        "to_name": "ts",
        "source": "$csv",
        "type": "timeserieslabels",
        "parent_id": null,
        "value": {
            "start": 1592252721959.5007,
            "end": 1592252851914.7446,
            "instant": false,
            "timeserieslabels": [
                "Event 3"
            ]
        }
    }
]
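
One thing to note about this output is that the start and end values appear to be epoch timestamps in milliseconds rather than offsets into the record, so mapping them back to sample numbers would need the record's start time and sampling frequency. A rough sketch, with illustrative values for both:

# Illustrative values; in practice the record start time and fs would come from the record header.
FS = 125                          # sampling frequency in Hz
RECORD_START_MS = 1592250000000   # record start time, epoch milliseconds


def result_to_samples(result):
    """Convert one Label Studio time-series result to (start_sample, end_sample)."""
    value = result['value']
    start = round((value['start'] - RECORD_START_MS) / 1000 * FS)
    end = round((value['end'] - RECORD_START_MS) / 1000 * FS)
    return start, end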
Lucas-Mc commented 4 years ago

It's worth noting that WFDB has a function called rr2ann which converts a series of RR intervals to annotations. I have already developed the reverse, ann2rr, in the latest 3.1.0 release of WFDB-Python, and plan to add rr2ann in the next release. We could use the beat annotations generated with the Label Studio annotation platform to produce RR intervals and convert them to annotations in WFDB format using WFDB-Python.
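
For reference, the RR intervals that ann2rr produces are essentially the first differences of the annotation sample locations. A minimal sketch of the idea, using a record and annotator from PhysioNet purely as an example (in practice, non-beat annotations would need to be filtered out first):

import numpy as np
import wfdb

# Record '100' from the MIT-BIH Arrhythmia Database, used purely as an example.
header = wfdb.rdheader('100', pn_dir='mitdb')
ann = wfdb.rdann('100', 'atr', pn_dir='mitdb')

rr_samples = np.diff(ann.sample)     # RR intervals in samples
rr_seconds = rr_samples / header.fs  # RR intervals in seconds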