CSCfi / beacon-python

Python-based GA4GH Beacon API Server
https://beacon-python.readthedocs.io
Apache License 2.0

Add information about 1-based or 0-based data in beacon #138

Open blankdots opened 4 years ago

blankdots commented 4 years ago

Proposed solution

Include information in the response that specifies whether the data in the beacon is 0-based or 1-based. While the recommendation for the API is to be 0-based (https://github.com/ga4gh-beacon/specification/issues/251), that might not always be the case. Hence we will add a field to the API response so that a beacon deployment can specify which coordinate base its data uses.
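To illustrate why the declared base matters, here is a minimal sketch of normalising a single position to 0-based once the base is known. The function names `to_zero_based` and `from_zero_based` are hypothetical and not part of beacon-python; the sketch covers single positions only, not interval conventions.

```python
def to_zero_based(position: int, coordinate_base: int) -> int:
    """Convert a single position to 0-based, given the declared base (0 or 1)."""
    if coordinate_base not in (0, 1):
        raise ValueError("coordinate_base must be 0 or 1")
    if position < coordinate_base:
        raise ValueError("position is below the first valid coordinate")
    # A 1-based position is shifted down by one; a 0-based one is unchanged.
    return position - coordinate_base


def from_zero_based(position: int, coordinate_base: int) -> int:
    """Convert a 0-based position back to the declared base (0 or 1)."""
    if coordinate_base not in (0, 1):
        raise ValueError("coordinate_base must be 0 or 1")
    return position + coordinate_base


# The same variant, stored as 100 in a 1-based source, is queried as 99 in a 0-based API.
print(to_zero_based(100, 1))
```

Without the declared base, a client cannot tell whether a stored position of 100 refers to the 100th or the 101st base.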

This is GA4GH related.

DoD (Definition of Done)

The info object contains a key that specifies whether the data is 0-based or 1-based.

Testing

Unit test and peer review.

blankdots commented 4 years ago

Based on an offline conversation with @teemukataja:

blankdots commented 4 years ago

Solved in beacon network UI with: https://github.com/CSCfi/beacon-network-ui/pull/32/commits/2ffc7000fefb8d38a3c70e9240453de0d28f4784

teemukataja commented 4 years ago

Three solutions come to mind:

  1. Declare the file type. Each file type has a specification, which may convey the coordinate base to the user.
    {
        "datasetAlleleResponses": [
            {
                ...,
                "info": {
                    "fileType": "vcf"
                }
            }
        ]
    }
  2. Declare the coordinate base system, regardless of file type.
    {
        "datasetAlleleResponses": [
            {
                ...,
                "info": {
                    "coordinateBase": 1
                }
            }
        ]
    }
  3. Combine both.
    {
        "datasetAlleleResponses": [
            {
                ...,
                "info": {
                    "fileType": "vcf",
                    "coordinateBase": 1
                }
            }
        ]
    }
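The three options above differ only in which keys end up in `info`, so they can be sketched with one helper. This is a rough illustration, not the beacon-python implementation: `build_info` and the `datasetId` value are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def build_info(file_type: Optional[str] = None,
               coordinate_base: Optional[int] = None) -> Dict[str, Any]:
    """Build the `info` object for one dataset response (hypothetical helper)."""
    if coordinate_base is not None and coordinate_base not in (0, 1):
        raise ValueError("coordinate_base must be 0 or 1")
    info: Dict[str, Any] = {}
    if file_type is not None:
        info["fileType"] = file_type.lower()       # option 1
    if coordinate_base is not None:
        info["coordinateBase"] = coordinate_base   # option 2
    return info                                    # both set: option 3


# Option 3: declare both the file type and the coordinate base.
response = {
    "datasetAlleleResponses": [
        {"datasetId": "example-dataset", "info": build_info("vcf", 1)}
    ]
}
print(json.dumps(response, indent=4))
```

Keeping both keys optional would let a deployment start with one option and add the other later without changing the response shape.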

We could derive the fileType from the input data files (*.vcf) in beacon_init, so that it is inserted into the database along with the other metadata.
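A minimal sketch of deriving the file type from the input file name might look like the following. `detect_file_type` is a hypothetical helper, not an existing beacon_init function, and the handling of compressed files (`.gz`, `.bgz`) is an assumption about what the inputs may look like.

```python
from pathlib import Path

# Assumption: compression suffixes should be skipped when detecting the type.
COMPRESSION_SUFFIXES = {"gz", "bgz"}


def detect_file_type(filename: str) -> str:
    """Return the lower-cased data file extension, or "" if none is found."""
    suffixes = [s.lstrip(".").lower() for s in Path(filename).suffixes]
    # Walk from the last suffix backwards so "sample.vcf.gz" yields "vcf".
    for suffix in reversed(suffixes):
        if suffix not in COMPRESSION_SUFFIXES:
            return suffix
    return ""


print(detect_file_type("data/example.vcf.gz"))
```

The returned value could then be stored with the dataset metadata at load time and echoed back as `fileType` in the response.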

Concerns

What if a dataset contains multiple file types? We could then use arrays instead, e.g. "fileType": ["bam", "vcf"] and "coordinateBase": [0, 1], or "coordinateBase": "mixed". However, I don't know whether datasets typically contain mixed file types and mixed coordinate base systems; this needs investigation.
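One way the mixed case could be handled is sketched below, assuming each file record carries its own `fileType` and `coordinateBase`. The `summarize_dataset` helper and the record layout are hypothetical. The sketch picks the array form for `fileType` and the `"mixed"` form for `coordinateBase`, which is only one of the alternatives mentioned above.

```python
from typing import Any, Dict, Iterable, List


def summarize_dataset(files: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Collapse per-file metadata into a single dataset-level `info` object."""
    records: List[Dict[str, Any]] = list(files)
    if not records:
        raise ValueError("dataset contains no files")
    file_types = sorted({r["fileType"] for r in records})
    bases = sorted({r["coordinateBase"] for r in records})
    return {
        # A single value stays a scalar; several values become an array.
        "fileType": file_types[0] if len(file_types) == 1 else file_types,
        # Several bases are reported as "mixed" rather than as [0, 1].
        "coordinateBase": bases[0] if len(bases) == 1 else "mixed",
    }


info = summarize_dataset([
    {"fileType": "bam", "coordinateBase": 0},
    {"fileType": "vcf", "coordinateBase": 1},
])
print(info)
```

A drawback of this shape is that clients must accept both a scalar and an array or string for the same key; always returning arrays would be more uniform.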