victorskl closed this pull request 3 years ago.
Hi @jb-adams, I just added CORS support for the Go chi server. It is probably also related to #23. Please kindly review. Thanks.
/cc @ohofmann @brainstorm @reisingerf
thanks @victorskl ! Can you open this PR against `ga4gh:develop` rather than `ga4gh:master`? I'm happy to review
Sure @jb-adams, sorry about that. I have changed the base branch to `develop`. May I suggest setting `develop` as the default branch in the GitHub repo settings to help future PRs, I reckon.
Also, I'd just like to mention that we have this htsget Go server backend running in our AWS: basically API Gateway v2 fronting an auto-scalable ECS cluster in a private subnet -- the related CDK IaC stack and architecture are available in our infra repo. In particular, our Go server deployment config only allows (limits) origins from our data portal domains, which run igv.js to call this Go server's htsget endpoint through API Gateway as a proxy. With that, we aim to terminate handling of CORS, SSL, AuthZ, throttling, etc. at API Gateway. It has worked great so far.
Next, we will probably tackle AuthZ and how to hook up the data source registry to our primary data -- i.e. mapping htsget endpoint IDs to our metadata stores (custom metadata database, Gen3, S3, GDS, etc.) -- which may, in a sense, lead to DRS and Passport/Visa in GA4GH, I reckon. Still new to GA4GH. I have yet to catch up with the https://github.com/ga4gh/data-repository-service-schemas/issues/339 discussion (not sure if it's related) and to understand the concepts around it. Please kindly share pointers on dynamically mapping htsget endpoint IDs to the data source registry from some metadata store, in terms of the GA4GH space. Thanks.
Also /cc @andrewpatto
thanks @victorskl , I've updated the default branch to `develop`.
our reference deployment (at https://htsget.ga4gh.org) is currently deployed in a similar way to yours, i.e. with API Gateway sitting in front of an ECS cluster running the docker image -- good to see similar infrastructure here
The data source registry is a concept we introduced to allow flexible configuration of the server to stream data from custom data sources. It's a concept we'll likely replicate across our reference implementations (e.g. DRS). Happy to provide any pointers on how to wire it up.
> Please kindly share pointers on dynamically mapping htsget endpoint IDs to the data source registry from some metadata store, in terms of the GA4GH space.
If you have files that you'd like to stream over htsget, and they follow a consistent naming scheme and are located under a consistent file path, then this is possible to model as an entry in the data source registry. A single object (i.e. data source) in the data source registry contains both a `pattern` and a `path`.

The `pattern` is a regex pattern that the server will try to match a requested `id` against. For each request, the `id` is evaluated against each registered `pattern`, and the first match found will be the data source used. The `path` then expresses the URL or local file path that objects of that `id` pattern resolve to. Capture groups in the `pattern` can be used to populate the `path` template, providing a custom URL/path to an object based on a requested `id`.
e.g. Let's say you had 3 BAMs within one directory on a cloud storage bucket. The URLs to access them would be:
https://data.somesite.org/datasets/A/bam/00001.bam
https://data.somesite.org/datasets/A/bam/00002.bam
https://data.somesite.org/datasets/A/bam/00003.bam
A single data source could capture all files in this directory with the following config, for example:
```json
{
  "pattern": "^dataset.A.(?P<id>\\d{5})$",
  "path": "https://data.somesite.org/datasets/A/bam/{id}.bam"
}
```

(Note the `\\d` -- a backslash must be escaped inside a JSON string.)
The server evaluates the requested `id` for a match to the `pattern`, i.e. does the `id` have a 5-digit number after "dataset.A."? If the pattern matches, the `id` captured by the regex is used to populate the URL `path` template pointing to the BAM.
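The matching behaviour described above can be sketched in Go with the standard `regexp` package. This is a sketch of the mechanism only, not the server's actual implementation; `resolvePath` is a hypothetical helper name, and the dots in the pattern are escaped here, which is slightly stricter than the example config:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// resolvePath matches a requested htsget id against a data source
// pattern and, on success, fills each {name} placeholder in the path
// template with the corresponding named capture group. It reports
// false when the id does not match the pattern.
func resolvePath(id, pattern, path string) (string, bool) {
	re := regexp.MustCompile(pattern)
	m := re.FindStringSubmatch(id)
	if m == nil {
		return "", false
	}
	for i, name := range re.SubexpNames() {
		if name != "" {
			path = strings.ReplaceAll(path, "{"+name+"}", m[i])
		}
	}
	return path, true
}

func main() {
	url, ok := resolvePath(
		"dataset.A.00001",
		`^dataset\.A\.(?P<id>\d{5})$`,
		"https://data.somesite.org/datasets/A/bam/{id}.bam",
	)
	fmt.Println(ok, url)
	// prints: true https://data.somesite.org/datasets/A/bam/00001.bam
}
```

In the real server you would iterate over every registered data source and take the first `pattern` that matches, as described above.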
Hope this helps, please let me know if you need any further explanation.
> I've updated the default branch to develop.
Awesome!
> The pattern indicates a regex pattern [...]
Thanks for the pointer and explanation -- appreciate it! We still have a metadata store, i.e. another database + API backend that serves a logical aggregate view (a collection of the actual files -- BAMs, VCFs, etc. -- in S3 buckets) of an entity of interest. I guess we will experiment and try things out around this data source pattern/regex and/or DRS, or whatnot...
Anyway, I don't want to digress too much from this PR; please feel free to review and suggest amendments, if any.
Thanks
@jb-adams I have rebased it onto the `develop` branch, just in case.
Anyway, the PR build failed due to Travis hitting a GitHub rate limit while downloading samtools:
```
$ chmod 700 ${SAMTOOLS} && source ${SAMTOOLS} && samtools --version
--2021-01-25 04:22:02-- https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 429 too many requests
2021-01-25 04:22:02 ERROR 429: too many requests.
```
Could you please restart the build when you are able? I'd also highly recommend migrating the CI build to GitHub Actions, since things are already in the GitHub ecosystem and that would implicitly avoid such rate limits, I reckon.
Added S3 protocol and Private Bucket support
Please also refer to the corresponding CDK stack README for an example deployment use case and igv.js integration through AuthZ.
@jb-adams What do you think about this PR as it stands right now? I reckon support for private S3 buckets would be an interesting feature for other groups, present and future?