CERIT-SC / funnel-gdi

MIT License

Htsget data-retrieval with encryption #7

Open mrtamm opened 2 months ago

mrtamm commented 2 months ago

Add support for requesting genomic data in encrypted (crypt4gh) format.

Htsget (more specifically htsget-rs) is supposed to support this functionality, as described here: https://github.com/umccr/htsget-rs/blob/194457b077d3387414800fd5ffcb2a2141a6d1b3/docs/crypt4gh/ARCHITECTURE.md

Funnel needs to implement the htsget protocol described there for downloading encrypted files.

This means extending the current htsget protocol implementation:

  1. forward client-public-key in HTTP headers
  2. detect that the referred file is encrypted (c4gh)
  3. forward server-public-key in HTTP headers when downloading parts
  4. decrypt the downloaded data
  5. configuration parameters for the key-pair and the server-public-key
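
Step 3 of the list above (together with forwarding whatever headers the ticket itself prescribes) could be sketched roughly as follows. This is only an illustration in Python, not Funnel's Go implementation: the function name `plan_part_downloads` is hypothetical, the ticket layout follows the htsget specification (an `htsget` object with a `urls` list whose entries may carry `headers`), and the `server-public-key` header name is the one used in this issue. How the key value is obtained and encoded is an assumption left to the htsget-rs document linked above.

```python
import json


def plan_part_downloads(ticket_json, server_public_key=None):
    """List (url, headers) pairs for every part named in an htsget ticket."""
    ticket = json.loads(ticket_json)["htsget"]
    plan = []
    for part in ticket["urls"]:
        # Headers supplied by the ticket (for example Range) are sent unchanged.
        headers = dict(part.get("headers", {}))
        if server_public_key is not None:
            # Assumption: the key is passed through as an opaque string.
            headers["server-public-key"] = server_public_key
        plan.append((part["url"], headers))
    return plan
```

Each returned pair can then be fetched in order and the bodies concatenated before decryption (step 4), which is not shown here.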

xhejtman commented 2 months ago

I believe client keys should be generated ad hoc by Funnel.

mrtamm commented 2 months ago

Initial development is here: https://github.com/mrtamm/funnel-gdi/tree/dev-htsget-crypt4gh

At the moment, I still need to do more full-scale testing (and possibly some fixing) before opening a PR, so I'm estimating May 8 for the PR.

xhejtman commented 2 months ago

From slack:

"inputs": [
    {
      "name": "pub key input",
      "description": "Public C4GH key.",
      "type": "FILE",
      "path": "/tmp/c4gh.pub",
      "content": "PUBKEY AS STRING"
    }
  ],

mrtamm commented 1 month ago

HTSGET storage configuration in Funnel now looks like this:

HTSGETStorage:
  Disabled: false
  Protocol: https
  SendPublicKey: false

When SendPublicKey is true, Funnel generates the key-pair if the existing key files are not found. Funnel itself cannot detect whether the htsget server sends the data encrypted, so the user must specify this explicitly.

Protocol specifies the protocol that replaces the htsget:// scheme when calling the htsget API (the default is https).
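
The effect of the two settings can be sketched roughly like this. It is an illustrative Python model, not Funnel's Go code: both function names are hypothetical, the key file names `.private.key` and `.public.key` are the ones used in the testing commands in this thread, and treating "one of the two files is missing" as "keys not found" is an assumption.

```python
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit


def resolve_api_url(htsget_url, protocol="https"):
    """Swap the htsget:// scheme for the configured Protocol value."""
    parts = urlsplit(htsget_url)
    if parts.scheme != "htsget":
        raise ValueError(f"not an htsget URL: {htsget_url}")
    return urlunsplit((protocol, parts.netloc, parts.path, parts.query, parts.fragment))


def needs_keygen(send_public_key, key_dir=".", names=(".private.key", ".public.key")):
    """True when SendPublicKey is on and at least one key file is missing."""
    if not send_public_key:
        return False
    return not all((Path(key_dir) / name).is_file() for name in names)
```

For example, `resolve_api_url("htsget://localhost:9090/variants/test_data?class=header", "http")` yields the plain HTTP URL of the ticket endpoint on the same host and port.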

mrtamm commented 1 month ago

Overview of the local testing setup.

Testing dependencies

  1. htsget-rs: https://github.com/umccr/htsget-rs/tree/crypt4gh
  2. htsget: https://pypi.org/project/htsget/
  3. crypt4gh: https://pypi.org/project/crypt4gh/

Htsget Docker Image

Inside htsget-rs directory:

cp deploy/Dockerfile .
docker build -t ghcr.io/umccr/htsget-rs:latest .

Htsget configuration

formatting_style = "Compact"

# The main ticket-server:
ticket_server_addr = "0.0.0.0:8080"

# The local-data-server:
data_server_enabled = true
data_server_local_path = "/data" # This is INSIDE the container

[[resolvers]]

[resolvers.storage]
response_url = "http://localhost:9091/"
forward_headers = true

[resolvers.storage.endpoints]
file = "http://localhost:8081/"
index = "http://localhost:8081/"

[resolvers.object_type]
send_encrypted_to_client = true
private_key = "/crypt4gh/private.key"
public_key = "/crypt4gh/public.key"

Folder-structure for Docker-Compose Data

./htsget/
  - crypt4gh/
    - private.key
    - public.key
  - data/
    - test_data.vcf.gz.c4gh
    - test_data.vcf.gz.tbi
  - htsget.toml

Generate private and public keys using command: crypt4gh-keygen -f --nocrypt --sk private.key --pk public.key

Sample VCF for testing (saved as test_data.vcf.gz in this setup): https://github.com/EGA-archive/beacon2-ri-tools/blob/main/test/test_1000G.vcf.gz

Generate index (TBI) for the VCF: bcftools index -t test_data.vcf.gz

Htsget on Docker-Compose

services:
  htsget:
    container_name: htsget
    image: ghcr.io/umccr/htsget-rs:latest
    command: htsget-actix --config /etc/htsget.toml
    ports:
      - "9090:8080"
      - "9091:8081"
    volumes:
      - "./htsget/data:/data:ro"
      - "./htsget/crypt4gh:/crypt4gh:ro"
      - "./htsget/htsget.toml:/etc/htsget.toml:ro"

After docker compose up, call the API (for testing):

curl -H 'client-public-key: Qjn...' 'http://localhost:9090/variants/test_data?class=header'
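
The same kind of ticket request can be prepared from Python with only the standard library. This is a sketch under assumptions: `build_ticket_request` is a hypothetical helper, the record id `test_data` matches the data file in the folder structure above, and the key value is passed through as an opaque string because its exact encoding is defined by htsget-rs, not here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_ticket_request(base_url, kind, record_id, client_public_key, query=None):
    """Prepare (but do not send) an htsget ticket request."""
    url = f"{base_url.rstrip('/')}/{kind}/{record_id}"
    if query:
        url += "?" + urlencode(query)
    # The header name is the one used in the curl call in this thread.
    return Request(url, headers={"client-public-key": client_public_key})
```

Passing the result to `urllib.request.urlopen` sends the request; the JSON body of the response is the ticket listing the URLs to download.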

Htsget configuration in Funnel

Copy config/default-config.yaml to my-config.yaml and modify HTSGETStorage:

HTSGETStorage:
  Disabled: false
  Protocol: http
  SendPublicKey: true

Htsget storage testing

# copy the keys:
cp htsget/crypt4gh/private.key .private.key
cp htsget/crypt4gh/public.key  .public.key

go run . storage get "htsget://localhost:9090/variants/test_data?class=header" header.vcf.gz -c my-config.yaml

MalinAhlberg commented 1 month ago

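First of all, it is really nice that you are implementing support for htsget! I'm testing this implementation together with starter-kit-htsget + starter-kit-storage-and-interfaces, and have two questions:

  1. I am not able to use a private key that uses a passphrase, even if the passphrase is empty. Is it possible?
  2. Would it not be safer/sounder to decrypt the file inside of the execution container, instead of first decrypting and then copying the decrypted file to the container?

Thanks :)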

mrtamm commented 1 month ago

Hi and thank you for the feedback!

I am not able to use a private key that uses a passphrase, even if the passphrase is empty. Is it possible?

As shown above, I used the --nocrypt option to generate the keys without a passphrase, so it should work in that case. However, if the environment where Funnel is running contains the variable C4GH_PASSPHRASE, crypt4gh uses that value to decrypt the key. At the moment, this is the only way to use a passphrase-protected key. In principle, the passphrase could also be added to the Funnel configuration file.
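
As a rough illustration of the environment-variable route, the helper below prepares the arguments and environment for an external decryption call. It is a sketch, not what Funnel does internally: `crypt4gh_decrypt_call` is a hypothetical name, and the `decrypt --sk` form is that of the Python crypt4gh CLI (other implementations use different flags).

```python
import os


def crypt4gh_decrypt_call(private_key_path, passphrase=None, base_env=None):
    """Return (argv, env) for running `crypt4gh decrypt` from stdin to stdout."""
    env = dict(os.environ if base_env is None else base_env)
    if passphrase is not None:
        # crypt4gh reads the key passphrase from this variable when it is set.
        env["C4GH_PASSPHRASE"] = passphrase
    argv = ["crypt4gh", "decrypt", "--sk", private_key_path]
    return argv, env
```

The pair can be handed to `subprocess.run(argv, env=env, stdin=encrypted_file, stdout=output_file)`, with both files opened in binary mode.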

Would it not be safer/sounder to decrypt the file inside of the execution container, instead of first decrypting and then copying the decrypted file to the container?

It depends. If decryption has to be done in the container, this additional task is left to the container developer. However, the private key is already on the host system, so the decryption can just as well be executed outside the container. For the sake of user experience, I decided to decrypt the file beforehand and leave the security task to the maintainer of the host system (where Funnel is running).

This is the approach I judged would work best, but if there are other ways to solve it, I would gladly discuss them.

MalinAhlberg commented 1 month ago

Thanks for the answers, @mrtamm! I think your reasoning makes sense, and I now have the complete setup running :+1:

A side note, in case someone else finds it useful: the htsget command (cmd1) might hang if something goes wrong with the decryption (cmd2) and it stops reading from the pipe, for example when an old version of crypt4gh is used.
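
One generic way to keep such a two-stage pipeline from hanging is sketched below. It is not taken from Funnel: `run_pipeline` is a hypothetical helper, and `cmd1` and `cmd2` stand for whatever argument lists are actually used. The parent closes its own copy of the pipe, so cmd1 gets a broken-pipe error when cmd2 exits early, and a timeout covers the case where cmd2 stays alive without reading.

```python
import subprocess


def run_pipeline(cmd1, cmd2, timeout=60):
    """Run `cmd1 | cmd2`; return (rc1, rc2, stdout of cmd2) or raise on timeout."""
    p1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(cmd2, stdin=p1.stdout, stdout=subprocess.PIPE)
    # Drop the parent's copy of the pipe: once cmd2 exits, cmd1 then gets a
    # broken-pipe error instead of blocking on a pipe nobody reads.
    p1.stdout.close()
    try:
        out, _ = p2.communicate(timeout=timeout)
        rc1 = p1.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # cmd2 is alive but not consuming input (or cmd1 never finished).
        for proc in (p1, p2):
            proc.kill()
        for proc in (p1, p2):
            proc.wait()
        raise
    return rc1, p2.returncode, out
```

A non-zero exit code from either stage, or a `subprocess.TimeoutExpired`, then signals that the download or the decryption went wrong.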

mrtamm commented 1 month ago

Thanks for the feedback! I do need to check how problems can be detected when something goes wrong with the commands. I'm also considering support for other crypt4gh implementations (they have different CLI flags), or alternatively integrating decryption into the Funnel source code. I estimate this will be ready by the end of June.