ga4gh / Get-Started-with-GA4GH-APIs

ISMB 2022 tutorial on FASP and Starter Kit
Apache License 2.0

Clarifications for session 5 #16

Closed yash-puligundla closed 2 years ago

yash-puligundla commented 2 years ago
  1. After requesting /search on the Data Connect server, which provides the DRS URIs for the CRAM, CRAI, and bundle, what is the next step? Do we take these URIs and go to the DRS server?
  2. How is the DRS URI resolved to an accessible S3 URL? Is there an endpoint that does this, or are we doing the resolution manually?
  3. What is the DRS endpoint that gives information about passport brokers and required visas?
  4. How can I remove the default test visas (StarterKitDatasetsControlledAccessGrants, DatasetAlpha, DatasetBeta, DatasetGamma) available in the Passport Broker?
  5. What is the header field name for adding a passport JWT to a DRS request? Can you provide an example request?
  6. Does running python3 scripts/add-known-visas-to-drs.py and python3 scripts/populate-drs.py add the 1000 Genomes sample population-based visa information and 1000 Genomes sample data to the DRS server? Is there a correct sequence in which to run these two scripts?
  7. What is the concluding step of session 5? Does it end with DRS returning the S3 URL for the requested DRS object and the participant being able to access this S3 URL to download a file?
  8. Since we use DRS in both session 4 and session 5, and only the DRS in session 5 uses passports for authorization, how is this configured?
  9. For the Data Connect 1000 Genomes table, we took a subset based on the rules below, which resulted in 200 rows:
    
2 unique records in sex {'male', 'female'}

26 unique records in population_code {'ITU', 'ASW', 'JPT', 'MSL', 'CHS', 'CDX', 'YRI', 'ACB', 'MXL', 'PUR', 'FIN', 'GWD', 'LWK', 'GIH', 'CLM', 'TSI', 'PEL', 'PJL', 'GBR', 'CHB', 'BEB', 'ESN', 'KHV', 'CEU', 'IBS', 'STU'}

30 unique records in population_name {'Bengali,Bengali', 'African Ancestry SW', 'Punjabi', 'Dai Chinese', 'Gambian Mandinka', 'Yoruba', 'British', 'Japanese', 'Iberian', 'African Caribbean', 'Mende', 'Southern Han Chinese', 'Han Chinese', 'Luhya', 'Kinh,Kinh Vietnamese', 'Toscani', 'Luhya,Luhya', 'Kinh Vietnamese', 'Tamil', 'Gujarati', 'Bengali', 'Finnish', 'CEPH', 'Telugu', 'Peruvian', 'Esan', 'Colombian', 'Punjabi,Punjabi', 'Puerto Rican', 'Mexican Ancestry'}

5 unique records in superpopulation_code {'AMR', 'EAS', 'EUR', 'SAS', 'AFR'}

8 unique records in superpopulation_name {'East Asian Ancestry', 'European Ancestry', 'African Ancestry', 'American Ancestry', 'East Asia (SGDP),East Asian Ancestry', 'South Asian Ancestry', 'South Asia (SGDP),South Asian Ancestry', 'African Ancestry,Africa (SGDP)'}


I see that the DRS database has 46 rows of 1000 Genomes samples in it. Is there a different set of rules that was used to obtain this subset? If yes, please let me know the rules used, so I can make sure that Data Connect uses the same subset.

Hi @jb-adams. I have a few questions regarding session 5. Would appreciate your help with these. Thank you!!
ianfore commented 2 years ago

Answers to some of the questions below.

  1. After requesting /search on the Data Connect server, which provides the DRS URIs for the CRAM, CRAI, and bundle, what is the next step? Do we take these URIs and go to the DRS server? Yes.
  2. How is the DRS URI resolved to an accessible S3 URL?
    • Either: a) for a host-based URI, build the DRS endpoint from the host name. E.g. for drs://nci-crdc.datacommons.io/098e18d4-5ece-4bc6-9a79-68f5082da9bc, the DRS endpoint would be https://nci-crdc.datacommons.io/ga4gh/drs/v1. Or b) for a prefix-based URI, resolve the DRS prefix. E.g. for drs://dg.4DFC:098e18d4-5ece-4bc6-9a79-68f5082da9bc, use the GA4GH Service Registry (or identifiers.org) to find the host. See fasp.loc.DRSMetaresolver for examples.
    • Extract the drs_id part from the URI. E.g. in both examples above, the drs_id is 098e18d4-5ece-4bc6-9a79-68f5082da9bc.
    • Call /objects/{object_id}. From the response, determine which access method you want to use. Access methods correspond to copies of the object on different cloud providers, and possibly in multiple regions within the same provider. Pick the one that works for where you will do your compute.
    • Call /objects/{object_id}/access/{access_id}, including authentication if needed.
  3. What is the Drs endpoint that gives the information about passport brokers and required visas?
    • Still under development
    • Check with Jeremy
  4. What is the conclusion step of session 5? Does it end with the Drs returning the s3 URL for the requested Drs object and the participant being able to access this s3 URL to download a file?
    • Passing it to a WES service
    • Suggest not downloading the file. The whole point is to compute on it in place

All of the above are illustrated in many of the fasp-scripts notebooks. A good example is https://github.com/ga4gh/fasp-scripts/blob/master/notebooks/FASPNotebook18-GTEXExample-AWS.ipynb. You would have to look at the clients in the fasp-scripts repo to see the underlying REST API calls.
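The host-based resolution steps in answer 2 can be sketched in a few lines of Python. This is a rough sketch only: the function names are illustrative (not from fasp-scripts), and the actual HTTP calls, which may need authentication, are left as comments.

```python
from urllib.parse import urlparse

def drs_uri_to_objects_url(drs_uri: str):
    """Split a host-based DRS URI into (DRS /objects URL, drs_id)."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs":
        raise ValueError(f"not a DRS URI: {drs_uri}")
    host, drs_id = parsed.netloc, parsed.path.lstrip("/")
    return f"https://{host}/ga4gh/drs/v1/objects/{drs_id}", drs_id

def pick_access_method(drs_object: dict, preferred_type: str = "s3"):
    """Pick the access method matching where you plan to compute."""
    for method in drs_object.get("access_methods", []):
        if method.get("type") == preferred_type:
            return method
    return None

url, drs_id = drs_uri_to_objects_url(
    "drs://nci-crdc.datacommons.io/098e18d4-5ece-4bc6-9a79-68f5082da9bc")
# Next you would GET url, pick an access method from the response, and call
# f"{url}/access/{method['access_id']}" (with auth) to get a fetchable URL.
```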

yash-puligundla commented 2 years ago

Thank you @ianfore

ianfore commented 2 years ago

@yash-puligundla you'll find that identifiers.org will not only resolve the identifier prefix, but also redirect you straight to the DRS /objects response. So you would take the drs://dg.4dfc:098e18d4-5ece-4bc6-9a79-68f5082da9bc example above and use identifiers.org as follows: https://identifiers.org/dg.4dfc:098e18d4-5ece-4bc6-9a79-68f5082da9bc

That behavior is down to the way that Michael, Binam and/or their colleagues registered the prefix with identifiers.org. The url_pattern is the thing that does it. The relevant details are shown on this page https://registry.identifiers.org/registry/dg.4dfc#! Other DRS servers would do the same.
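To make the redirection concrete, here is a small sketch of turning a compact (prefix-based) DRS URI into a metaresolver URL. The helper name is made up, and the redirect target depends entirely on the url_pattern registered for the prefix.

```python
def compact_drs_to_metaresolver_url(drs_uri: str,
                                    metaresolver: str = "https://identifiers.org") -> str:
    """drs://{prefix}:{id} -> {metaresolver}/{prefix}:{id}

    The metaresolver then redirects to the DRS /objects response using
    the url_pattern registered for the prefix.
    """
    if not drs_uri.startswith("drs://"):
        raise ValueError(f"not a DRS URI: {drs_uri}")
    curie = drs_uri[len("drs://"):]  # e.g. "dg.4dfc:098e18d4-..."
    if ":" not in curie:
        raise ValueError("not a compact (prefix-based) DRS URI")
    return f"{metaresolver}/{curie}"

print(compact_drs_to_metaresolver_url(
    "drs://dg.4dfc:098e18d4-5ece-4bc6-9a79-68f5082da9bc"))
# prints https://identifiers.org/dg.4dfc:098e18d4-5ece-4bc6-9a79-68f5082da9bc
```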

yash-puligundla commented 2 years ago

> @yash-puligundla you'll find that identifiers.org will not only resolve the identifier prefix, but also redirect you straight to the DRS /objects response. so you would take the drs://dg.4dfc:098e18d4-5ece-4bc6-9a79-68f5082da9bc example above and use identifiers.org as follows: https://identifiers.org/dg.4dfc:098e18d4-5ece-4bc6-9a79-68f5082da9bc
>
> That behavior is down to the way that Michael, Binam and/or their colleagues registered the prefix with identifiers.org. The url_pattern is the thing that does it. The relevant details are shown on this page https://registry.identifiers.org/registry/dg.4dfc#! Other DRS servers would do the same.

This is good to know. Thank you, Ian. But I doubt this is the case with the Starter Kit: "https://identifiers.org/HG00284.1kgenomes.wgs.downsampled.bundle" is invalid, which makes me think the prefix is not registered with identifiers.org.

I believe we can use the first option you listed above

> a) for a host based URI, build the drs end point from the host name --- e.g. for drs://nci-crdc.datacommons.io/098e18d4-5ece-4bc6-9a79-68f5082da9bc --- the DRS endpoint would be https://nci-crdc.datacommons.io/ga4gh/drs/v1

Here is an example DRS URI from Data Connect: "drs://localhost:5000/HG00284.1kgenomes.wgs.downsampled.bundle". In this example, I am not sure how I would get the DRS object ID from the URI.

Is the DRS object ID "HG00284.1kgenomes.wgs.downsampled.bundle", or is there a UUID that I need to obtain from somewhere? (I think this might be very specific to the Starter Kit implementation.)

ianfore commented 2 years ago

Just posted this as a separate issue. Then I saw your last paragraph. The new issue essentially addresses what you ask. I wanted to separate the id issue from the resolution issue.

Back to resolution. No, identifiers.org knows nothing about your local host.

Prefix-based DRS ids are a good thing though.

Watch this space!

ianfore commented 2 years ago

The prefix resolver that I understood could be run locally is Bioregistry. First off though, note that it can be used in the same way as identifiers.org: https://bioregistry.io/dg.4dfc:098e18d4-5ece-4bc6-9a79-68f5082da9bc. Same DRS id and prefix as above, different metaresolver.

Running Bioregistry locally: see https://github.com/ga4gh/ismb-2022-ga4gh-tutorial/tree/main/supporting/bioregistry

jb-adams commented 2 years ago

Hi @yash-puligundla,

  1. After requesting /search on the Data Connect server, which provides the DRS URIs for the CRAM, CRAI, and bundle, what is the next step? Do we take these URIs and go to the DRS server?

Yes, you can take these URIs and request the DRS Object from DRS. You will get an unauthorized error because you don't have a passport yet, but this is good for showing that there is access control.

  2. How is the DRS URI resolved to an accessible S3 URL? Is there an endpoint that does this, or are we doing the resolution manually?

The DRS spec mandates that the DRS URI be resolvable to an HTTP(S) URL via a simple pattern: drs://{host}/{id} -> http(s)://{host}/ga4gh/drs/v1/objects/{id}, so you could show that part of the spec and then have the class manually convert the DRS URI to an HTTP URL to get the DRS Object.
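Applied to the Starter Kit's own URIs, that pattern is a plain string substitution. A minimal sketch (the helper name is illustrative, and plain http for the localhost deployment is an assumption, since the spec pattern uses https):

```python
def starter_kit_drs_url(drs_uri: str, scheme: str = "http") -> str:
    """Apply drs://{host}/{id} -> {scheme}://{host}/ga4gh/drs/v1/objects/{id}."""
    host_and_id = drs_uri.removeprefix("drs://")
    host, _, object_id = host_and_id.partition("/")
    return f"{scheme}://{host}/ga4gh/drs/v1/objects/{object_id}"

print(starter_kit_drs_url("drs://localhost:5000/HG00284.1kgenomes.wgs.downsampled.bundle"))
# prints http://localhost:5000/ga4gh/drs/v1/objects/HG00284.1kgenomes.wgs.downsampled.bundle
```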

  3. What is the DRS endpoint that gives information about passport brokers and required visas?

There are multiple endpoints I'll outline:

So you can see that there is an OPTIONS request analog to both single and bulk DRS Object requests. You can review the controller for more info on expected payload/response.

  4. How can I remove the default test visas (StarterKitDatasetsControlledAccessGrants, DatasetAlpha, DatasetBeta, DatasetGamma) available in the Passport Broker?

To do this, you'll have to create a new sqlite database rather than using the default database that's bundled in the docker image, because that one contains the test dataset. For reference, look at what I did for DRS in session 4. There's a resources/drs/db folder; when drs-migrate and drs-dataset are run (in the docker-compose), a db gets built there. That db gets mounted into the DRS server at /db. The config file at resources/drs/config/config.yml specifies the database URL as the file at /db/drs.db (a path relative to the docker container, not the host machine).

So you want to do something similar for passport. Use the SQL script that creates the passport table schema, but not the script that adds the test dataset. Then mount that db into the docker container, and use a config file to point the app to the non-default db. This will wire the app to a db with the correct schema, but no data/visas in it. @emre-f can then develop a script to make administrative POST requests to the passport broker to create the visas, and assign them to the researcher/user.
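Following that recipe, building a schema-only passport database might look something like the sketch below. All paths are stand-ins, and the CREATE TABLE statement is a made-up placeholder for the broker's real schema script, which should be used instead.

```shell
# Hypothetical sketch: create a passport db with the table schema only,
# skipping the script that inserts the default test visas.
mkdir -p /tmp/resources/passport/db
cat > /tmp/passport-schema.sql <<'SQL'
-- stand-in for the broker's real create-tables script
CREATE TABLE IF NOT EXISTS passport_visa (
    id TEXT PRIMARY KEY,
    visa_name TEXT,
    visa_issuer TEXT
);
SQL
sqlite3 /tmp/resources/passport/db/passport.db < /tmp/passport-schema.sql
# Deliberately do NOT run the insert script that adds DatasetAlpha/Beta/Gamma.
# Then mount the db directory into the container at /db and point the config
# file's database URL at /db/passport.db.
```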

  5. What is the header field name for adding a passport JWT to a DRS request? Can you provide an example request?

It's in the POST request body:

{
    "passports": [
        "{jwt}"
    ]
}

So a field called passports that holds an array of one or more passport JWTs. You should only need to provide one since we're not demonstrating multi-broker.
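A minimal sketch of building that body (the JWT value is a placeholder, and the endpoint shown in the comment comes from the DRS URI resolution discussed earlier):

```python
import json

def build_passport_body(passport_jwts):
    """DRS expects a JSON body with a "passports" array of JWT strings."""
    return json.dumps({"passports": list(passport_jwts)})

body = build_passport_body(["<passport-jwt-placeholder>"])
print(body)
# POST it, e.g. with requests (not executed here):
#   requests.post("http://localhost:5000/ga4gh/drs/v1/objects/<object_id>",
#                 data=body, headers={"Content-Type": "application/json"})
```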

  6. Does running python3 scripts/add-known-visas-to-drs.py and python3 scripts/populate-drs.py add the 1000 Genomes sample population-based visa information and 1000 Genomes sample data to the DRS server? Is there a correct sequence in which to run these two scripts?

add-known-visas-to-drs should be run first, followed by populate-drs.

  7. What is the concluding step of session 5? Does it end with DRS returning the S3 URL for the requested DRS object and the participant being able to access this S3 URL to download a file?

Yeah that's a good place to end it. I don't think it's necessary to run the workflow via WES again, you could just state that "now that we can access controlled access DRS Objects using our passport, we are technically ready to run the workflow as we did in Session 4"

  8. Since we use DRS in both session 4 and session 5, and only the DRS in session 5 uses passports for authorization, how is this configured?

DRS in session 4 makes no use of the passport/visa related tables, so they can be completely ignored in session 4. Session 5 makes use of the passport/visa tables and assigns DRSObjects to Visas, which essentially states, "for this DRSObject, the researcher needs to present this visa to obtain access". The wiring of DRSObjects to required visas is handled by the two Python scripts.
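The object-to-visa wiring can be pictured with a toy schema. The table, column, and visa names below are illustrative only, not the Starter Kit's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drs_object (id TEXT PRIMARY KEY);
    CREATE TABLE visa (id INTEGER PRIMARY KEY, name TEXT);
    -- join table: which visa(s) must be presented for each object
    CREATE TABLE drs_object_visa (
        drs_object_id TEXT REFERENCES drs_object(id),
        visa_id INTEGER REFERENCES visa(id)
    );
""")
obj = "HG00284.1kgenomes.wgs.downsampled.bundle"
conn.execute("INSERT INTO drs_object VALUES (?)", (obj,))
conn.execute("INSERT INTO visa VALUES (1, 'ExampleControlledAccessVisa')")
conn.execute("INSERT INTO drs_object_visa VALUES (?, 1)", (obj,))

required = [row[0] for row in conn.execute(
    "SELECT v.name FROM visa v "
    "JOIN drs_object_visa ov ON ov.visa_id = v.id "
    "WHERE ov.drs_object_id = ?", (obj,))]
print(required)  # the visa(s) a researcher must present for this object
```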

  9. For the Data Connect 1000 Genomes table, we took a subset based on the rules below, which resulted in 200 rows. I see that the DRS database has 46 rows of 1000 Genomes samples in it. Is there a different set of rules that was used to obtain this subset? If yes, then let me know the rules used, so I can make sure that Data Connect uses the same subset.

If you look at the S3 bucket I made for the tutorial, there are 2 directories of CRAMs. I believe they are "highcov" and "lowcov." The CRAMs/DRS IDs in highcov should be used for session 4, because that's what I used to test CNest. In Session 5, you can use the data in lowcov. The data in lowcov should be 200 CRAMs and CRAIs that correspond to the data connect dataset.

The script in session 5 should populate DRS with 600 DRS Objects (1 CRAM, 1 CRAI, and 1 bundle for each of the 200 samples), as well as a FASTA file, its index, and a BED file.

yash-puligundla commented 2 years ago

Thank you very much @jb-adams

yash-puligundla commented 2 years ago

Hi @jb-adams I am trying to change the passport issuer from "https://ga4gh.org/" to something else. I wanted to check with you if I can use a config file for the passport-broker container to do this.

Here is the config.yaml file, but it doesn't work as expected.

passport-broker:
  brokerProps:
    passportIssuer: http://localhost:4455/

Can you please take a look and see if I am doing something incorrectly? Thanks!

yash-puligundla commented 2 years ago

I was able to figure out the error in the config file. Closing this issue as session 5 content is good to go!! Thank you, Ian and Jeremy!