ga4gh / fasp-scripts

Apache License 2.0

Getting started guide #29

Open jb-adams opened 3 years ago

jb-adams commented 3 years ago

@ianfore From our Connect call, I'm thinking it might be helpful to create a "getting started guide" as an entry point for running these scripts. As we want more of the community to use and contribute to these scripts, we'll want to give them an easy path into working with this repo.

This guide could take the form of a one-pager within the repo that explains how to register for the various services/platforms, how to configure keys locally, how to test scripts against expected output, etc. It would take a user/researcher with no pre-existing identity on any of the FASP platforms to the point of being able to run most, or all, of the scripts. Since I fall into this category (I only have an ID with CGC and Cavatica), I'm happy to take notes on the process and collate them into a getting started guide.

Does this sound useful?

@briandoconnor @mbarkley

ianfore commented 3 years ago

Yes this is definitely needed. Some of the existing issues are intended to help with that. #5, #7, #8, #26 and #28 for example. Some have been dealt with but the repo is not yet at a point where it's easy to pick up and get started.

I have to take some responsibility for this, so am self-assigning, but any help would be welcome.

ianfore commented 3 years ago

Created starter branch.

jb-adams commented 3 years ago

I'm happy to work on this as well. I can bring the perspective of someone who doesn't have existing accounts with any of these services and test whether I'm able to use the guide and run the scripts successfully.

If it works for you, I can create a branch on my fork off of your fork.

ianfore commented 3 years ago

Yes, please fork as needed.

I was going to go back to square one and create a new python environment using conda or venv and check the dependencies bottom up.

I'm still reluctant to list everything used in the scripts as a dependency. Not all users would want to run all scripts - so I wanted to avoid them having to install packages they wouldn't use.

A reasonable strategy would be to list as dependencies only those modules used by the fasp package, and leave out anything that is used only within specific scripts, but we'll have to review that as we go.
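One way to do that bottom-up check is to scan the package's own source for imports and report which ones are not installed. A minimal sketch using only the standard library; the function names are hypothetical and not part of fasp:

```python
import ast
import importlib.util
from pathlib import Path


def imported_modules(package_dir):
    """Collect top-level module names imported by .py files under package_dir."""
    names = set()
    for path in Path(package_dir).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                # level == 0 skips relative imports inside the package itself
                names.add(node.module.split(".")[0])
    return names


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return sorted(n for n in names if importlib.util.find_spec(n) is None)
```

Running missing_modules(imported_modules("fasp")) in a fresh conda or venv environment would list what still needs installing. The result also includes standard-library modules and the package's own name where it imports itself absolutely, so it needs a manual pass before it becomes a dependency list.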

Besides that, we also need to consider this as a 'tutorial' to get someone started and able to do something useful with minimal preliminaries.

jb-adams commented 3 years ago

For simplicity I've listed out all the dependencies in setup.py so that python setup.py install takes care of all of them. If we want something more sophisticated we can consider a CLI option when running install to handle different "dependency groups", but for now I think it's good to have all dependencies transparently listed and installable via one command.
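For reference, setuptools' extras_require is one way to express such dependency groups. A minimal sketch; the group names and package lists are illustrative assumptions, not the repo's actual dependencies:

```python
# setup.py sketch. Package and group names are assumptions for illustration.
CORE_REQUIRES = ["requests"]  # assumed: modules needed by the fasp package itself

EXTRAS = {
    "bigquery": ["google-cloud-bigquery"],  # assumed: only for BigQuery-based scripts
    "notebooks": ["jupyter", "pandas"],     # assumed: only for running notebooks
}
# Convenience group that installs everything with one command.
EXTRAS["all"] = sorted({pkg for group in EXTRAS.values() for pkg in group})


def main():
    from setuptools import setup, find_packages

    setup(
        name="fasp",
        packages=find_packages(),
        install_requires=CORE_REQUIRES,
        extras_require=EXTRAS,
    )

# In a real setup.py, call main() at module level.
```

With this layout, pip install ".[notebooks]" would pull in only the core and notebook packages, while pip install ".[all]" keeps the current one-command behaviour.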

I'm trying to work up to running FASPScript2.py successfully, but I don't have the FASP_SETTINGS environment variable set. Can you explain what this is so I can translate it into the guide?

jb-adams commented 3 years ago

Never mind, found the FASP_SETTINGS example.
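For the guide, reading that setting could be sketched as follows. This assumes FASP_SETTINGS holds the path to a JSON settings file; the function name is hypothetical and the file's keys are not shown here:

```python
import json
import os


def load_settings(env_var="FASP_SETTINGS"):
    """Load the JSON file whose path is stored in the given environment variable."""
    path = os.environ.get(env_var)
    if not path:
        # Fail early with a message a first-time user can act on.
        raise RuntimeError(
            f"{env_var} is not set; point it at your settings file before running a script"
        )
    with open(os.path.expanduser(path), encoding="utf-8") as handle:
        return json.load(handle)
```

Failing with an explicit message when the variable is missing gives a new user something to act on instead of a bare traceback.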

jb-adams commented 3 years ago

Hi @ianfore, I'm trying to run FASPScript2.py to completion, while translating all the workup steps into the getting started guide. Currently, I'm getting a 403 error when trying to run the bdcquery (line 47). Who is hosting this service, and who should I reach out to for access?

ianfore commented 3 years ago

That query is against a BigQuery table which contains controlled access data. I created the table from a dbGaP file to which I have been granted access, but I can't grant that access to anyone else. We could follow through on that but it would distract from what we're trying to get done. (Created issue #31 which we can pursue in parallel to deal with the access issue).

More directly we should find a script that fits the criterion to "get someone started and able to do something useful with minimal preliminaries". FASPScript2 is nice, in that it's federating two sources, but it doesn't fit the bill for "minimal preliminaries".

FASPNotebook06 would be better. Neither the Search nor DRS steps will require any authenticated access. However, the WES step will require you to get a login to a WES server. I can't see a way around that for any WES server because I don't know of anyone prepared to give open access to compute. However, @mbarkley might grant you access to the DNAStack WES for what you want to do.

For steps beyond that, we could look at notebooks that use the various Seven Bridges WES servers. For the CRDC sponsored http://cgc.sbgenomics.com you should be able to create yourself an account with enough credits to do basic compute. The other SB instances which offer WES services (Cavatica and BioDataCatalyst) also offer some "starter" access.

General point about the scripts - I've shifted focus to the notebooks rather than the scripts. The most current work is on the notebooks because they seemed to have more relevance to the community. If that's not the case we can shift focus back to the scripts. (The notebook vs scripts question got some consideration in issue #7 if you want more background on the thinking, though it does get into the weeds of some WES issues).

jb-adams commented 3 years ago

Thanks for the clarification Ian, I'll move over to trying to replicate notebook 6 for now. I think we should provide some indication within the scripts about which ones are not possible to run based on closed access.

I think it should also be sufficient to say for certain scripts just what you mention here, i.e. "this script requires you to have an account with XYZ company and access to their WES service." Perhaps we can list out the "Platform reps" for each institute, e.g:

That way, we give people trying to run the fasp scripts a chance to jump off into setting up accounts with the related platforms, and a point of contact for further clarification if necessary.

ianfore commented 3 years ago

Yes, the access required would fall under the heading of the metadata that I felt we needed about scripts. Issue #8 touched on script metadata and which datasets scripts access, but we should revisit.
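Such access metadata could start as a simple lookup from script to the access it needs. A minimal sketch; the structure and function names are hypothetical, and the two entries reflect only what has been said in this thread:

```python
# Hypothetical per-script access metadata; not an existing file in the repo.
SCRIPT_ACCESS = {
    "FASPScript2.py": ["controlled-access BigQuery table (dbGaP)"],
    "FASPNotebook06": ["WES server login"],  # Search and DRS steps are open
}


def blocked_by(script, granted):
    """Return the access requirements for script that are not in granted."""
    return [req for req in SCRIPT_ACCESS[script] if req not in granted]


def runnable(granted):
    """Return the scripts whose requirements are all covered by granted."""
    return sorted(s for s in SCRIPT_ACCESS if not blocked_by(s, granted))
```

For example, runnable({"WES server login"}) would return only FASPNotebook06, which makes it easy to tell a new user what they can run today and what is still blocking the rest.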

The access_keys.md page has the beginnings of some of how and where to get access to various systems.

Adding a to-do in here. Creating additional issues for these would work too.

ianfore commented 3 years ago

Added details for Seven Bridges keys. This required code changes so that the same approach is used for the SB WES and DRS services.

Felt that, rather than providing help desk contacts, the links to home pages for each of the systems would be sufficient. In each case that leads into the standard process for getting an account on the relevant system.

ianfore commented 2 years ago

Reviewing this. Much of what we set out to do was accomplished for the tutorial at ISMB, and with the addition of Starter Kit. SK adds examples of how, as a provider, to get DRS, WES, Data Connect and Passport running.

One rationalization, for fasp-scripts, was to separate the clients from the scripts [1]. That was done as the fasp-client branch. Suggest that we create a separate fasp-client repository for that.

This simplifies the complex access scenario required for all the scripts, and explored above. In total, the various scripts need keys for 10-12 systems [2]. That's too complex to manage, and likely unnecessary for all users. Most scripts probably only need to authenticate against three or four systems, sometimes fewer. Best to deal with that script by script.

The tutorial handled the "if you want to run this on Seven Bridges, contact Michele or SB support desk" question above.

[1] For "script" read script/notebook.

[2] Passport might change that, but that vision is not yet fulfilled, and we should not be dependent on it to explore other functionality in parallel.

Suggest we have the following to do's to close this issue: