ga4gh / fasp-scripts


Review FaspScript naming and documentation #8

Closed: ianfore closed this 3 years ago

ianfore commented 3 years ago

The number of scripts that now exist means that simply numbering the scripts is no longer informative.

A different approach to naming can help, but it is expecting too much of a naming convention to provide adequate information about what a script does. Simple metadata about the scripts is helpful too. The table on the readme page addresses that by listing which clients are used in a given script. Those clients are, at the moment, specific to implementations*, so one can also tell which implementations a script uses.

The logging performed by FASPRunner captures the necessary metadata automatically, and the data in the log provided an easy way to create the table for the readme.

Beyond the metadata currently being captured, it would be useful to know which data sources are queried, which workflows are run, and so on.

* This specificity is open to the criticism that the clients should be more general purpose. In some cases the specific client does no more than wrap the host URL of the service; that makes things slightly more convenient for script writers and seems worthwhile. In other cases the specificity reflects the different authentication and authorization behaviors of different implementations. Over time we might expect those differences to disappear, at which point common clients will work.

ianfore commented 3 years ago

Notebooks created so far have attempted to put some meaning in the filename. Notebooks can also be annotated to describe what they do, and this is strongly encouraged. In essence, the 'readme' is part of the notebook.

Scripts still need more metadata to say what they do. FASPRunner continues to be useful for that. Notebooks should tell FASPRunner or FASPLogger their name via program='scriptName/notebookName'.
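As a minimal sketch of that convention (assuming the `fasp.runner` module path used elsewhere in this repo; the exact constructor signature may differ):

```python
# Hypothetical sketch: a notebook identifying itself to FASPRunner explicitly.
# Assumes FASPRunner accepts a 'program' keyword as described above; the actual
# constructor in fasp-scripts may take additional or different arguments.
from fasp.runner import FASPRunner

faspRunner = FASPRunner(program='MyExampleNotebook.ipynb')
```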

ianfore commented 3 years ago

Added the ability for FASPRunner to obtain the name of the calling notebook. The method used derives from discussions in this Jupyter issue. This supersedes the need to provide the program name to FASPRunner explicitly.
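For reference, a hedged sketch of the kind of workaround discussed in that Jupyter issue: match the kernel id from the kernel connection file against the sessions reported by running notebook servers. This is illustrative only and not necessarily the exact code now in FASPRunner:

```python
import json
import os
import urllib.request

import ipykernel
from notebook import notebookapp


def calling_notebook_name():
    """Best-effort lookup of the running notebook's filename.

    Matches the kernel id embedded in the connection file name against
    the sessions API of each running (classic) notebook server.
    """
    connection_file = os.path.basename(ipykernel.get_connection_file())
    kernel_id = connection_file.split('-', 1)[1].split('.')[0]
    for server in notebookapp.list_running_servers():
        url = f"{server['url']}api/sessions?token={server.get('token', '')}"
        with urllib.request.urlopen(url) as resp:
            sessions = json.load(resp)
        for session in sessions:
            if session['kernel']['id'] == kernel_id:
                return os.path.basename(session['notebook']['path'])
    return None  # not running inside a notebook, or no matching session found
```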

ianfore commented 3 years ago

Added a readme for the notebooks, with a graphic table summarizing them.

jb-adams commented 3 years ago

2 thoughts here:

  1. I think a consistent notebook markdown header explaining the notebook's title, description, and which services it uses (and therefore which accounts the user needs) would go a long way here. We can put the header as the first cell in each notebook.
  2. I think we should also differentiate between notebooks that are in an "unfinished" state and those that are in a "published" state (i.e. notebooks that demonstrate a complete use case and are largely stable). The unfinished notebooks don't need any specific formatting or documentation, but we want our published notebooks to be well documented, so that someone coming to the repo can see our polished scripts and run them for themselves. This will allow us to refer newcomers to our most polished vignettes, while allowing us to do initial PoC work in a separate space (this could simply be two different folders, "draft" and "published", under the "notebooks" folder).
ianfore commented 3 years ago

Yes, a consistent header would help. FASPRunner does what it can to gather information about a script, and the getFASPIcon function is one way it uses that information. I've been manually adding the generated icon to the beginning of relevant scripts. A more comprehensive header could be generated.

A template FASPNotebook also seems a possibility, with the markdown header already in place. I was also wondering whether any of the standard 'metadata' markup might be used; Bioschemas, for example. We would have to check whether it has fields for a workflow.
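As an illustration of how such a template header might be generated, here is a hypothetical sketch using nbformat; the header fields, template, and function are placeholders, not an existing part of FASPRunner:

```python
# Hypothetical sketch: prepend a standard markdown header cell to a notebook.
# The header fields are placeholders; a real version might pull them from the
# metadata FASPRunner already logs (clients used, data sources, workflows).
import nbformat

HEADER_TEMPLATE = """# {title}

{description}

**Services used:** {services}
**Accounts/access required:** {access}
"""


def add_header(notebook_path, title, description, services, access):
    nb = nbformat.read(notebook_path, as_version=4)
    header = nbformat.v4.new_markdown_cell(
        HEADER_TEMPLATE.format(
            title=title,
            description=description,
            services=services,
            access=access,
        )
    )
    nb.cells.insert(0, header)
    nbformat.write(nb, notebook_path)
```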

On 2: the "in progress" folder was intended as the home for anything unfinished. The line between "unfinished" and "run for themselves" is always going to be tricky. It seems reasonable that we make available notebooks that can only be run by people who have access to the particular datasets. For one thing, readers can learn something applicable to their own datasets even if they don't have access to the datasets used in the example. They can also request access to the dataset. I hesitate to mark these as "in progress".