broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.71k stars 592 forks source link

All hands on deck: tool doc updates #3853

Closed sooheelee closed 6 years ago

sooheelee commented 6 years ago

Thank you everyone for your contributions towards this documentation effort. Instructions from @vdauwera to follow at this Google doc Favorite tool doc examples from @vdauwera NOW in her SOP doc. Spreadsheet from @sooheelee to be posted here

yfarjoun commented 6 years ago

Could you help us test our changes? how do I know that the documentation efforts worked? please inform picard and gatk efforts...

yfarjoun commented 6 years ago

Another question:

In picard it is customary to have a @author line, but it seems weird to have that show up in the actual documentation....there seem to be several options:

  1. Put the @author comment in a separate comment that doesn't get used in the docs (how?)
  2. Make the doc scripts strip out @author lines (can you do that?)
  3. Leave the @author lines in the documentation (seems odd)
  4. Remove the @author lines altogether (seems not nice)
sooheelee commented 6 years ago

I'm at the GATK workshop all day and can get to this this evening.

yfarjoun commented 6 years ago

Reference doesn't seem to be a class that implements CommandLineProgramGroup in Picard. given the group effort involved I suggest that adding these groups will be done centrally.

cmnbroad commented 6 years ago

I don't think the @author tags will show up in the doc; they generally don't show up in javadoc, and our doc system doesn't include them in the generated doc.@author tags are a bit sketchy to begin with, because they're block tags, so the extend all the way to the start of the next tag. But since they're stripped out, putting such a tag at the start of the javadoc can result in everything after it being silently dropped from the output. We found such a case in GATK a while back.

cwhelan commented 6 years ago

@vdauwera when running spark tools through the gatk-launch-soon-to-be-just-gatk script, you use a -- to separate arguments to the specific tool you're running from arguments that describe what sort of Spark setup you're trying to submit the job to. The most important of these is probably sparkRunner, which says what type of spark cluster you're using: LOCAL indicates that you want to simulate a spark cluster on your local machine using multithreading; SPARK indicates that you want to submit to a configured, dedicated spark cluster, in which case you have to pass the url to the master node, and GCS indicates that you want to use a cluster managed by Google Cloud Dataproc, in which case you need the name of the cluster. See https://github.com/broadinstitute/gatk#running-gatk4-spark-tools-on-a-spark-cluster for more info. Day-to-day on the SV team we primarily use dataproc, so the example command I was going to use for one tool looks like:

./gatk ParallelCopyGCSDirectoryIntoHDFSSpark \
          --input-gcs-path gs://my-bucket/my-data-directory/ \
          --output-hdfs-directory hdfs://my-dataproc-spark-cluster-m:8020/my-data \
          -- \
          --sparkRunner GCS \
          --cluster my-dataproc-spark-cluster

I wasn't sure if it was cool to use the GCS/dataproc version or if we'd rather not tie ourselves to GCP in the examples. In this particular case I feel like it might be appropriate since it's a GCS-specific tool, but my question was more about our general strategy.

sooheelee commented 6 years ago

@samuelklee, thanks for the update and suggestion. I moved CollectAllelicCounts to the Coverage Analysis category. CollectFragmentCounts isn't on the list currently so I added it to the same. I hope I'm not missing a bunch of other new tools given I missed this one.

@yfarjoun

yfarjoun commented 6 years ago

I just pushed a branch: https://github.com/broadinstitute/picard/tree/yf_documentation_update we can use that for initial testing.

On Tue, Dec 5, 2017 at 1:56 PM, sooheelee notifications@github.com wrote:

@samuelklee https://github.com/samuelklee, thanks for the update and suggestion. I moved CollectAllelicCounts to the Coverage Analysis category. CollectFragmentCounts isn't on the list currently so I added it to the same. I hope I'm not missing a bunch of other new tools given I missed this one.

@yfarjoun https://github.com/yfarjoun

  • You are now in charge of deciding whether we should include authorship in code. What the Comms team wants is for authorship to NOT show up in the gatkDoc/javaDoc. If you want to keep them, author lines should be at the bottom and formatted so they do not show up in the documentation. Geraldine is fine with completely removing them if you prefer that. There is a format trick that has javaDoc skip the author line and I can get that to you if you decide to keep some of these and @vdauwera https://github.com/vdauwera would know this or I can get you what I see in other docs. Let either of us know.
  • I can help you test your changes. I think the categories are good to go now so I will need to put these into both Picard and GATK HelpConstants.java, with the latter being a placeholder until the new Picard release is incorporated into the next GATK release, with variables that then must be included in each tool doc. I will find an example in a bit. Which tool do you want to test? @cmnbroad https://github.com/cmnbroad can explain the engineering details in engineering lingo if you need more information.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/3853#issuecomment-349404645, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0jIdprE580XBgq1jL-EIV1hFOcDyks5s9ZHAgaJpZM4QitCF .

cmnbroad commented 6 years ago

@vdauwera @sooheelee I just noticed a PR that has a command line usage example embedded in the "usageExample" field of the CommandLineProgramProperties annotation. I'm not sure if this was part of the SOP (I didn't see it mentioned), but that field is not integrated into either doc or command line help output, and should probably not be used. I didn't want to blast anything out without first mentioning it to you though.

sooheelee commented 6 years ago

Here is an example of an authorship that does not show up in our docs. IT MUST BE AT THE END OF THE JAVADOC PORTION. Remember @cmnbroad said that otherwise it cuts off the document abruptly. That is, anything that follows the * @author Tim Fennell is omitted from display.

/**
 * Attempts to check the sample identity of the sequence/genotype data in the provided file (SAM/BAM or VCF)
 * against a set of known genotypes in the supplied genotype file (in either GELI or VCF format).
 *
 * @author Tim Fennell
 */
vruano commented 6 years ago

Picard docs must reflect Picards command line syntax or their syntax as how it gets exposed thru GATK? I.e. INPUT=XXX or -I|--INPUT XXX?

yfarjoun commented 6 years ago

@vruano: my current understanding is that we will use the "legacy" format in picard, and have the website or some other black magic fix it up. (i.e. INPUT=XXX

sooheelee commented 6 years ago

It would be nice to have a few Picard tools that we feature in BPWs (e.g. MarkDuplicates) show both syntaxes:

  1. java -jar picard.jar ValidateSamFile I=reads.bam MODE=SUMMARY
  2. gatk ValidateSamFile -I reads.bam -MODE SUMMARY

I assume MODE gets one dash and not two, but really I don't know. So having such examples scattered throughout the tool chest for most often used tools would be great.

The second example could be prefaced with, e.g.

Picard tools are callable from GATK4. The equivalent command to the above invoked via the GATK launch script would be:

sooheelee commented 6 years ago

I've added a fourth tab 'categories_summaries' to the spreadsheet.

The categories and their summaries are tabulated. @cmnbroad is starting a PR to implement these categories as program groups and in the HelpConstants.

I've written summaries to my own liking and these are up for discussion. doc_cat_figure

yfarjoun commented 6 years ago

I think that it would be fine to have both legacy and modern styles in the examples, but I think having gatk in the picard docs will not pass review.

On Tue, Dec 5, 2017 at 10:41 PM, sooheelee notifications@github.com wrote:

It would be nice to have a few Picard tools that we feature in BPWs (e.g. MarkDuplicates) show both syntaxes:

  1. java -jar picard.jar ValidateSamFile I=reads.bam MODE=SUMMARY
  2. gatk ValidateSamFile -I reads.bam -MODE=SUMMARY

I assume MODE gets one dash and not two, but really I don't know. So having such examples scattered throughout the tool chest for most often used tools would be great.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/3853#issuecomment-349522774, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0ht41UsGHu_2TgrbKKNJwDepEdMZks5s9gz4gaJpZM4QitCF .

sooheelee commented 6 years ago

@yfarjoun Do you mean TRAVIS tests or human-review?

sooheelee commented 6 years ago

@yfarjoun I've built docs from your branch to see what the changes look like. Here's the PicardDoc bundle I've zipped for you: picarddoc_yossi_test.zip. View the rest by loading index.html into a browser. Here is one of the tool docs where you can see your changes:

screenshot 2017-12-06 10 18 34

Steps to build docs to see what changes look like:

[1] Download and switch to branch with changes, e.g.

git branch yf_documentation_update origin/yf_documentation_update
git checkout yf_documentation_update

[2] Generate the docs. GATK and Picard do this differently:

./gradlew clean gatkDoc

or

./gradlew clean picardDoc

[3] Open index.html from a browser. Again, GATK and Picard have different locations for their respective docs:

build/docs/gatkdoc build/docs/picarddoc

sooheelee commented 6 years ago

Finally, @yfarjoun, to format example commands, which CrossCheckFingerprints doc is currently lacking, here is an example from HaplotypeCaller.

[A] CSS-transformed forum

screenshot 2017-12-06 10 48 33

[B] gatkdoc/picarddoc view on browser

screenshot 2017-12-06 10 48 04

[C] javadoc portion of code

screenshot 2017-12-06 10 48 55

The last is hard to read so here is its contents again:

 * <h3>Usage examples</h3>
 *
 * <p>These are example commands that show how to run HaplotypeCaller for typical use cases. Have a look at the <a href='https://software.broadinstitute.org/gatk/documentation/article?id=3893'>method documentation</a> for the basic GVCF workflow. </p>
 *
 * <br />
 * <h4>Single-sample GVCF calling (outputs intermediate GVCF)</h4>
 * <pre>
 * gatk-launch --javaOptions "-Xmx4g" HaplotypeCaller  \
 *   -R reference.fasta \
 *   -I input.bam \
 *   -O output.g.vcf \
 *   -ERC GVCF
 * </pre>
 *
 * <h4>Single-sample GVCF calling with <a href='https://software.broadinstitute.org/gatk/documentation/article?id=9622'>allele-specific annotations</a></h4>

Be sure to test the example commands you put in. @vruano already found some that do not work.

sooheelee commented 6 years ago

@yfarjoun Since gatkDocs uses the javaDoc portion of the code to generate content, including the example command. It cannot pull the example command from elsewhere as @cmnbroad notes above.

@cmnbroad:

I just noticed a PR that has a command line usage example embedded in the "usageExample" field of the CommandLineProgramProperties annotation.

Is this field something we could take advantage of to differentiate content for the GATK forum versus Picard commandline help?

vruano commented 6 years ago

I guess we need to do some work on the doclet code, abstract out example code and use templates to transform it into the appropriate format/syntax depending what project is generating the documentation.

Alternatively and only if documentation html is well formed (xhtml like) then in theory we could use XSLT transformation style sheets to convert embedded code example encoded with xml/xhtml into the concrete syntax. Most major browsers support XSLT.

EDIT: The XSLT solution won't probably work since even if we try to change the output from html to xhtml, the fact that we are injecting the javadoc's html would probably break the xhtml.

sooheelee commented 6 years ago

Thanks for taking this on @vruano.

vruano commented 6 years ago

I created a separated issue about this https://github.com/broadinstitute/gatk/issues/3932

sooheelee commented 6 years ago

If your PR is ready for review, please copy-paste url to PR like @vruano did for Picard MergeVcfs in the spreadsheet AND tag @sooheelee so I can find it easily.

sooheelee commented 6 years ago

Just to be clear folks, we are using gatk directly, not ./gatk in the example commands. And if you can, for those of you yet to make your updates, please use compressed file examples. Those of you who've already put in changes, thank you and Comms can tidy those bits later.

<h3>Usage examples</h3>
<pre>
gatk --javaOptions "-Xmx4g" GenotypeGVCFs \
    -R Homo_sapiens_assembly38.fasta
    -V combined.g.vcf.gz
    -O cohort.vcf.gz
</pre>

<pre>
gatk GenotypeGVCFs \
    -R reference.fa
    -V combined.g.vcf.gz
    -O cohort.vcf.gz
</pre>
sooheelee commented 6 years ago

Mutect2 Filters list is currently absent from docs in FilterMutectCalls

@vdauwera I think these need an explanation somewhere, perhaps the tool docs, much like the Read Filters and Variant Annotations. What say you?

screenshot 2017-12-08 13 03 22

vdauwera commented 6 years ago

Those should show up in the M2 tool doc -- or FilterMutectCalls, whatever tool actually takes those args

sooheelee commented 6 years ago

Yes, they show up in FilterMutectCalls. Thanks for that pointer.

sooheelee commented 6 years ago

Just a reminder to omit --javaOptions unless your tool needs it. If it does, remember to kebabify this too to --java-options.

sooheelee commented 6 years ago

@vdauwera We need to also generate docs for Picard metrics. I think we can classify them separately like we do Read Filters and Variant Annotations. E.g. Picard Metrics.

@cmnbroad Does Barclay pull these in as well?

cmnbroad commented 6 years ago

@sooheelee If you add @DocumentedFeature to them, and assign them a groupName, groupSummary and summary, the top-level class javadoc will show up in the doc output. But since they don't have command line arguments, I think it would require additional dev work to make the fields level doc show up like it did in the old picard doc.

magicDGS commented 6 years ago

@cmnbroad - Cannot be done annotating them as @Argument as a temporary solution? Then, a custom template for the metric group can be used...

sooheelee commented 6 years ago

Since I have your attention on the matter, let me test what this looks like.

sooheelee commented 6 years ago

@cmnbroad @magicDGS, I've moved the discussion to https://github.com/broadinstitute/gatk/issues/3947.

sooheelee commented 6 years ago

Hidden tools are coming out of the woodworks and needing classification.

sooheelee commented 6 years ago

If at least one example command for a Spark tool does not utilize Spark, then I think we need this statement in each tool doc (to which it applies):

This tool can be run without explicitly specifying Spark options. That is to say, the given example command without Spark options will run locally. See Tutorial#10060 for an example of how to set up and run a Spark tool on a cloud Spark cluster.

copy-pastable format:

This tool can be run without explicitly specifying Spark options. That is to say, the given example command without Spark options will run locally. See <a href ="https://software.broadinstitute.org/gatk/documentation/article?id=10060">Tutorial#10060</a> for an example of how to set up and run a Spark tool on a cloud Spark cluster.

This tutorial isn't the best, as it only focuses on Google Dataproc, but it's what we have currently to get people started. Let's just say it's a placeholder until we get something better up.

sooheelee commented 6 years ago

@yfarjoun @vdauwera I've refined the tool categorization based on feedback on the tentative categorization. Thank you @yfarjoun for the review and feedback. The refinement is reflected in the new tabbed sheet in the shared Google Spreadsheet:1217Changes_categorization-and-assignments. I've separated out GATK vs Picard tools for each of the categories.

Here is a summary of the changes.

  1. New 11 to Diagnostics and QC: AnalyzeCovariates (from Alignment, Duplicate flagging and BQSR) GatherBQSRReports (from Alignment, Duplicate flagging and BQSR) FlagStat (from Read Data Manipulation) FlagStatSpark (from Read Data Manipulation) GetSampleName (from Read Data Manipulation) Picard BamIndexStats (from Read Data Manipulation) Picard CalculateReadGroupChecksum (from Read Data Manipulation) Picard CheckTerminatorBlock (from Read Data Manipulation) Picard CompareSAMs (from Read Data Manipulation) Picard ValidateSamFile (from Read Data Manipulation) Picard ViewSam (from Read Data Manipulation)

  2. Merge 14 tools remaining in Alignment, Duplicate flagging and BQSR with 37 tools in Read Data Manipulation. Keep latter name.

    51 tools

  3. Move these out of Read Data Manipulation: CompareDuplicatesSpark (to DxQC) ConvertHeaderlessHadoopBamShardToBam (to Other) CreateHadoopBamSplittingIndex (to Other)

  4. Move ValidateVariants into Variant Evaluation. Also: Variant Evaluation and Refinement --> Variant Evaluation VCF Manipulation --> Variant Manipulation

Let us know your thoughts. Thank you.

sooheelee commented 6 years ago

Hey developers working on GATK4 doc updates (tagging tech leads here) @cwhelan @samuelklee @ldgauthier @vruano @yfarjoun @LeeTL1220. Just a reminder (from @vdauwera and @droazen) that we want to get rid of many of the short form arguments and only keep the long form for the more obscure arguments. Meaning we keep both forms for commonly used and understood arguments.

sooheelee commented 6 years ago

Alright, everyone. If you can have your tooldoc updates merged by this Friday, that would be great. That gives me time to take stock of what is missing and update the documentation accordingly over the weekend.

sooheelee commented 6 years ago

The Picard Program Group assignments https://github.com/broadinstitute/picard/pull/1043 PR has been merged.

sooheelee commented 6 years ago

Here are three PRs I will review:

Here are some unmerged open doc PRs (not necessarily comprehensive):

sooheelee commented 6 years ago

I've gone in and worked directly on the branches in #4025, #4068 and #4019, in the doc portions of the code. These are undergoing Travis testing and one of these is awaiting additional (dev) review.

sooheelee commented 6 years ago

Thanks everyone for your contributions to updating the tool docs for yesterday's GATK4.0 release.

In my brief surveys of the status of tooldoc updates, I noticed a number of tools without example commands. I will fill in these missing ones going forward. Thanks again for your efforts.