Closed sooheelee closed 6 years ago
Could you help us test our changes? how do I know that the documentation efforts worked? please inform picard and gatk efforts...
Another question:
In picard it is customary to have a @author
line, but it seems weird to have that show up in the actual documentation....there seem to be several options:
@author
comment in a separate comment that doesn't get used in the docs (how?)@author
lines (can you do that?)@author
lines in the documentation (seems odd)@author
lines altogether (seems not nice)I'm at the GATK workshop all day and can get to this this evening.
Reference
doesn't seem to be a class that implements CommandLineProgramGroup
in Picard. given the group effort involved I suggest that adding these groups will be done centrally.
I don't think the @author
tags will show up in the doc; they generally don't show up in javadoc, and our doc system doesn't include them in the generated doc.@author
tags are a bit sketchy to begin with, because they're block tags, so the extend all the way to the start of the next tag. But since they're stripped out, putting such a tag at the start of the javadoc can result in everything after it being silently dropped from the output. We found such a case in GATK a while back.
@vdauwera when running spark tools through the gatk-launch-soon-to-be-just-gatk script, you use a --
to separate arguments to the specific tool you're running from arguments that describe what sort of Spark setup you're trying to submit the job to. The most important of these is probably sparkRunner
, which says what type of spark cluster you're using: LOCAL
indicates that you want to simulate a spark cluster on your local machine using multithreading; SPARK
indicates that you want to submit to a configured, dedicated spark cluster, in which case you have to pass the url to the master node, and GCS
indicates that you want to use a cluster managed by Google Cloud Dataproc, in which case you need the name of the cluster. See https://github.com/broadinstitute/gatk#running-gatk4-spark-tools-on-a-spark-cluster for more info. Day-to-day on the SV team we primarily use dataproc, so the example command I was going to use for one tool looks like:
./gatk ParallelCopyGCSDirectoryIntoHDFSSpark \
--input-gcs-path gs://my-bucket/my-data-directory/ \
--output-hdfs-directory hdfs://my-dataproc-spark-cluster-m:8020/my-data \
-- \
--sparkRunner GCS \
--cluster my-dataproc-spark-cluster
I wasn't sure if it was cool to use the GCS/dataproc version or if we'd rather not tie ourselves to GCP in the examples. In this particular case I feel like it might be appropriate since it's a GCS-specific tool, but my question was more about our general strategy.
@samuelklee, thanks for the update and suggestion. I moved CollectAllelicCounts to the Coverage Analysis
category. CollectFragmentCounts isn't on the list currently so I added it to the same. I hope I'm not missing a bunch of other new tools given I missed this one.
@yfarjoun
I just pushed a branch: https://github.com/broadinstitute/picard/tree/yf_documentation_update we can use that for initial testing.
On Tue, Dec 5, 2017 at 1:56 PM, sooheelee notifications@github.com wrote:
@samuelklee https://github.com/samuelklee, thanks for the update and suggestion. I moved CollectAllelicCounts to the Coverage Analysis category. CollectFragmentCounts isn't on the list currently so I added it to the same. I hope I'm not missing a bunch of other new tools given I missed this one.
@yfarjoun https://github.com/yfarjoun
- You are now in charge of deciding whether we should include authorship in code. What the Comms team wants is for authorship to NOT show up in the gatkDoc/javaDoc. If you want to keep them, author lines should be at the bottom and formatted so they do not show up in the documentation. Geraldine is fine with completely removing them if you prefer that. There is a format trick that has javaDoc skip the author line and I can get that to you if you decide to keep some of these and @vdauwera https://github.com/vdauwera would know this or I can get you what I see in other docs. Let either of us know.
- I can help you test your changes. I think the categories are good to go now so I will need to put these into both Picard and GATK HelpConstants.java, with the latter being a placeholder until the new Picard release is incorporated into the next GATK release, with variables that then must be included in each tool doc. I will find an example in a bit. Which tool do you want to test? @cmnbroad https://github.com/cmnbroad can explain the engineering details in engineering lingo if you need more information.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/3853#issuecomment-349404645, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0jIdprE580XBgq1jL-EIV1hFOcDyks5s9ZHAgaJpZM4QitCF .
@vdauwera @sooheelee I just noticed a PR that has a command line usage example embedded in the "usageExample" field of the CommandLineProgramProperties annotation. I'm not sure if this was part of the SOP (I didn't see it mentioned), but that field is not integrated into either doc or command line help output, and should probably not be used. I didn't want to blast anything out without first mentioning it to you though.
Here is an example of an authorship that does not show up in our docs. IT MUST BE AT THE END OF THE JAVADOC PORTION. Remember @cmnbroad said that otherwise it cuts off the document abruptly. That is, anything that follows the * @author Tim Fennell
is omitted from display.
/**
* Attempts to check the sample identity of the sequence/genotype data in the provided file (SAM/BAM or VCF)
* against a set of known genotypes in the supplied genotype file (in either GELI or VCF format).
*
* @author Tim Fennell
*/
Picard docs must reflect Picards command line syntax or their syntax as how it gets exposed thru GATK? I.e. INPUT=XXX
or -I|--INPUT XXX
?
@vruano: my current understanding is that we will use the "legacy" format in picard, and have the website or some other black magic fix it up. (i.e. INPUT=XXX
It would be nice to have a few Picard tools that we feature in BPWs (e.g. MarkDuplicates) show both syntaxes:
I assume MODE gets one dash and not two, but really I don't know. So having such examples scattered throughout the tool chest for most often used tools would be great.
The second example could be prefaced with, e.g.
Picard tools are callable from GATK4. The equivalent command to the above invoked via the GATK launch script would be:
I've added a fourth tab 'categories_summaries' to the spreadsheet.
The categories and their summaries are tabulated. @cmnbroad is starting a PR to implement these categories as program groups and in the HelpConstants.
I've written summaries to my own liking and these are up for discussion.
I think that it would be fine to have both legacy and modern styles in the examples, but I think having gatk in the picard docs will not pass review.
On Tue, Dec 5, 2017 at 10:41 PM, sooheelee notifications@github.com wrote:
It would be nice to have a few Picard tools that we feature in BPWs (e.g. MarkDuplicates) show both syntaxes:
- java -jar picard.jar ValidateSamFile I=reads.bam MODE=SUMMARY
- gatk ValidateSamFile -I reads.bam -MODE=SUMMARY
I assume MODE gets one dash and not two, but really I don't know. So having such examples scattered throughout the tool chest for most often used tools would be great.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/3853#issuecomment-349522774, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0ht41UsGHu_2TgrbKKNJwDepEdMZks5s9gz4gaJpZM4QitCF .
@yfarjoun Do you mean TRAVIS tests or human-review?
@yfarjoun I've built docs from your branch to see what the changes look like. Here's the PicardDoc bundle I've zipped for you: picarddoc_yossi_test.zip. View the rest by loading index.html
into a browser. Here is one of the tool docs where you can see your changes:
[1] Download and switch to branch with changes, e.g.
git branch yf_documentation_update origin/yf_documentation_update
git checkout yf_documentation_update
[2] Generate the docs. GATK and Picard do this differently:
./gradlew clean gatkDoc
or
./gradlew clean picardDoc
[3] Open index.html
from a browser. Again, GATK and Picard have different locations for their respective docs:
build/docs/gatkdoc build/docs/picarddoc
Finally, @yfarjoun, to format example commands, which CrossCheckFingerprints doc is currently lacking, here is an example from HaplotypeCaller.
The last is hard to read so here is its contents again:
* <h3>Usage examples</h3>
*
* <p>These are example commands that show how to run HaplotypeCaller for typical use cases. Have a look at the <a href='https://software.broadinstitute.org/gatk/documentation/article?id=3893'>method documentation</a> for the basic GVCF workflow. </p>
*
* <br />
* <h4>Single-sample GVCF calling (outputs intermediate GVCF)</h4>
* <pre>
* gatk-launch --javaOptions "-Xmx4g" HaplotypeCaller \
* -R reference.fasta \
* -I input.bam \
* -O output.g.vcf \
* -ERC GVCF
* </pre>
*
* <h4>Single-sample GVCF calling with <a href='https://software.broadinstitute.org/gatk/documentation/article?id=9622'>allele-specific annotations</a></h4>
Be sure to test the example commands you put in. @vruano already found some that do not work.
@yfarjoun Since gatkDocs uses the javaDoc portion of the code to generate content, including the example command. It cannot pull the example command from elsewhere as @cmnbroad notes above.
@cmnbroad:
I just noticed a PR that has a command line usage example embedded in the "usageExample" field of the CommandLineProgramProperties annotation.
Is this field something we could take advantage of to differentiate content for the GATK forum versus Picard commandline help?
I guess we need to do some work on the doclet code, abstract out example code and use templates to transform it into the appropriate format/syntax depending what project is generating the documentation.
Alternatively and only if documentation html is well formed (xhtml like) then in theory we could use XSLT transformation style sheets to convert embedded code example encoded with xml/xhtml into the concrete syntax. Most major browsers support XSLT.
EDIT: The XSLT solution won't probably work since even if we try to change the output from html to xhtml, the fact that we are injecting the javadoc's html would probably break the xhtml.
Thanks for taking this on @vruano.
I created a separated issue about this https://github.com/broadinstitute/gatk/issues/3932
If your PR is ready for review, please copy-paste url to PR like @vruano did for Picard MergeVcfs in the spreadsheet AND tag @sooheelee so I can find it easily.
Just to be clear folks, we are using gatk
directly, not ./gatk
in the example commands. And if you can, for those of you yet to make your updates, please use compressed file examples. Those of you who've already put in changes, thank you and Comms can tidy those bits later.
<h3>Usage examples</h3>
<pre>
gatk --javaOptions "-Xmx4g" GenotypeGVCFs \
-R Homo_sapiens_assembly38.fasta
-V combined.g.vcf.gz
-O cohort.vcf.gz
</pre>
<pre>
gatk GenotypeGVCFs \
-R reference.fa
-V combined.g.vcf.gz
-O cohort.vcf.gz
</pre>
@vdauwera I think these need an explanation somewhere, perhaps the tool docs, much like the Read Filters and Variant Annotations. What say you?
Those should show up in the M2 tool doc -- or FilterMutectCalls, whatever tool actually takes those args
Yes, they show up in FilterMutectCalls. Thanks for that pointer.
Just a reminder to omit --javaOptions
unless your tool needs it. If it does, remember to kebabify this too to --java-options
.
@vdauwera We need to also generate docs for Picard metrics. I think we can classify them separately like we do Read Filters
and Variant Annotations
. E.g. Picard Metrics
.
@cmnbroad Does Barclay pull these in as well?
@sooheelee If you add @DocumentedFeature
to them, and assign them a groupName, groupSummary and summary, the top-level class javadoc will show up in the doc output. But since they don't have command line arguments, I think it would require additional dev work to make the fields level doc show up like it did in the old picard doc.
@cmnbroad - Cannot be done annotating them as @Argument
as a temporary solution? Then, a custom template for the metric group can be used...
Since I have your attention on the matter, let me test what this looks like.
@cmnbroad @magicDGS, I've moved the discussion to https://github.com/broadinstitute/gatk/issues/3947.
Hidden tools are coming out of the woodworks and needing classification.
If at least one example command for a Spark tool does not utilize Spark, then I think we need this statement in each tool doc (to which it applies):
This tool can be run without explicitly specifying Spark options. That is to say, the given example command without Spark options will run locally. See Tutorial#10060 for an example of how to set up and run a Spark tool on a cloud Spark cluster.
This tool can be run without explicitly specifying Spark options. That is to say, the given example command without Spark options will run locally. See <a href ="https://software.broadinstitute.org/gatk/documentation/article?id=10060">Tutorial#10060</a> for an example of how to set up and run a Spark tool on a cloud Spark cluster.
This tutorial isn't the best, as it only focuses on Google Dataproc, but it's what we have currently to get people started. Let's just say it's a placeholder until we get something better up.
@yfarjoun @vdauwera I've refined the tool categorization based on feedback on the tentative categorization. Thank you @yfarjoun for the review and feedback. The refinement is reflected in the new tabbed sheet in the shared Google Spreadsheet:1217Changes_categorization-and-assignments
. I've separated out GATK vs Picard tools for each of the categories.
Here is a summary of the changes.
New 11 to Diagnostics and QC
:
AnalyzeCovariates (from Alignment, Duplicate flagging and BQSR)
GatherBQSRReports (from Alignment, Duplicate flagging and BQSR)
FlagStat (from Read Data Manipulation)
FlagStatSpark (from Read Data Manipulation)
GetSampleName (from Read Data Manipulation)
Picard BamIndexStats (from Read Data Manipulation)
Picard CalculateReadGroupChecksum (from Read Data Manipulation)
Picard CheckTerminatorBlock (from Read Data Manipulation)
Picard CompareSAMs (from Read Data Manipulation)
Picard ValidateSamFile (from Read Data Manipulation)
Picard ViewSam (from Read Data Manipulation)
Merge 14 tools remaining in Alignment, Duplicate flagging and BQSR
with 37 tools in Read Data Manipulation
. Keep latter name.
51 tools
Move these out of Read Data Manipulation
:
CompareDuplicatesSpark (to DxQC)
ConvertHeaderlessHadoopBamShardToBam (to Other)
CreateHadoopBamSplittingIndex (to Other)
Move ValidateVariants into Variant Evaluation
. Also:
Variant Evaluation and Refinement
--> Variant Evaluation
VCF Manipulation
--> Variant Manipulation
Let us know your thoughts. Thank you.
Hey developers working on GATK4 doc updates (tagging tech leads here) @cwhelan @samuelklee @ldgauthier @vruano @yfarjoun @LeeTL1220. Just a reminder (from @vdauwera and @droazen) that we want to get rid of many of the short form arguments and only keep the long form for the more obscure arguments. Meaning we keep both forms for commonly used and understood arguments.
Alright, everyone. If you can have your tooldoc updates merged by this Friday, that would be great. That gives me time to take stock of what is missing and update the documentation accordingly over the weekend.
The Picard Program Group assignments https://github.com/broadinstitute/picard/pull/1043 PR has been merged.
Here are three PRs I will review:
Here are some unmerged open doc PRs (not necessarily comprehensive):
I've gone in and worked directly on the branches in #4025, #4068 and #4019, in the doc portions of the code. These are undergoing Travis testing and one of these is awaiting additional (dev) review.
Thanks everyone for your contributions to updating the tool docs for yesterday's GATK4.0 release.
In my brief surveys of the status of tooldoc updates, I noticed a number of tools without example commands. I will fill in these missing ones going forward. Thanks again for your efforts.
Thank you everyone for your contributions towards this documentation effort. Instructions from @vdauwera
to followat this Google doc Favorite tool doc examples from @vdauwera NOW in her SOP doc. Spreadsheet from @sooheeleeto beposted here