sooheelee commented 6 years ago

Thank you everyone for your contributions towards this documentation effort. Instructions from @vdauwera ~~to follow~~ at this Google doc Favorite tool doc examples from @vdauwera NOW in her SOP doc. Spreadsheet from @sooheelee ~~to be~~ posted here

sooheelee commented 6 years ago

I've tentatively categorized the tools and they are listed in speadsheet format at:

https://docs.google.com/a/broadinstitute.org/spreadsheets/d/19SvP6DHyXewm8Cd47WsM3NUku_czP2rkh4L_6fd-Nac/edit?usp=sharing

[1] GATK4 and Picard tool categories are up for discussion. They are meant to be functional and will be used at https://software.broadinstitute.org/gatk/documentation/tooldocs/current/. First pass by Soo Hee. If you have a better idea, please write to this issue ticket.
[2] We can do better than minimum. At minimum, each tool has a summary description and example command.
- Authorship should not be picked up by gatkDocs (but can remain in javaDoc portion of code so long as masked). If * @author Valentin Ruano-Rubio <valentin@broadinstitute.org> is placed at top of doc, causes javaDoc to not show. Such lines should be at the end of the javaDoc portion. @vdauwera prefers all author annotations be removed.
- Tool commands should use gatk to invoke the launch script, not gatk-launch. Engine team tells me this change will be effective end of this month.
- A number of tools need -Xmx to be defined and this should be reflected in the example command(s). Hopefully, if your tool needs it, you already know it. Otherwise, see https://github.com/broadinstitute/gatk/issues/3137.
[3] **AMENDED** Documentation of Picard tools in the Best Practices are a priority as is categorization of Picard tools. In the forum tool list, Picard tools will be mixed with GATK tools alphabetically, with the PICARD label coming after the tool name.

To view docs, build with ./gradlew clean gatkDoc, then view local index in browser.

sooheelee commented 6 years ago

@vdauwera The tools are categorized and listed in the Google Spreadsheet above. It is waiting for you to assign tech leads to tools for documentation.

One thing that @chandrans brought to my attention is that for BaseRecalibrator one of the parameters (-bqsr) actually causes an error. One can no longer generate the 2nd recalibration table with correction on the fly and instead must use the recalibrated BAM through BaseRecalibrator to generate the 2nd recal table for plotting. This type of information is missing from the tool docs. Furthermore, updates I made to the BQSR slidedeck (that showcase this -bqsr parameter) are based on information from a developer and this information turns out to be incorrect now (perhaps correct at some point in development?). Soooo, I think it may be prudent that those responsible for tool docs test the commands on data.

[4] Make sure the doc content enables Best Practices, e.g. plotting BQSR recalibration, and
[5] Test example commands to ensure they work. If they do not, make corrections and notate the change in application in the documentation.
[6] Remember @vdauwera's plan to change the representation of parameters from camel to KEBAB case. Issue is https://github.com/broadinstitute/gatk/issues/2596. Geraldine would like your help to do this for the tools you are responsible for. Remember to change the integration tests too.

What the gatkDocs look like as of commit of `Mon Nov 20 17:30:46 2017 -0500` where we upgraded htsjdk to 2.13.1:

gatkdoc.zip

Download and load the index.html into a web browser to click through the docs.

sooheelee commented 6 years ago

Geraldine says she is busy catching up this week so I think it best if tech leads assign the tools to members of their teams @droazen @cwhelan @samuelklee @ldgauthier @vruano @yfarjoun @LeeTL1220.

sooheelee commented 6 years ago

If we can agree on tool categorization sooner than later, this gives @cmnbroad time to engineer any changes that need engineering.

samuelklee commented 6 years ago

Any chance we could break off legacy CNV tools into their own group? There are many more of them than there will be in the new pipelines---and many of them are experimental, deprecated, unsupported, or for validation only---that I think it makes sense to hide them and perhaps be less stringent about their documentation requirements. Anything we can do to reduce the support burden before release would be great.

sooheelee commented 6 years ago

I just learned that KEBAB case is different from SNAKE case @cmnbroad. Sorry if KEBAB is offensive @cmnbroad but it is meant to clarify syntax (e.g. https://lodash.com/docs#kebabCase). To be clear, Geraldine wants KEBAB case that uses hyphens, and not SNAKE case, which uses underscores.

So --emitRefConfidence would become --emit-ref-confidence.
So --contamination_fraction_to_filter would become --contamination-fraction-to-filter.

@vruano will describe how he uses constants to manage parameters.

vruano commented 6 years ago

Since we are going to change many of those argument names (camel-back to kebab-case) I think we should take this opportunity to use constants to specify argument names in the code and use them in our test code so further changes in argument names don't break tests.

Take as an example CombineReadCounts. Extract enclosed below.

It might be also beneficial to add public constant for the default values.

public final class CombineReadCounts extends CommandLineProgram {

    public static final String READ_COUNT_FILES_SHORT_NAME = StandardArgumentDefinitions.INPUT_SHORT_NAME;
    public static final String READ_COUNT_FILES_FULL_NAME  = StandardArgumentDefinitions.INPUT_LONG_NAME;
    public static final String READ_COUNT_FILE_LIST_SHORT_NAME = "inputList";
    public static final String READ_COUNT_FILE_LIST_FULL_NAME = READ_COUNT_FILE_LIST_SHORT_NAME;
    public static final String MAX_GROUP_SIZE_SHORT_NAME = "MOF";
    public static final String MAX_GROUP_SIZE_FULL_NAME = "maxOpenFiles";
    public static final int DEFAULT_MAX_GROUP_SIZE = 100;

    @Argument(
            doc =  "Coverage files to combine, they must contain all the targets in the input file (" +
                    TargetArgumentCollection.TARGET_FILE_LONG_NAME + ") and in the same order",
            shortName = READ_COUNT_FILE_LIST_SHORT_NAME,
            fullName  = READ_COUNT_FILE_LIST_FULL_NAME,
            optional = true
    )
    protected File coverageFileList;

    @Argument(
            doc = READ_COUNT_FILES_DOCUMENTATION,
            shortName = READ_COUNT_FILES_SHORT_NAME,
            fullName = READ_COUNT_FILES_FULL_NAME,
            optional = true
    )
    protected List<File> coverageFiles = new ArrayList<>();

    @Argument(
            doc = "Maximum number of files to combine simultaneously.",
            shortName = MAX_GROUP_SIZE_SHORT_NAME,
            fullName = MAX_GROUP_SIZE_FULL_NAME,
            optional = false
    )
    protected int maxMergeSize = DEFAULT_MAX_GROUP_SIZE;

    @ArgumentCollection
    protected TargetArgumentCollection targetArguments = new TargetArgumentCollection(() ->
            composeAndCheckInputReadCountFiles(coverageFiles, coverageFileList).stream().findFirst().orElseGet(null));

    @Argument(
            doc = "Output file",
            shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME,
            fullName  = StandardArgumentDefinitions.OUTPUT_LONG_NAME,
            optional  = false
    )
    protected File outputFile;

sooheelee commented 6 years ago

@samuelklee Because our repo is open-source, even if we hide them from the docs, users end up asking questions on them. So no to hiding any tool that is in the repo.

More importantly, it is good to have documentation for our own selves for these tools that we use internally or legacy tools. Once a developer leaves, this is all that typically remains for the rest of us that may have to answer forum question on the code or further develop the functionality. We get questions on all ranges of tool versions and minor releases.

Even when we deprecate a tool or feature, we give people fair warning that the tool/feature will be deprecated before literally removing it from the codebase.

I understand you would rather folks drive the new car but the old clunky car model is out there on the road and is being driven by researchers right now. When they have questions, they come to the GATK forum for answers and we need to have at the least tool documentation.
If time is limiting, the least we should do for these legacy tools is to state within the doc that (i) folks should use another tool XYZ instead and (ii) the tool/option will be deprecated in the near future.

Besides the BETA label, another option that will soon become available is the Experimental label for internal tools. @cmnbroad is implementing now. It would be great to have additional categories, but @cmnbroad says that this is as much as he has time to do for us and perhaps this is for the best because we don't want to clutter our docs with too many labels. Perhaps @vdauwera can weigh in with thoughts and options.

Not an appropriate label for legacy/deprecated tools.

samuelklee commented 6 years ago

Fair points. I agree that legacy tools/versions that are part of a canonical or relatively widely used pipeline should have good documentation.

However, there are many of the CNV tools that are basically prototypes---they have never been part of a pipeline, have no tutorial materials, and the chances that any external users have actually used them are probably extremely low. The sooner they are deprecated, the less the overall burden on both comms and methods---I don't think comms should need to feel protective of code or tools that developers are willing to scrap wholesale!

I'd like to cordon off or hide such tools so the program group doesn't get too cluttered---if we can do this in a way that doesn't require @cmnbroad to add more categories, that would be great. For example, we will have 5 tools that one might reasonably try to use for segmentation (PerformSegmentation, ModelSegments, PerformAlleleFractionSegmentation, PerformCopyRatioSegmentation, and PerformJointSegmentation). The first two are part of the legacy and new pipelines, respectively, but the last 3 were experimental prototypes. I think it's definitely confusing to have these 3 presented in the program group, and treating them the same as the other tools in terms of documentation is just extra work for everyone.

In any case, I definitely think an additional program group to separate the legacy and new tools is warranted, since many of the updated tools in the new pipeline have very similar names to the legacy tools. If this is OK with everyone, I'll just add a "LegacyCopyNumber" program group, which I don't think should require extra work on anyone else's part.

vdauwera commented 6 years ago

Hiding / deprecating tools and their docs

@samuelklee To add to @sooheelee's answer, if there are any tools that you definitely want gone and already have a replacement for, I would encourage you to kill them off (ie delete from the code) before the 4.0 launch. While we're still in beta we can remove anything at the drop of a hat. Once 4.0 is out, we'll have a deprecation policy (exact details TBD) that will allow us to prune unwanted tools over time, but it will be less trivial. And as Soo Hee said, everything that's in the current code release MUST be documented. We used to hide tools/docs in the past and it caused us more headaches than not.

That being said, as part of that TBD deprecation policy it will probably make sense to make a "Deprecated" program group where tools go to die. If there are tools you plan to kill but don't want to do it before 4.0 is released for whatever reason, you could put them there. Documentation standards can be less stringent for tools in that bucket. To be clear I think the deprecation group name should be generic, ie not named to match any particular use case or functionality. That will help us avoid seeing deprecation buckets proliferate for each variant class/ use case. Does that sound like a reasonable compromise?

vdauwera commented 6 years ago

Guidelines for converting arguments to kebab case

We're not following an external spec doc, so here some guidelines to follow instead. Keep in mind that the main thing we're going for here is readability and consistency across tools, not absolute purity, so feel free to raise discussion on any cases where you feel the guidelines should be relaxed. Some things are more negotiable than others.

Use all lower-case (yes, even for file formats).
Use only dash (-) as separator, no underscores (because lots of newbies struggle to differentiate the two, and underscores take more effort to type than dashes).
Separate words rather than smushing them together, eg use --do-this-thing rather than --dothisthing (this is really important for readability, especially for non-native English speakers).
Avoid cryptic abbreviations and acronyms; eg use --do-this-thing rather than --dtt
If you end up with --really-long-argument-names-that-take-up-half-a-line, please reach out and ask for a consult; maybe we can find a more succinct way of expressing what you need.
If you run into any situation not covered above, please bring it up in this thread.

vdauwera commented 6 years ago

Using constants for argument names

Sounds like a fantastic idea -- I encourage everyone to follow @vruano's lead on this one.

samuelklee commented 6 years ago

OK, great---I'll issue some PRs to delete some of the prototype tools soon and update the spreadsheet accordingly. A non-CNV-specific "Deprecated" program group seems reasonable to me if there is enough demand. If this is the only way to delineate the legacy CNV + ACNV pipeline from the new pipeline, I'm OK with it---but we should probably make the situation clear at any workshops, presentations, etc. between now and release that might focus on the legacy pipeline.

On a different note, are there any conventions for short names that we should follow?

magicDGS commented 6 years ago

I propose to still hide from the command line and docs the example walkers. They are meant only for developers, to show how to use some kind of walkers and have a running tool for integration tests. Having then in the command line will generate software users to run them instead of use them for developmental purposes...

In addition, I think that this is a good moment to also generate a sub-module structure (as I suggested in #3838) to separate artifact for different pipelines/framework bits (e.g., engine, Spark-engine, experimental, example-code, CNV pipeline, general-tools, etc.). For the aim of this issue, this will be useful for setting documentation guidelines in each of the sub-modules: e.g., example-code should be documented for developers, but not for the final user; experimental module should have the @Experimental barclay annotation in every @DocumentedFeature; etc.

cmnbroad commented 6 years ago

A couple of comments:

If we're going to add a generic "Deprecated" program group, it should be added in Picard rather than GATK, so it can be used by both. If its defined in GATK, Picard can't use it, and deprecated Picard tools won't be grouped with deprecated GATK tools (we currently have this situation with the SAM/BAM group and the VCF group - the Picard tools are grouped separately from the GATK ones, but #3824 fixes that).
There is a CommandLineProperties attribute called "omitFromCommandLine". If set to true, the tool isn't displayed in the list of tools on the command line. Currently, all of the example tools; about 5 SV tools; and FixCallSetSampleOrdering have this set to true. I personally don't think we should clog up the command line with the Example tools - they're for developers, not end users, but either way there should probably be a policy on how to employ this property.

ldgauthier commented 6 years ago

To clarify the build process noted above "view local index in browser" means open the index.html file at gatk/build/docs/gatkdoc/

sooheelee commented 6 years ago

We need standard arguments to show up in the documentation for both Picard and GATK. Can @droazen or @cmnbroad please let me know this is happening?

cmnbroad commented 6 years ago

The standard arguments for each tool are listed with that tool's arguments (if you look at the doc for a particular tool, you'll see an "Optional Common Arguments" heading, with the shared, common arguments listed there).

The GATK4 doc system doesn't generate a separate page for these like GATK3 did, and I think doing so would be of questionable value, since there are several classes of tools, each of which has it's own set of "common" arguments (GATK Walker tools, GATK Spark tools, Picard tools, and some GATK "cowboy" tools that do their own thing).

We did discuss an alternative design a while back with @droazen and @vdauwera, but that was never implemented, and was a variant of the current design where the common args are included with each tool.

sooheelee commented 6 years ago

@cmnbroad and @vdauwera Barclay doesn't pull the USAGE_DETAILS portion of Picard tools towards gatkDocs. So Picard documentation is minimal with just a summary description of each tool.

screenshot 2017-11-27 16 43 27

Doesn't seem right to duplicate the same information in a tool doc, once in the asterisked javaDoc portion and once in USAGE_DETAILs for whatever system creates this view, which I am to understand will go to the wayside someday in favor of Picard documentation being offered only through https://software.broadinstitute.org/gatk/.

Seems we should use the asterisked gatkDoc portion for GATK-specific documentation we want, e.g. commands that invoke Picard tools through the gatk launch script and using GATK4 syntax, and pull the rest of the documentation from the USAGE_DETAILS (Picard jar command example).

I've prioritized Picard tools in a second tab of the shared Google spreadsheet towards Picard doc updates. Please let me know how we want to approach Picard tool doc updates @vdauwera.

sooheelee commented 6 years ago

@cmnbroad

We did discuss an alternative design a while back with @droazen and @vdauwera, but that was never implemented, and was a variant of the current design where the common args are included with each tool.

Sounds like we don't want separate documents for standard arguments and this was decided some time ago by @droazen and @vdauwera. So am I correctly hearing that these can be removed from the list?

sooheelee commented 6 years ago

This needs update after categorization changes are finalized and so counts are approximate.

Tally of tools (excludes filters/annotations) :

127 GATK tools
94 Picard tools
221 total tools

Per category:

category (14)	number of tools (221)
Reference	6
Base Calling	7
Diagnostics and Quality Control	49
Contamination	7
Intervals Manipulation	11
Read Data Manipulation	46
Alignment, Duplicate flagging and BQSR	16
Short Variant Discovery	8
Short Variant Filtering	7
Short Variant Manipulation	17
Short Variant Evaluation and Refinement	14
Copy Number Variant Discovery	28
Structural Variant Discovery	4
Other	1

Per developer, assuming ~20 of us, this means ~11 tools per developer. If folks are feeling generous and will claim more, this frees up busy coworkers for other work. If I've forgotten anyone, please add to table. Megan is away until next year.

developer (25)	number of tools updated (claimed)
Yossi	0
Valentin	0
Ted S.	0
Ted B.	0
Takuto	0
Sara	0
Sam L.	0
Steve	0
Sam F.	0
Marton	0
Mehrtash	0
Mark W.	0
Maddi	0
Louis	0
Lee	0
Jose	0
Jonn	0
Laura	0
Mark F.	0
James	0
David R.	0
David B.	0
Chris W.	0
Chris N.	0
Andrey	0

Folks should claim the 11-12 tools they will work on, by putting their name on the spreadsheet next to the tools. Otherwise, we will assign you tools. SOP to follow.

sooheelee commented 6 years ago

Call for any objections/changes to the classification scheme

As we need to start implementing the functional organization. This will change Picard tool organization @yfarjoun.

yfarjoun commented 6 years ago

I'm not sure I agree with the "short variant filtering" and "short variant manipulation" and "Short Variant Evaluation and Refinement" definitions. For example FixVcfHeader has nothing to do with "short variants" and everything to do with VCF. you could put a SV into a VCF and then fix its header with the tool. similarly you could filter a vcf that has "large" variants in it...also, "count variants" has nothing to do with "small variants"...

On Mon, Nov 27, 2017 at 6:03 PM, sooheelee notifications@github.com wrote:

Call for any objections/changes to the classification scheme

As we need to start implementing the functional organization. This will change Picard tool organization @yfarjoun https://github.com/yfarjoun.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/3853#issuecomment-347358639, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0rnAueYp7qOJlNoKlt9318XAD2URks5s6z_cgaJpZM4QitCF .

samuelklee commented 6 years ago

Some of the CNV tools are miscategorized: GetHetCoverage + tools in the "Intervals Manipulation" category. The latter should probably be considered CNV-specific because they either use the target-file format (which is only used in the legacy CNV + ACNV pipeline) or perform a task that is specific to the CNV pipeline and probably not of general interest (PreprocessIntervals).

sooheelee commented 6 years ago

@yfarjoun Geraldine has promised to followup on the categorization discussion. @samuelklee Remember that the Best Practice Workflows and related documentation will guide folks to which tools to use for each workflow. The tool docs section is meant to categorize based on function and is purposefully workflow-agnostic.

samuelklee commented 6 years ago

That's fine, but my point is that these tools will almost certainly be used only in the CNV workflows due to their limited function (or reliance on CNV-specific file formats). Workflow-agnostic categorization is great for general tools that might be shared by several workflows, but I think it's somewhat misleading here.

Essentially, this is the opposite of the issue that @yfarjoun pointed out (where more general tools are assigned to a specific workflow...)

sooheelee commented 6 years ago

What is an A+ tooldoc example?

@vdauwera @ldgauthier, could either or both of you help your fellow developers out with what you consider an A+ tool doc? @mwalker174 and others have asked for this. It would help those new to GATK tool documentation immensely.

In the meanwhile, here are some from me, in order of increasing complexity, and just to start the discussion.

PrintReads gives three flavors although the utility of the first two commands may be questionable.
HaplotypeCaller gives overview, breaks down the steps (with easy-read spacing and formatting), select usage examples for various contexts, caveats and a special note on ploidy. There is even an Additional Notes section with useful information. I might think of merging the last two sections somehow if I had to do it over.
SelectVariants gives many different usage examples, appropriate for such a versatile tool.

@ldgauthier says CalculateGenotypePosteriors is solid and VariantsToTable and, again, SelectVariants are good too. @vdauwera interjects saying SelectVariants has too many alternate command examples but agrees it is solid.

yfarjoun commented 6 years ago

Tool commands should use gatk to invoke the launch script

(my emphasis)

what about Piacrd tools? I don't think it is appropriate for them to have gatk in the commandline...can this be clarified please?

sooheelee commented 6 years ago

@yfarjoun Any Picard tool involved in the Best Practices, as defined by those tools in the WDL scripts at https://github.com/gatk-workflows, should (also) have a command that uses the gatk script to launch the tool. This is what Geraldine conveyed previously and I tentatively made a list of such tools in the second spreadsheet picard-tools-by-interest, under [A]. All other Picard tools can keep the Picard jar example command.

One thing we want, if possible, is for commands that showcase a reference to either showcase GRCh38 or a nondescript reference, e.g. reference.fasta. We should minimize exposure of hg19/b37.

yfarjoun commented 6 years ago

hmmm. I doubt it will go down well with the maintainers of PICARD to have "gatk" written as an example of how to use the tool....I think that this isn't a good solution, and I don't think that it will pass review in the picard reository...I do not feel comfortable opening a PR that does this...

On Thu, Nov 30, 2017 at 9:18 AM, sooheelee notifications@github.com wrote:

@yfarjoun https://github.com/yfarjoun Any Picard tool involved in the Best Practices, as defined by those tools in the WDL scripts at https://github.com/gatk-workflows, should (also) have a command that uses the gatk script to launch the tool. This is what Geraldine conveyed previously and I tentatively made a list of such tools in the second spreadsheet picard-tools-by-interest, under [A]. All other Picard tools can keep the Picard jar example command.

One thing we want, if possible, is for commands that showcase a reference to either showcase GRCh38 or a nondescript reference, e.g. reference.fasta. We should minimize exposure of hg19/b37.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/3853#issuecomment-348200262, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0m6XNNJPPXqWPEBAjVUAKyqoYKZiks5s7rkbgaJpZM4QitCF .

sooheelee commented 6 years ago

I hope @vdauwera's visit at the Methods meeting addressed concerns @yfarjoun.

yfarjoun commented 6 years ago

It did, thanks!

On Thu, Nov 30, 2017 at 7:36 PM, sooheelee notifications@github.com wrote:

I hope @vdauwera https://github.com/vdauwera's visit at the Methods meeting addressed concerns @yfarjoun https://github.com/yfarjoun.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/broadinstitute/gatk/issues/3853#issuecomment-348366430, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0qk88fa-Vh7CrvGvk1wcrWkXCl_Iks5s70omgaJpZM4QitCF .

sooheelee commented 6 years ago

Geraldine posted an SOP to https://docs.google.com/a/broadinstitute.org/document/d/1r1AV4yWP4_vNmniUDR5LojihuggMDI2OnEpfRiYyvdk/edit?usp=sharing.

sooheelee commented 6 years ago

Just a reminder tech leads @droazen @cwhelan @samuelklee @ldgauthier @vruano @yfarjoun @LeeTL1220 @cmnbroad , that the tool doc updates need to be done, reviewed and merged as soon as possible. Please assign your people tools to update if you haven't already and let us know the status of the changes in the STATUS column of the spreadsheet.

Please prioritize tools that are featured in any Best Practice workflow. The forum docs revolve mostly around Best Practice workflows.

@chandrans and I have to then take your new kebab parameters and edit the entirety of the forum documents to represent the new syntax and WE HAVE LESS THAN 9 WORKING DAYS TO DO SO as of today 12/4. @chandrans is leaving for the holidays starting December 15 and will not be back until near the release on January 9. Once she goes on holiday, I take over her forum duties, which is a full-time job. We really need to have these changes now so we can start working on updating forum docs as the updates are merged to master.

Thank you for your work towards these improvements. Again, although I cannot help in changing code, and do not understand the intricacies, I have brought in homemade cheesecake towards fueling your work towards these updates.

samuelklee commented 6 years ago

@sooheelee Some updates from CNV:

I told my team to hold off on doc updates until we can finalize tool deletions. The first round of CNV tool deletions was just merged in #3903.

Another round may be coming, pending discussions with @vdauwera and @LeeTL1220. This could potentially remove the entire old somatic workflow. If so, then the tools that we'd need to update for release would be:

PreprocessIntervals (@MartonKN) AnnotateIntervals CollectAllelicCounts CollectFragmentCounts (@asmirnov239) CreateReadCountPanelOfNormals DenoiseReadCounts ModelSegments CallSegments (updated version) CombineSegmentBreakpoints (@LeeTL1220)

DetermineGermlineContigPloidy (@mbabadi) GermlineCNVCaller (updated version) (@mbabadi)

Except where indicated, I'll be responsible for updates to these tools.

Until a final decision is made about tool deletion, CNV team will hold off on self-assigning their remaining tool quotas.

samuelklee commented 6 years ago

Also, note that 10 tools were deleted in #3903 and 17 could potentially be deleted in the next round, so everybody's quota should go down accordingly.

cwhelan commented 6 years ago

I moved the tool ParallelCopyGCSDirectoryIntoHDFSSpark to the other category since it's just a data movement utility that's not really tied to any pipeline. Hope that's OK.

cwhelan commented 6 years ago

Does anyone have thoughts about or examples of what a command line example for a spark tool should look like? In particular I'm wondering what we should put for the Spark-cluster specific parameters that come after the -- separator, like sparkRunner, etc.

droazen commented 6 years ago

I don't think we can rename the Spark cluster arguments @cwhelan -- most of these "pass through" to the underlying spark-submit/gcloud command.

cwhelan commented 6 years ago

@droazen No, wasn't suggesting renaming the args themselves, I just was wondering what kind of example values we should pass in the usage example. For example, is it OK to pretend your usage example is running on a dataproc cluster, eg -- --sparkRunner GCS --cluster my-dataproc-cluster.

mwalker174 commented 6 years ago

@sooheelee I have a suggestion regarding categories. Can we change "Contamination" to "Metagenomics" and perhaps move the "CalculateContamination" tool to the "Diagnostics and Quality Control" category?

IMO, contamination has a connotation of introducing foreign matter unintentionally. Strictly speaking, PathSeq is not just for detecting sample contaminants but also endogenous organisms in various biological sample types (like stool or saliva). I think users with metagenomic data might overlook this if they are labeled as being for "contamination."

sooheelee commented 6 years ago

samuelklee Thank you for reducing the tool count. It is now 211 (~10.5 per developer assuming 20 able bodies) and could soon be 194 (~10 per developer). I appreciate you keeping us posted and hope to hear back soon about the other tools.

@cwhelan Thanks for moving the tool to the other category. I definitely mis-categorized that one. As for spark parameter examples, yours looks good and complies with GCS requirements.

-- --sparkRunner GCS --cluster my-dataproc-cluster

In the FlagStatSpark tutorial I note that a cluster name:

must be all lowercase, start and end with a letter and contain no spaces nor special characters except for dashes.

P.S. This is the tutorial to note for background information on setting up Spark.

sooheelee commented 6 years ago

@mwalker174 Thank you for your suggestion. I will take it and incorporate it now into the spreadsheet. Metagenomics certainly has a more positive connotation to it than Contamination. If anyone objects, please let us know here (@davidbenjamin?).

sooheelee commented 6 years ago

@yfarjoun I have renamed three categories:

short variant filtering --> variant filtering
short variant manipulation --> VCF manipulation
Short Variant Evaluation and Refinement --> Variant Evaluation and Refinement

I also switched the category ordering of the last two, which the forum will reflect (asterisked switched):

...
Short Variant Discovery
Variant Filtering
Variant Evaluation and Refinement*
VCF Manipulation*
Copy Number Variant Discovery
...

sooheelee commented 6 years ago

uncomfortable categories left unchanged or changed

Picard LiftoverVcf and RemoveNearbyIndels are not best described by VCF Manipulationbut rather could fit under Variant Evaluation and Refinement but I will leave them as is.
MarkIlluminaAdaptors doesn't fit so well under Base Calling. Could be better under Read Data Manipulation or Alignment, Duplicate flagging and BQSR. I will keep as is for historic Illumina category.
CreateSomaticPanelOfNormals is currently under ~~Short Variant Discovery as it supports Mutect2 calling. However, it could be better under~~ Variant Filtering ~~, Variant Evaluation and Refinement or VCF Manipulation as it really just outputs the sites called in at least two samples.~~ Sounds like filtering but also refining a cohort. However, the PoN is meant mostly for artifacts of mapping/sequencing and so its records, although mostly germline variants, are not strictly germline variants. moved 12/7

sooheelee commented 6 years ago

Looks like someone added a new 15th category: RNA-specific Tools. Two tools under this category are:

ASEReadCounter
SplitNCigarReads

@vdauwera fyi is this okay with you?

These were previously under Diagnostics and QC and Read Data Manipulation, respectively.

sooheelee commented 6 years ago

Based on Geraldine's suggestion created new category Coverage Analysis.

Since ASEReadCounter fits under this category, alongside:

ASEReadCounter
CountBases
CountBasesSpark
CountReads
CountReadsSpark
DepthOfCoverage
DiagnoseTargets
GetHetCoverage
GetPileupSummaries
Pileup
PileupSpark

I have deleted the RNA-specific Tools category and moved SplitNCigarReads back to Read Data Manipulation.

vdauwera commented 6 years ago

@cwhelan Good question about the spark arguments. Can you please post a command showing how you typically use spark args, and we can discuss how to make that as generic as possible?

samuelklee commented 6 years ago

@sooheelee We've decided to delete the old tools. I will issue a PR soon. Hopefully this lightens the load on everyone a bit!

Incidentally, I would be fine with adding the tools CollectFragmentCounts and CollectAllelicCounts to the CoverageAnalysis category, as they are performing relatively generic tasks that fall in that category. (with the caveat that the output formats contain column headers that are specific to the new CNV workflows).

davidbenjamin commented 6 years ago

. . .and perhaps move the "CalculateContamination" tool to the "Diagnostics and Quality Control" category?

@sooheelee I like this idea from @mwalker174.

broadinstitute / gatk

All hands on deck: tool doc updates #3853

https://docs.google.com/a/broadinstitute.org/spreadsheets/d/19SvP6DHyXewm8Cd47WsM3NUku_czP2rkh4L_6fd-Nac/edit?usp=sharing

What the gatkDocs look like as of commit of `Mon Nov 20 17:30:46 2017 -0500` where we upgraded htsjdk to 2.13.1:

gatkdoc.zip

Hiding / deprecating tools and their docs

Guidelines for converting arguments to kebab case

Using constants for argument names

This needs update after categorization changes are finalized and so counts are approximate.

Call for any objections/changes to the classification scheme

What is an A+ tooldoc example?

uncomfortable categories left unchanged or changed

broadinstitute / gatk

All hands on deck: tool doc updates #3853

https://docs.google.com/a/broadinstitute.org/spreadsheets/d/19SvP6DHyXewm8Cd47WsM3NUku_czP2rkh4L_6fd-Nac/edit?usp=sharing

What the gatkDocs look like as of commit of Mon Nov 20 17:30:46 2017 -0500 where we upgraded htsjdk to 2.13.1:

gatkdoc.zip

Hiding / deprecating tools and their docs

Guidelines for converting arguments to kebab case

Using constants for argument names

This needs update after categorization changes are finalized and so counts are approximate.

Call for any objections/changes to the classification scheme

What is an A+ tooldoc example?

uncomfortable categories left unchanged or changed

What the gatkDocs look like as of commit of `Mon Nov 20 17:30:46 2017 -0500` where we upgraded htsjdk to 2.13.1: