ga4gh / quality-control-wgs

Home for the GA4GH Quality Control of Whole Genome Sequencing metrics and reference implementations
https://www.ga4gh.org/product/wgs-quality-control-standards/
Apache License 2.0

Feature/argo benchmark results #19

Closed lindaxiang closed 7 months ago

lindaxiang commented 1 year ago

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

justinjj24 commented 11 months ago

@lindaxiang Our team member @mhebrard presented at last month's (24/10) GA4GH Quality Control of WGS meeting, which included a comparison of the QC metrics generated by NPM and ARGO for the 1KG samples (slides 6 and 7).

If you have a chance, kindly look at the correlation plots in the presentation shared here (https://drive.google.com/drive/folders/1Q280zSQfqQRo1q70Amq9PEOLkGVA9z1O) and provide your feedback. We would like to understand how the ARGO pipeline computes these metrics (tools, parameters, and data used, e.g. whether only autosomes were included and blacklisted regions were filtered out), for further discussion and to see whether it could be functionally equivalent to the NPM metrics.

justinjj24 commented 10 months ago

NPM implementation details... https://c-big.github.io/NPM-sample-qc/metrics.html

lindaxiang commented 10 months ago

Thank you @justinjj24 !

lindaxiang commented 10 months ago

I looked into the comparison of results between NPM and ARGO. The three metrics that show discrepancies are:

In the ARGO pipeline, we retrieved these metrics from Picard CollectWgsMetrics (Picard version 3.0.0). The following is an example of the command we used:

CollectWgsMetrics \
  --INPUT NA20126.bam \
  --OUTPUT NA20126.CollectWgsMetrics.coverage_metrics \
  --INTERVALS autosomes_non_gap_regions.interval_list \
  --VALIDATION_STRINGENCY LENIENT \
  --REFERENCE_SEQUENCE GRCh38_hla_decoy_ebv.fa \
  --MINIMUM_MAPPING_QUALITY 20 \
  --MINIMUM_BASE_QUALITY 20 \
  --COVERAGE_CAP 250 \
  --LOCUS_ACCUMULATION_CAP 100000 \
  --STOP_AFTER -1 \
  --INCLUDE_BQ_HISTOGRAM false \
  --COUNT_UNPAIRED false \
  --SAMPLE_SIZE 10000 \
  --ALLELE_FRACTION 0.001 \
  --ALLELE_FRACTION 0.005 \
  --ALLELE_FRACTION 0.01 \
  --ALLELE_FRACTION 0.02 \
  --ALLELE_FRACTION 0.05 \
  --ALLELE_FRACTION 0.1 \
  --ALLELE_FRACTION 0.2 \
  --ALLELE_FRACTION 0.3 \
  --ALLELE_FRACTION 0.5 \
  --USE_FAST_ALGORITHM false \
  --READ_LENGTH 150 \
  --VERBOSITY INFO \
  --QUIET false \
  --COMPRESSION_LEVEL 5 \
  --MAX_RECORDS_IN_RAM 500000 \
  --CREATE_INDEX false \
  --CREATE_MD5_FILE false \
  --help false \
  --version false \
  --showHidden false \
  --USE_JDK_DEFLATER false \
  --USE_JDK_INFLATER false
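For the comparison, the relevant coverage fields can be pulled straight out of that output file. A minimal extraction sketch, assuming the standard Picard metrics layout (a "## METRICS CLASS" marker followed by one header row and one value row); the field names below are the stock Picard ones:

# Grab the metrics header/value rows that follow the METRICS CLASS marker,
# transpose them into name/value pairs, and keep the coverage fields.
grep -A 2 '^## METRICS CLASS' NA20126.CollectWgsMetrics.coverage_metrics \
  | tail -n 2 \
  | datamash transpose \
  | grep -E '^(MEAN|SD|MEDIAN|MAD)_COVERAGE'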

Based on Picard/CollectWgsMetrics script, the assessment was counting

Note: the BED file behind “autosomes_non_gap_regions.interval_list” was downloaded from NPM-sample-qc.
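In case it helps with reproducing the setup: if the NPM file is the BED, the conversion to an interval list could have been done along these lines (a sketch; BedToIntervalList and its arguments are standard Picard, but the .bed and .dict file names here are assumptions):

# Convert the NPM non-gap autosomes BED into a Picard interval_list using the
# sequence dictionary of the ARGO reference (file names assumed).
java -jar picard.jar BedToIntervalList \
  --INPUT autosomes_non_gap_regions.bed \
  --SEQUENCE_DICTIONARY GRCh38_hla_decoy_ebv.dict \
  --OUTPUT autosomes_non_gap_regions.interval_list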

Please let me know what action items I should take to facilitate this effort.

justinjj24 commented 10 months ago
| Implementation not aligned | NPM (mosdepth & datamash) | ARGO (picard-CollectWgsMetrics) |
| --- | --- | --- |
| --COUNT_UNPAIRED | True | False |
| --MINIMUM_BASE_QUALITY 20 | False | True |
| --ALLELE_FRACTION [0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.5]: used to calculate theoretical sensitivity (HET SNP), not relevant for the coverage calculation | False | True |
| 1,000 bp window size: coverage is computed on 1,000 bp windows and averaged over the region of interest | True | False |
| datamash madraw (mad_autosomes_coverage) | True | False |

NPM implementation details... https://c-big.github.io/NPM-sample-qc/metrics.html

Picard-CollectWgsMetrics details...
https://gatk.broadinstitute.org/hc/en-us/articles/360037226132-CollectWgsMetrics-Picard-
https://broadinstitute.github.io/picard/picard-metric-definitions.html#CollectWgsMetrics.WgsMetrics
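For reference, a rough sketch of the NPM-style computation as I read it from the table above (the mosdepth and datamash options are real, but the exact parameters, file names and region restriction shown here are assumptions; the authoritative description is the NPM-sample-qc page linked above):

# Mean coverage per 1,000 bp window, mapping quality >= 20, no per-base output (assumed settings).
mosdepth --no-per-base --by 1000 --mapq 20 --threads 4 NA20126 NA20126.bam

# Median and raw MAD of the window means (column 4 of the regions output is the
# per-window mean depth); restricting to autosomal non-gap windows is assumed upstream.
zcat NA20126.regions.bed.gz \
  | datamash median 4 madraw 4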

mhebrard commented 10 months ago

@lindaxiang - As we discussed, here are a few action items for this PR:

You can find the documentation here and an example of a dockstore.yml to include in your pipeline there. You might also need some work to ensure the pipeline can run on Nextflow Tower (with the default Nextflow profile).
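For context on the dockstore.yml item, a minimal sketch of what such a file could look like for a Nextflow pipeline (field names follow Dockstore's .dockstore.yml 1.2 schema; the workflow name is a placeholder, not taken from the linked example):

# Write a minimal .dockstore.yml for a Nextflow workflow (placeholder name).
cat > .dockstore.yml <<'EOF'
version: 1.2
workflows:
  - name: argo-wgs-qc                # placeholder workflow name
    subclass: NFL                    # Nextflow
    primaryDescriptorPath: /nextflow.config
EOF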