bigdatagenomics / avocado

A Variant Caller, Distributed. Apache 2 licensed.
http://bdgenomics.org/projects/avocado/
Apache License 2.0
71 stars 42 forks source link

Per position allele count reporting utility ( like bam-readcount) #297

Open jpdna opened 6 years ago

jpdna commented 6 years ago

Looking for general comments about utility and approach - not ready for final review yet.

This PR adds functionality to produce a per-position report of counts of reads reporting an A/C/G/T/N or ins/del at each genomic position covered by at least one read in an input alignmentRecordRDD

The functionality is intended to mirror that of the bam-readcount program https://github.com/genome/bam-readcount which is useful to get such per-position stats for use in downstream applications such as testing new variant calling models or evaluating background sequencing noise at a given position.

The approach of this implementation is to perform a alignmentRecordRDD.mapPartitions, in which each partition builds in memory a hash of type mutable.Map[ReferencePosition, AlleleCounts] which is then returned as RDD[ReferencePosition, AlleleCounts], then combined across partitions by reduceByKey adding counts at positions, to resolve partition overlapping reads. This per-partition approach seemed efficient given the size of the hash is limited to the fraction of the genome in a partition ( which can be small if genomic pos sorted), and avoids a flatmap that would produce a large RDD with every base of every read as elements.

The CIGAR/MD reading algorithm was adapted from existing DiscoverVariants, but the context of reporting every covered position and use of the mapPartitions is sufficiently different that I think a new function/object makes sense rather than adding options to DiscoverVariants

1) Just checking that this user facing per-position summary function doesn't already exist in Avocado or elsewhere in ADAM? 2) Is a command line utility in Avocado a reasonable place for this functionality to live?

Todo: 1) further validate, compare against Bam-readcount output, add tests.

AmplabJenkins commented 6 years ago

Can one of the admins verify this patch?

fnothaft commented 6 years ago

+1! This looks like a great start. I'll make more of a pass over this later this week. Thanks for the first cut @jpdna!

fnothaft commented 6 years ago

Jenkins, add to whitelist.

coveralls commented 6 years ago

Coverage Status

Coverage decreased (-3.2%) to 76.705% when pulling d6137af9ec741a2671f78f0e269ad5cb315ddf87 on jpdna:posAlleleStats into e0979dd11a2fd0fd8714cd062b9e7ee835a265e5 on bigdatagenomics:master.

AmplabJenkins commented 6 years ago

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/avocado-prb/232/

Build result: FAILURE

GitHub pull request #297 of commit d6137af9ec741a2671f78f0e269ad5cb315ddf87 automatically merged.[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-staging-worker-02 (ubuntu staging-02 staging) in workspace /home/jenkins/workspace/avocado-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/avocado.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/avocado.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/avocado.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/297/merge^{commit} # timeout=10 > git branch -a -v --no-abbrev --contains 19dfbc267513fdb14751f43d9775194719e3c13f # timeout=10Checking out Revision 19dfbc267513fdb14751f43d9775194719e3c13f (origin/pr/297/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 19dfbc267513fdb14751f43d9775194719e3c13fFirst time build. Skipping changelog.Triggering avocado-prb » 2.6.0,2.11,2.0.0,centosTriggering avocado-prb » 2.3.0,2.10,2.0.0,centosTriggering avocado-prb » 2.3.0,2.11,2.0.0,centosTriggering avocado-prb » 2.6.0,2.10,2.0.0,centosavocado-prb » 2.6.0,2.11,2.0.0,centos completed with result FAILUREavocado-prb » 2.3.0,2.10,2.0.0,centos completed with result FAILUREavocado-prb » 2.3.0,2.11,2.0.0,centos completed with result FAILUREavocado-prb » 2.6.0,2.10,2.0.0,centos completed with result FAILURE Test FAILed.