Open jpdna opened 6 years ago
Can one of the admins verify this patch?
+1! This looks like a great start. I'll make more of a pass over this later this week. Thanks for the first cut @jpdna!
Jenkins, add to whitelist.
Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/avocado-prb/232/
Looking for general comments about utility and approach - not ready for final review yet.
This PR adds functionality to produce a per-position report of counts of reads reporting an A/C/G/T/N or ins/del at each genomic position covered by at least one read in an input
alignmentRecordRDD
The functionality is intended to mirror that of the
bam-readcount
program https://github.com/genome/bam-readcount which is useful to get such per-position stats for use in downstream applications such as testing new variant calling models or evaluating background sequencing noise at a given position.The approach of this implementation is to perform a
alignmentRecordRDD.mapPartitions
, in which each partition builds in memory a hash of typemutable.Map[ReferencePosition, AlleleCounts]
which is then returned as RDD[ReferencePosition, AlleleCounts], then combined across partitions by reduceByKey adding counts at positions, to resolve partition overlapping reads. This per-partition approach seemed efficient given the size of the hash is limited to the fraction of the genome in a partition ( which can be small if genomic pos sorted), and avoids a flatmap that would produce a large RDD with every base of every read as elements.The CIGAR/MD reading algorithm was adapted from existing
DiscoverVariants
, but the context of reporting every covered position and use of the mapPartitions is sufficiently different that I think a new function/object makes sense rather than adding options toDiscoverVariants
1) Just checking that this user facing per-position summary function doesn't already exist in Avocado or elsewhere in ADAM? 2) Is a command line utility in Avocado a reasonable place for this functionality to live?
Todo: 1) further validate, compare against Bam-readcount output, add tests.