BimberLab / nimble

nimble — execute lightweight, flexible alignments on arbitrary reference libraries
MIT License

Suggestions for nimble debug/QC report #65

Open bbimber opened 2 months ago

bbimber commented 2 months ago

1) There is a plot showing the distribution of positions in the BAM. This is useful, but I would see if it is possible to convert the x-axis from genomicPosition (which we need to use in the raw data) and report it by chromosome. In R/ggplot I solved this in the past by manually specifying breaks and labels. For example, if you know that chromosome 1 position 1 = genomicPosition 1, you can label it as such. You can repeat this for each contig. You would need a sequence dictionary providing ordered contigs and lengths, but you probably already have this.
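A minimal sketch of the breaks-and-labels idea, assuming genomicPosition is a 1-based cumulative coordinate over the contigs in dictionary order. The `contig_axis_ticks` name and the `(name, length)` input format are hypothetical, not part of nimble:

```python
def contig_axis_ticks(seq_dict):
    """Return (breaks, labels) for an axis drawn in genomicPosition space.

    seq_dict: ordered list of (contig_name, length) pairs (assumed format).
    Assumes contigs are concatenated in order, so position 1 of the first
    contig is genomicPosition 1.
    """
    breaks, labels, offset = [], [], 0
    for name, length in seq_dict:
        breaks.append(offset + 1)  # genomicPosition of this contig's position 1
        labels.append(name)
        offset += length
    return breaks, labels


# Toy sequence dictionary, for illustration only.
seq_dict = [("chr1", 1000), ("chr2", 500), ("chr3", 250)]
breaks, labels = contig_axis_ticks(seq_dict)
print(breaks)  # [1, 1001, 1501]
print(labels)  # ['chr1', 'chr2', 'chr3']
```

With matplotlib, the result could be applied via `ax.set_xticks(breaks)` and `ax.set_xticklabels(labels)`.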

1b) This will be a little tricky when the actual region is a very specific zone within one contig. However, perhaps you could explicitly test for this by inspecting the min/max genomicPositions. If their contigs are the same, then maybe label the x-axis as "Position on XX"? In this case, reporting the actual contig position would be useful to people. Another thought might be to report the x-axis using actual contig position, but facet the plot horizontally by contig (ideally allowing the python plot code to distribute space unevenly)? I actually wonder if this idea would be easier to implement than what I suggested in 1).
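The min/max test could look roughly like this. It is a sketch under the same assumption that genomicPosition is a 1-based cumulative coordinate; the function names and the fallback label text are hypothetical:

```python
from bisect import bisect_right


def to_contig_position(genomic_pos, seq_dict):
    """Map a cumulative genomicPosition to (contig_name, position_on_contig).

    seq_dict: ordered list of (contig_name, length) pairs (assumed format).
    """
    starts, offset = [], 0
    for _, length in seq_dict:
        starts.append(offset + 1)  # genomicPosition of each contig's first base
        offset += length
    if not 1 <= genomic_pos <= offset:
        raise ValueError(f"genomicPosition {genomic_pos} is outside the dictionary")
    idx = bisect_right(starts, genomic_pos) - 1
    return seq_dict[idx][0], genomic_pos - starts[idx] + 1


def x_axis_label(positions, seq_dict):
    """Use 'Position on XX' when every position falls on a single contig."""
    lo_contig, _ = to_contig_position(min(positions), seq_dict)
    hi_contig, _ = to_contig_position(max(positions), seq_dict)
    if lo_contig == hi_contig:
        return f"Position on {lo_contig}"
    return "Genomic position (by contig)"


seq_dict = [("chr1", 1000), ("chr2", 500)]  # toy dictionary
print(to_contig_position(1001, seq_dict))    # ('chr2', 1)
print(x_axis_label([1200, 1300], seq_dict))  # Position on chr2
```

Because contigs are ordered, checking only the minimum and maximum is enough: if both land on the same contig, everything in between does too.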

2) Related to the plot above, I would facet by R1/R2 if possible. Putting these above one another probably makes the most sense. I think you have them as two lines, but it's not easy to see both now.

3) Just a comment: the "Concordance between the input BAM calls and the corresponding nimble call" is probably useful.

4) In the "Distribution of nimble base pair scores across r1 and r2 mates" plot, do you think it would make sense to just drop unaligned (maybe report that as a number or percentage in the title text), to make the range within aligned more clear?

5) In the "Distribution of nimble base pair scores across r1 and r2 mates" plot, since there is a hard cutoff being applied, would it make sense to add a dotted line at this threshold to denote that?
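For points 4 and 5, one way to prepare the data is sketched below. It assumes unaligned mates are represented as `None` in the score list, which may not match how nimble actually encodes them; the function names are hypothetical:

```python
def split_aligned(scores):
    """Separate aligned scores from unaligned entries.

    Assumption: an unaligned mate is stored as None.
    Returns (aligned_scores, n_unaligned, percent_unaligned).
    """
    aligned = [s for s in scores if s is not None]
    n_unaligned = len(scores) - len(aligned)
    pct = 100.0 * n_unaligned / len(scores) if scores else 0.0
    return aligned, n_unaligned, pct


def score_plot_title(base_title, n_unaligned, pct):
    """Append the unaligned count and percentage to the plot title."""
    return f"{base_title} ({n_unaligned} unaligned dropped, {pct:.1f}%)"


scores = [None, 40, 55, None, 60, 72, 38, 90]  # toy data
aligned, n_unaligned, pct = split_aligned(scores)
print(score_plot_title("nimble base pair scores", n_unaligned, pct))
# nimble base pair scores (2 unaligned dropped, 25.0%)
```

Only `aligned` would then be plotted, and the hard cutoff could be marked with matplotlib's `ax.axvline(threshold, linestyle=":")`, where `threshold` is whatever cutoff the report already applies.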

6) Why is the 'Density of positions reported in the input BAM file for read-pairs that received a nimble alignment' plot faceted for NKG2D, but not the others? Are those 2 different contigs? If so, you might already be doing what I suggested in 1b. I would just: a) label them as such, b) allow the facets to have different widths, c) report positions unique to that contig, and d) still facet R1/R2 vertically.

7) Comment: CCR7 seems to have a real clear issue with R1 alignment, for example. Same on CD27. Same for SELL.

8) Comment: the "Concordance between the input BAM calls and the corresponding nimble call" plots are informative for the LILRs in particular.

9) This is picky, but what's the sort order of the features? Maybe it should be alphabetic?

There are some additional plots I think would be useful:

1) Could you summarize the data in terms of F/F, F/U, F/R, R/R, R/U, etc.? Maybe as a tile plot, where the color is scaled based on the proportion of read pairs in each category?
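A rough sketch of the tabulation behind such a tile plot, assuming each read pair can be reduced to a `(feature, r1_state, r2_state)` tuple with states `F`, `R`, or `U`. That input shape is an assumption, and the categories here are kept ordered as R1/R2 rather than collapsed:

```python
from collections import Counter


def orientation_proportions(pairs):
    """Proportion of read pairs per (feature, 'R1/R2') orientation category.

    pairs: iterable of (feature, r1_state, r2_state); states are assumed to
    be 'F' (forward), 'R' (reverse) or 'U' (unaligned).
    """
    counts, totals = Counter(), Counter()
    for feature, r1, r2 in pairs:
        counts[(feature, f"{r1}/{r2}")] += 1
        totals[feature] += 1
    # Normalize within each feature so each feature's tiles sum to 1.
    return {key: n / totals[key[0]] for key, n in counts.items()}


pairs = [  # toy data
    ("CCR7", "F", "R"),
    ("CCR7", "F", "R"),
    ("CCR7", "U", "R"),
    ("CD27", "F", "F"),
]
for (feature, category), prop in sorted(orientation_proportions(pairs).items()):
    print(feature, category, round(prop, 3))
```

The resulting feature-by-category values could be arranged into a matrix and drawn with matplotlib's `ax.imshow`, with tile color mapped to the proportion.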

2) Can you summarize reads that are filtered and the reason for filtering? I realize nimble doesn't have complete access to all the data yet, but whatever information is available would be useful. For example, we see a lot of cases where features have strong R2 hits, but lots of zeros for R1. Understanding what happened to those R1s could be important.

3) Since the set of features can be large, it might be a nice enhancement to have some kind of bullet list of features names at the top, where each is a link to jump to the start of that feature's section.
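If the report is rendered as HTML (an assumption; the output format is not stated here), the jump list could be generated along these lines. The `feature-` anchor prefix and the function names are hypothetical:

```python
import html
import re


def slugify(name):
    """Lowercase the name and replace runs of non-alphanumerics with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def feature_toc(features):
    """Build an alphabetized <ul> of links to per-feature sections.

    Assumes each feature section carries id="feature-<slug>".
    """
    items = [
        f'<li><a href="#feature-{slugify(name)}">{html.escape(name)}</a></li>'
        for name in sorted(features, key=str.lower)
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


print(feature_toc(["SELL", "CCR7", "CD27"]))
```

Sorting here also gives the alphabetic feature order raised in point 9 above. Two feature names that differ only in punctuation would produce the same slug, so a real implementation would need to handle collisions.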

4) It would probably be useful to print a basic header first, with something like "Nimble QC Report", that also prints the date run, tool version, perhaps the exact command executed, name of reference library, etc.
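A sketch of what assembling that header might look like as plain text. The field names, the `report_header` function, and the example command are placeholders rather than nimble's actual CLI or metadata:

```python
import shlex
from datetime import datetime


def report_header(tool_version, command, reference_library, run_time=None):
    """Build the header block for the top of the report.

    command: the argv list that was executed (e.g. sys.argv in the real tool).
    run_time: defaults to the current time when not supplied.
    """
    run_time = run_time or datetime.now()
    return "\n".join([
        "Nimble QC Report",
        f"Date run: {run_time:%Y-%m-%d %H:%M}",
        f"Tool version: {tool_version}",
        f"Command: {shlex.join(command)}",  # quotes args containing spaces
        f"Reference library: {reference_library}",
    ])


# Placeholder values, for illustration only.
print(report_header(
    tool_version="x.y.z",
    command=["example-tool", "--input", "my file.bam"],
    reference_library="example-library",
    run_time=datetime(2024, 1, 15, 9, 30),
))
```

`shlex.join` requires Python 3.8 or later; it keeps the recorded command copy-pasteable even when arguments contain spaces.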

5) Should sense/antisense be represented in any of these figures? For example, if a given feature is getting a lot of R1 or R2 hits in the opposite direction (which the F/R figure might also report), then we should have a way to know about that.

hextraza commented 2 months ago

We also need to measure the ambiguous hits: specifically, which combinations of ambiguity occur and the count for each combination.
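Counting the combinations could be sketched as follows, assuming each read pair's hits are available as a collection of feature names (an assumed input shape; the feature names in the example are illustrative only):

```python
from collections import Counter


def ambiguity_combinations(hits):
    """Count how often each set of co-matched features occurs.

    hits: iterable where each element lists the features one read pair
    matched. A pair matching more than one distinct feature is ambiguous.
    """
    counts = Counter()
    for features in hits:
        combo = tuple(sorted(set(features)))  # order-independent key
        if len(combo) > 1:
            counts[combo] += 1
    return counts


hits = [  # toy data
    ["LILRA1", "LILRB1"],
    ["LILRB1", "LILRA1"],
    ["CCR7"],
    ["LILRA1", "LILRA2", "LILRB1"],
]
for combo, n in ambiguity_combinations(hits).most_common():
    print("/".join(combo), n)
```

Sorting the feature names before counting means `A/B` and `B/A` are tallied as the same combination.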