Per-base input to hotspot2.cpp gets collapsed when col5 = same; output should be too

erynes commented 8 years ago

Per-base 5-column input to hotspot2.cpp gets collapsed for each contiguous stretch of 1bp sites with identical values in column 5 (cleavage tallies). This greatly reduces input file size and I/O run time. The same should be done to the per-base output from hotspot2.cpp. The stretches of identical input values won't always yield stretches of identical output values of the same length (e.g., 123bp of identical col5 input values might yield 13bp of col5=0.654321, then 100bp of col5=0.393132, then 10bp of col5=0.1832901, due to differences at the edges of the sliding 50kb background window), but many stretches of output values will have the same col5 value (P-value) and the same col6 value (FDR). So some sort of checking/filtering should be done at the end, possibly both within hotspot2.cpp (which typically writes 1M 1bp output rows at a time) and immediately following hotspot2.cpp (since often the 1,000,000th row written from one block of 1M will have the same col5 and col6 values as the 1st row written from the next block of 1M).

sjneph commented 8 years ago

This change would affect footprint callers and any other downstream software of per-base cut counts. Since we don't store bases with 0 cuts (the biggest chunk of the pie), I am pretty surprised to hear that we'd see very many adjacent bases with the identical number of cleavage values. If there are large stretches like that, it might be a bug in the generation of per-base values.

sjneph commented 8 years ago

Perhaps you are referring to the output of $COUNTING_EXE or $EXCLUDE_EXE, which would be different from changing the official per-base cut counts. If so, ignore my previous comment.

erynes commented 8 years ago

Thanks Shane. I mean the output of bed_exclude.py, which is the input to hotspot2.cpp. The input to hotspot2.cpp (I'm writing '.cpp' to distinguish it from hotspot2.sh and the overall pipeline) is tallies of cleavages within small windows around each site, +/-100bp by default. If 5 sites in a row have 1 cut each, and if there are no cuts in the 100bp to the L and R of those 5 sites, then those 5 sites will each get tally values of 5. Those 5 sites will appear in one row of input to hotspot2.cpp. Currently they'll generate 5 rows of output, one row per bp. If the output rows have the same P-values (probably they will be, but it'll depend on what's in the larger background window that slides across them), it'd be well worth it to collapse them into a single row.

fwip commented 8 years ago

Huh. I was going to say "The other parts of the pipeline after that might have to be adjusted," but I think they might work fine as-is. The output gets fed to https://github.com/Altius/hotspot2/blob/master/scripts/hsmerge.sh#L50 filter, which should continue to work fine, and then to bedops -m which should work. We later intersect the merged spots with this file again, and that should work fine too.

As a quick test, this might about halve the number of lines output at -F=0.10 - maybe it would have more effect (proportionally) at higher FDR values? (And it probably depends on the parameter set we're going with as well).

erynes commented 8 years ago

Closed via commit e7fca39814717c5d7b602853ef552487ea55540e.

erynes commented 8 years ago

Closed via commit e7fca39814717c5d7b602853ef552487ea55540e.

Altius / hotspot2

Per-base input to hotspot2.cpp gets collapsed when col5 = same; output should be too #17