cgat-developers / cgat-apps

cgat-apps repository
Other
33 stars 14 forks source link

bam2geneprofile - intervals at the edge of contigs #67

Closed IanSudbery closed 4 years ago

IanSudbery commented 4 years ago

geneprofile.GeneCounter.count includes a check that when you calculate the flanks of an interval/gene, you don't accidentally start asking for negative coordinates:

https://github.com/cgat-developers/cgat-apps/blob/master/cgat/BamTools/geneprofile.pyx#L1039..L1044

This is a problem for two reasons:

  1. If the GTF entry starts at 0, then the calculated range will be (0,0), which is a 0 length range and causes the dependent RangeCounters (e.g. RangeCounterBigWig) to error
  2. Upstream and downsteam flanks are treated differently - a range could fall off the 3' end of the contig and that wouldn't be caught, but if it falls off the 5' end it would.

What do people think is the best solution for this? I was thinking of putting in a check for 0 length ranges in the RangeCounters, or wen the RangeCounter is called, which would solve 1, but not 2 above, but would allow invalid GTF entries to pass through silently (maybe a good thing, maybe not. Maybe should pass with a warning?).

Alternatively zero length flanks could be skipped at the point flaks are calculated. This would mean that invalid GTF entries themselves would still cause an error (for better or worst).

All these cases suggest that there should be some sort of flanking interval where possible even when it can't be full length, and only skip when its unavoidable. But if you are normally takein 200 samples out of 1kb, you add a 10bp flank, you are that regions is going to have very different properties from all the others.

None of these solutions deal with flanks going off the end of contigs on the 3' end.

Acribbs commented 4 years ago

Sounds like adding a check when RangeCounter is called would be most sensible. Ideally it should produce a warning. I would avoid entries passing through silently, preferable for the user to be warned first.

Sorry, wasn't sure what you meant by this: "But if you are normally takein 200 samples out of 1kb, you add a 10bp flank, you are that regions is going to have very different properties from all the others."

"None of these solutions deal with flanks going off the end of contigs on the 3' end." - not add a check and then discard if it's larger than contain size?

IanSudbery commented 4 years ago

bam2geneprofile takes resolution samples from extentension basepairs.

For example if you 1kb flanks and a 200 resolution, bam2geneprofile takes samples every 5bps from that 1kb.

If you suddenly have a truncated flank, the resolution stays the same but the size of the range gets smaller. If the interval starts at coordinate 200, then the flank will only be 200bp, so samples will be taken every 1bp for that interval but not for the others.

"not add a check and then discard if it's larger than contain size?", I don't think GeneCounter has anyway of knowing the contig sizes.

IanSudbery commented 4 years ago

Okay, so its going to be neccessary to check for 3' of contig, at least as for BigWigs becauase pyBigWig errors if a interval outside the bounds of the contig is requested on either end.

But this checking has to be done in RangerCounterBigWig becausae GeneCounter has no way to access this info. It could also be done on the various BAM counters, but not the Bed counter, which doesn't know contig lengths.

We are now left with a situation where the 5' end is being checked by Gene/Interval/UTRCounter etc is checking for 5' end, and RangeConter/BigWig/BAM etc is checking for the 3' end. The advantage of checking in the GeneCounters is that they know if the range in question is a primary interval, or a flank, but doesn't know the contig length, while the RangerCounts might know the contig lengths, but don't know if the range is a primary interval or a flank :(