Ensembl / WiggleTools

Basic operations on the space of numerical functions defined on the genome using lazy evaluators for flexibility and efficiency
Apache License 2.0
144 stars 25 forks source link

Introduce the shiftPos command, shifting read counts by an integer value #34

Closed joshuak94 closed 5 years ago

joshuak94 commented 5 years ago

Also addresses some small misc. corrections.

joshuak94 commented 5 years ago

@dzerbino Before you accept this pull request, I think there's an issue with shifting the positions past the length of the chromosome... I'm not sure how I should check for this quite yet.

dzerbino commented 5 years ago

The trick is to only allow shifts towards smaller coordinates. Doing so you can easily suppress negative coords. Could you please update the code correspondingly?

joshuak94 commented 5 years ago

I'm not sure what you mean by "towards smaller coordinates". The issue is not only negative coordinates, as theoretically if a read is at the very end of a chromosome, shifting it + 200 could give it a coordinate value which is higher than the chromosome length.

dzerbino commented 5 years ago

Right, but could you constrain the system such that you are only allowed to shift negatively?

On Wed, 8 May 2019, 04:50 Joshua Kim, notifications@github.com wrote:

I'm not sure what you mean by "towards smaller coordinates". The issue is not only negative coordinates, as theoretically if a read is at the very end of a chromosome, shifting it + 200 could give it a coordinate value which is higher than the chromosome length.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Ensembl/WiggleTools/pull/34#issuecomment-490404813, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAIDV7AKA7FCUSKW2SKHH3PUKH57ANCNFSM4HK66VCQ .

joshuak94 commented 5 years ago

So only allow something like wiggletools shiftPos -30 x.wig? Then there could be two potential issues:

  1. Read start and end are at 0 and 36, so then shifted it becomes -30 and 6.
  2. Read starts at the beginning of some chromosome other than chromosome 1. Then the shifted read is now technically on a different chromosome.

For issue 1), would the solution be to change the values from (-30, 6) to (0, 6)? Or to just prevent the shift entirely, so the read stays at (0, 36).

For issue 2), any solution requires wiggletools to know the chromosome lengths already...

I'm in the process of parsing through the MaSC.pl code to understand how they deal with these issues.

dzerbino commented 5 years ago

Regarding 1) I would argue that the shifted block should be clipped at 0, hence should become (0, 6)

Regarding point 2) I do not see why blocks at the beginning of other chromosomes would be shifted onto another chromosome. The genome is not a continuous sequence. They would get dropped or clipped as if they were on chrom 1.

joshuak94 commented 5 years ago

Yeah, point 2 was I think me misunderstanding how the bam files are written.

joshuak94 commented 5 years ago

So, quick question about something:

If the start, end coordinates are (10, 30), and I want to shift by -15, expected behavior is now (0, 15).

If I want to instead shift by -50, this entire line should just be removed. I don't want a line that says (0, 0). Here's what I have so far:

void ShiftPosIteratorPop(WiggleIterator * wi) {
    ScaleWiggleIteratorData * data = (ScaleWiggleIteratorData *) wi->data;
    WiggleIterator * iter = data->iter;
    if (!iter->done) {
        if (data->scalar < iter->finish) {
            wi->chrom = iter->chrom;
            wi->start = (data->scalar >= iter->start ? 0 : iter->start - data->scalar);
            wi->finish = iter->finish - data->scalar;
            wi->value = iter->value;
        }
        pop(iter);
    } else {
        wi->done = true;
    }
}

Basically, I'm expecting that if the end coordinate is larger than the shift value, then it assigns the start coordinate to either 0 or iter->start - data->scalar and the end to just iter->finish - data->scalar, and the rest is kept. If, however, the end coordinate is <= the shift value, I skip all of it and just pop, which I was hoping would just skip the line. However, it doesn't, and I get a result like:

fixedStep chrom= start=0 step=1
fixedStep chrom=chr1 start=0 step=1
4.000000
5.000000
6.000000
7.000000
8.000000
9.000000

Any tips? I guess I'm not quite understanding pop(iter) and what it does.

dzerbino commented 5 years ago

Pop(iter) goes to the next block in the upstream iterator, which you would normally do once after reading the current block.

What you want to do is pop it as many times as necessary to get a valid block, process that one, then pop one last time to update iter.

In short it should look like:

while(coords not good) pop(iter)

update wi coords and value pop iter one last time

joshuak94 commented 5 years ago

Ah okay, thanks!

Also, you said we should only allow shifting toward smaller coordinates. Is there a reason we don't want to allow shifting in the other direction?

dzerbino commented 5 years ago

While it's OK to shift coordinates arbitrarily in Wiggletools, the Kent tools complain if you go beyond the chromosome length. Since WT does not know what the chrom lengths are it's safer to go downwards only.

joshuak94 commented 5 years ago

Makes sense!