SATAY-LL / LaanLab-SATAY-DataAnalysis

This contains codes and workflows for data analysis regarding SATAY experiments.
Apache License 2.0
4 stars 3 forks source link

Another bug in Benoit Matlab code #17

Closed Gregory94 closed 3 years ago

Gregory94 commented 4 years ago

There might be another mistake in the Matlab code from Benoit. This results in that the number of read(s) of the last insertion location of all chromosomes are ignored and that the number of read(s) for the first insertion of some chromosomes are wrong. Since this only regards the first and last insertions of each chromosome (which are typically located somewhere in a non-coding region at the very beginning or end of the chromosome), this shouldn't be a big problem for us.

In short, the problem has to do with the second for-loop in the code (for ii=2:size(start2,1)) and the variables: tncoordinates (stores the location of the insertions), readnumb (stores the number of reads for each insertion) and ll (which is a counter that should keep tncoordindates and readnumb in sync with each other, which apparently it doesn't always do).

More details will follow (and maybe a question on the satay forum as well (?)) ...

Gregory94 commented 4 years ago

Explanation of the bug

This is a short explanation of the piece of code where the error occurs. The following variables are involved:

A simplified pseudo version of the code where the error occurs looks like this:

ll = 0
for all chromosomes:
    tncoordinates(ll) = [chromosome, position first read, orientation first read]
    mm = 0
    for all reads in current chromosome:
        if current read is within two basepairs from the previous read in tncoordinates and has the same orientation:
            mm = mm + 1
        else:
            tncoordinates(ll) = [chromosome, position current read, orientation current read]
            mm = 0
            ll = ll + 1
        readnumb(ll) = mm + 1

The problem lies with that readnumb is updated for every loop over the reads, but tncoordinates is only updated when the current read is not within two basepairs of the previous read and/or does not have the same orientation. This means that readnumb is always one step ahead of tncoordinates, which is fine as long as the loop is running. But, for the last read the readnumb variable is updated as usual, but the tncoordinates is not updated for the current chromosome as the code does not reach the else-statement for this last insertion. This means that the last insertion together with all its reads is not assigned to the current chromosome, but to the first read of the following chromosome. In short, the last read of every chromosome is ignored and first read of some of the chromosomes might be wrong. Also, the first read of the first chromosome is ignored when there is only a single read for this insertion.

In summary, the first and last insertion of all chromosomes are potentially wrong or missing. This does not have to be a big issue since the first and last insertions are typically not within a gene, but it is good to be aware that the start and end insertions of the chromosomes might be less reliable when using this Matlab code.

For a more detailed explanation and some comparisons of the matlab code and other software (my python code and IGV), check this .doc file on the drive ("M:\tnw\bn\ll\Shared\Gregory\Notes\Error_Benoit_MatlabCode.docx")

leilaicruz commented 4 years ago

great @Gregory94 ! , I think it is good to raise this point in the Forum as well , for others to know, and perhaps Benoit can fix it ...

Gregory94 commented 4 years ago

See discussion on SATAY forum here.

leilaicruz commented 3 years ago

If this issue is done , can you close it?