SATAY-LL / LaanLab-SATAY-DataAnalysis

This contains codes and workflows for data analysis regarding SATAY experiments.
Apache License 2.0
4 stars 3 forks source link

Transposon numbers mitochondrial DNA, MISTAKE IN BENOIT MATLAB CODE #16

Closed leilaicruz closed 4 years ago

leilaicruz commented 4 years ago

See the discussion led by Wessel Teunisse in the SATAY forum , with Benoit from ETH. https://groups.google.com/forum/#!category-topic/satayusers/bioinformatics/WESlwxJcM2s

Gregory94 commented 4 years ago

I'll consider this bug when translating the Matlab code to Python and double check that I don't incorporate this in our code.

Wteunisse commented 4 years ago

Just to leave the discussion here as well:

I might have found a miscalculation in the elife-23570-code1-v3.m code, as provided with the 2017 eLife paper about SATAY. It seems that the transposon numbers given for mitochondrial genes are incorrect. I believe this is due to a mismatch in labeling the mitochondrial DNA. As can be seen in the added screenshots, the genes.chr and infobam.SequenceDictionary use different labels for mitochondrial DNA, 'mt' and 'Mito' respectively. During the concatenation of all the chromosomes into one big chromosome (adding the total length of all the chromosomes before the current chromosome to the gene coordinates) this leads to a problem in the following line:

aa=find(strcmp(genes.chr,infobam.SequenceDictionary(ii).SequenceName)); %get all the genes from the current chromosome (represented by 'infobam.SequenceDictionary(ii).SequenceName').

genes.coordinates(aa,:)=genes.coordinates(aa,:)+ll;

This works for all chromosomes but the mitochondrial DNA as there are different names used. This results in the coordinates shown in genes.cordinates, till chromosome 16 all chromosomes are concatenated one after each other, but the mitochondrial chromosome starts at position 0 again. This mistake is not made for the tncoordinates, as it uses chromosome numbers instead of text labels.

Why are there still transposons shown in the mitochondrial DNA in tnpergene? This is because it actually counts the transposons on chromosome 1 as those transposons coordinates are now within the gene coordinates of the mitochondrial genes.

The whole thing can easily be solved by renaming one of the labels. I added the following before the main loop:

genes.chr_str=string(genes.chr); % to make genes.chr use same terminology as infobam.SequenceDictionary / don't forget to rename this variable in later use.

for ic = 1:length(genes.chr_str)

if genes.chr_str(ic) == 'mt'

    genes.chr_str(ic) = 'Mito';

end

end

This is only a problem when further analysis is done in matlab, because the wrong variable is not written into the .wig or .bed file. Only the table file produced by matlab does include the wrong information

genes chr_ScreenShot

infobam SequenceDictionary_ScreenShot

genes coordinates_ScreenShot

tncoordinates_ScreenShot