DamienFr / GC_content_in_sliding_window

Calculate GC% and GC deviation for circular genomes

Window/step #2

Closed sarah872 closed 7 years ago

sarah872 commented 7 years ago

Thank you for your super useful tool! My question is regarding the step/window option. Let's say I have a 5069183 nt long genome, and I want to calculate the %GC over 50nt - so from 1-50nt, 51-100nt and so on. What options do I have to specify? What is the difference between window and step?

DamienFr commented 7 years ago

Hello, and first of all, thanks for your interest.

What you want to specify is a step of 50 nt, because you want one value for each group of 50 nt. You will therefore get about 5069183/50 values in total, roughly 101,000.

Now, to get the GC% of a given 50 nt interval (for example, positions 1001-1050), you could simply calculate the percentage of G and C nucleotides among those 50 nucleotides. If you do that, the signal along the genome will be very noisy, and the curve will fluctuate rapidly when you draw it. To smooth out these fluctuations, you can instead calculate, for each 50 nt interval, the GC% of the 1000 nt surrounding it. In my example, the 1001-1050 interval would display the percentage calculated for the nucleotides between positions 526 and 1525. In other words, the step sets how often a value is reported, and the window sets how many nucleotides each value is computed from.
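The arithmetic in the example above can be checked with a few lines of Python. This is only a sketch, not the tool's own code: the helper name `window_bounds` is made up for illustration, and it assumes the window is centred on the step interval, which is what the 526-1525 example implies.

```python
import math

def window_bounds(interval_start, step, window):
    """1-based inclusive bounds of the window centred on one step interval."""
    flank = (window - step) // 2        # extra nucleotides on each side
    first = interval_start - flank
    last = first + window - 1
    return first, last

genome_length = 5069183
step, window = 50, 1000

# Number of step intervals, counting the final partial one (an assumption:
# the tool may handle the last 33 nt differently).
n_values = math.ceil(genome_length / step)

print(n_values)                             # 101384
print(window_bounds(1001, step, window))    # (526, 1525)
```

For the very first intervals the computed start is negative (for 1-50 it is -474), which on a circular genome would mean wrapping around to the end of the sequence.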

It is important NOT to calculate and display over the same interval: if you use window = step = 1000 nt, for example, you get relatively few values that jump from one to the next, and the plot looks like an ugly staircase.

With a window larger than the step, you get a smooth line that rises and falls gradually.
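To see the smoothing effect described above, here is a minimal Python sketch. It is not the tool's implementation: the function names, the centring of the window on each step interval, and the wrap-around at the sequence ends (treating the genome as circular) are all assumptions made for illustration, and the toy sequence is artificial.

```python
def gc_percent(seq):
    """Percentage of G and C in a sequence (case-insensitive)."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def sliding_gc(genome, window, step):
    """GC% of a `window`-nt region centred on each `step`-nt interval,
    wrapping around the ends as for a circular genome."""
    n = len(genome)
    if not step <= window <= n:
        raise ValueError("expected step <= window <= genome length")
    flank = (window - step) // 2
    values = []
    for start in range(0, n, step):         # 0-based start of each interval
        first = (start - flank) % n         # wrap negative starts to the end
        end = first + window
        if end <= n:
            chunk = genome[first:end]
        else:                               # window runs past the last base
            chunk = genome[first:] + genome[:end - n]
        values.append(gc_percent(chunk))
    return values

def max_jump(values):
    """Largest change between two consecutive values."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))

# Toy circular "genome": a GC-only half followed by an AT-only half.
toy = "GC" * 500 + "AT" * 500

stairs = sliding_gc(toy, window=500, step=500)   # window == step
smooth = sliding_gc(toy, window=500, step=50)    # window > step

print(stairs)             # [100.0, 100.0, 0.0, 0.0]
print(max_jump(smooth))   # 10.0
```

With window = step the curve drops from 100% to 0% in a single jump, while with the smaller step consecutive values never differ by more than 10 percentage points.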

I hope that answers your question clearly.

sarah872 commented 7 years ago

Thanks! Hence the "sliding" ;)

fulaibaowang commented 4 years ago

Thanks for the great tool! Just to say, it might be helpful to put this explanation on the main page, because it was not that clear to me exactly how the G and C % is calculated until I saw this thread.

DamienFr commented 4 years ago

Thanks for your input, I'll add this to the README file.