erhard-lab / grandR

R package for nucleotide conversion sequencing data analysis

4SU dropout plots #26

Open ivvukovic opened 9 months ago

ivvukovic commented 9 months ago

Hi! I was wondering if you could help me understand why the 4SU dropout plots look the way they do. I was able to generate 4SU dropout plots with my data and use the correction function. After correction the plots look great, but the plots before correction look particularly weird. Any ideas why? I have included the plots before correction (top) and after correction (bottom). Thank you! Ivana [image] [image]

ivvukovic commented 9 months ago

Hi! After a bit more investigation I think I understand why these plots look the way they do. I would really appreciate reassurance that I am using your pipeline properly.

A number of genes have an NTR of 0 at the different time points (I labeled with 4SU for 0, 1, 2 and 4 hrs; the number of "unlabeled" genes is higher at earlier time points, as expected). The plot in my previous post is from the 4 hr time point, and the "missing" NTR ranks between 0-1000 are because the 0 NTR samples are all sharing the same rank on the x axis. 4SU has an impact on the cells (more evident at 4 hrs than at 1 and 2 hrs of labeling), so I used the 4SU HL spline method to correct for it, and the new 4SU dropout plots look perfect. I compared the NTR values in the original data and after the 4SU correction, and many of the original 0 NTRs now have values, presumably because labeled RNA for the genes in question is underestimated. 4SU dropout correction by factor doesn't actually fix my data.
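The tie effect described above can be reproduced with a small sketch. It assumes that tied values receive the average of their positions (the default of R's `rank()`); which tie rule the grandR plot actually applies is an assumption here, and the NTR values are made up for illustration.

```python
def average_ranks(values):
    """Rank values in ascending order; tied values all get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + 1 + end + 1) / 2  # positions are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


# Hypothetical data: 1000 genes with NTR 0, followed by 4 genes with distinct NTRs.
ntr = [0.0] * 1000 + [0.1, 0.2, 0.4, 0.8]
ranks = average_ranks(ntr)
zero_ranks = set(ranks[:1000])  # every NTR=0 gene ends up at the same x position
```

With this rule all 1000 zero-NTR genes sit at a single rank and the next gene is placed at rank 1001, so the ranks in between appear empty on the x axis.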

I read both of the papers that I could find on grandR and while in principle I understand how 4SU dropout correction works, I can't follow the math.

When would one use the 4SU spline method vs. the 4SU factor method to correct for the impact that 4SU has on cells? Is it all right to use the spline method in my case?

florianerhard commented 9 months ago

Dear Ivana, which grandR version do you use?

> are because the 0 NTR samples are all sharing the same rank on the x axis

yes!

> many of the original 0 NTRs now have values

This is disturbing: if the NTR is the same for two genes before correction, it should also be the same for these genes after correction, i.e. all genes that had NTR=0 before should have NTR=x after. Did you do any additional filtering? A minimal example to test this would be good!

In general, the linear method is what we used and tested in our paper (which should be out in NAR any time now). It means that a specific percentage of labeled RNA dropped out and was not observed (e.g. because transcription is globally inhibited, part of the 4sU-containing RNA is lost, or 4sU-containing reads could not be mapped properly). This would explain the dropout plot, since genes with high NTR appear to be downregulated more than genes with low NTR in comparison to a no4sU control. In this case it makes sense to scale labeled RNA up again. If this is not enough for your data, it likely means that other effects play a role, and I would be very, very careful with the interpretation of your results (and I would already be careful with a linear effect). Best, Florian
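A minimal sketch of the linear effect described above, assuming that a fixed fraction `dropout` of labeled RNA is lost for every gene and that unlabeled RNA is unaffected. The function names are hypothetical and this is not grandR's implementation; it only illustrates the arithmetic, and it ignores the shift that normalization would add to all fold changes.

```python
import math


def observed_ntr(true_ntr, dropout):
    """NTR that would be measured if a fraction `dropout` (< 1) of labeled RNA is lost."""
    new = (1 - dropout) * true_ntr  # labeled RNA that is still observed
    old = 1 - true_ntr              # unlabeled RNA, unaffected in this model
    return new / (new + old)


def corrected_ntr(obs_ntr, dropout):
    """Invert observed_ntr: scale the labeled RNA back up by 1 / (1 - dropout)."""
    return obs_ntr / (1 - dropout + dropout * obs_ntr)


def total_lfc(true_ntr, dropout):
    """log2 fold change of total RNA vs. a no4sU control under the same model."""
    return math.log2(1 - dropout * true_ntr)


# Hypothetical numbers: 30% of labeled RNA is lost.
for p in (0.1, 0.5, 0.9):
    q = observed_ntr(p, 0.3)
    print(p, round(q, 3), round(total_lfc(p, 0.3), 3), round(corrected_ntr(q, 0.3), 3))
```

In this model the fold change becomes more negative as the NTR grows, which is the trend a dropout plot would show. The correction depends only on the observed NTR, so two genes with the same NTR before correction have the same NTR after it, and an NTR of 0 stays 0.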

ivvukovic commented 9 months ago

Dear Florian, Thank you for responding.

I am using grandR version 0.2.2. I did filter the data using the filter function (200 reads, required in of all the samples).

I checked, and the NTR values that were zero before are not corrected to the same value by the spline method, but they are very close to being the "same" value. I found this confusing and scary at first. After reading your documentation I thought I understood what happened: genes with NTR=0 before adjustment may not all end up at NTR=x, because the log FC 4SU/no4SU differs between those genes and is used (among other things?) in the spline method to correct the data. That would explain why the NTR values are slightly different. Am I thinking about this incorrectly?

I think part of the 4sU-containing RNA is lost in some of my samples, maybe due to reverse transcription problems when creating the libraries? 0s occur more often at earlier time points, but they occur at later time points too. E.g. a gene can have NTR values at 1 hr and 4 hr of labeling but an NTR of 0 at 2 hr of labeling.

The interesting part is that the data after the spline correction make more sense (in terms of half-lives and synthesis rates) when thinking about the biology behind my experiment. The fits after the spline correction are much better; I guess the 0s are now corrected? I couldn't figure out the math behind the spline method, but I did lower the degrees of freedom to 10 to try not to over-fit the data. The half-lives in the original data (before 4SU correction) seem to be overestimated.
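Why a too-low NTR would inflate half-lives can be sketched with the standard steady-state relation NTR = 1 - exp(-delta * t), where delta is the degradation rate. This is a textbook single-time-point estimate with hypothetical names and numbers, not the kinetic fit grandR performs across time points.

```python
import math


def half_life_from_ntr(ntr, hours):
    """Half-life (in hours) implied by an NTR after `hours` of labeling, assuming steady state."""
    if not 0 < ntr < 1:
        raise ValueError("NTR must lie strictly between 0 and 1")
    degradation_rate = -math.log(1 - ntr) / hours  # from NTR = 1 - exp(-delta * t)
    return math.log(2) / degradation_rate


# Hypothetical gene: true NTR of 0.5 after 4 h of labeling, i.e. a 4 h half-life.
hl_true = half_life_from_ntr(0.5, 4)

# If 30% of labeled RNA is lost, the observed NTR drops to 0.35 / 0.85.
ntr_observed = (0.7 * 0.5) / (0.7 * 0.5 + 0.5)
hl_observed = half_life_from_ntr(ntr_observed, 4)  # longer than hl_true
```

An NTR of exactly 0 gives no finite half-life under this relation, which is one reason genes with NTR=0 at a time point can distort fits.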

I actually have another dataset, a pulse-chase experiment with only one replicate, in basically the same conditions. It correlates moderately with this dataset. Considering the different methods and the dataset caveats, while it could be better, it is not bad?

Do you have any words of advice on any other steps I could take to carefully evaluate this data?

Thank you for your time! Ivana