3UTR / DaPars2

Dynamics analysis of Alternative PolyAdenylation from RNA-seq
GNU General Public License v2.0

DaPars 2 for long reads #23

Open ArthurDondi opened 1 year ago

Dear DaPars2 developers,

I work with long-read (LR) scRNA-seq data and wanted to compare 3'UTRs between cell types. I gave DaPars2 a try, but I realised that the breakpoint detection does not work properly for LR data, as stated in #22.

I therefore took DaPars2_Multi_Sample_Multi_Chr.py and adapted it for LR (DaPars2_Two_Samples_Multi_Chr_LR.py).

Briefly, I first find the last covered position in the UTR (coverage of at least min_cov, currently min_cov = 10) and define the search region as the interval between the UTR start and that last covered position (last_cov). I then look for the breakpoint in the same fashion as DaPars. For each position x, I:

The breakpoint is then the position x with the largest squared Short_UTR_abun value.
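The two steps that are fully described above (restricting the search region to the last covered position, then picking the position with the largest squared Short_UTR_abun) can be sketched as follows. This is a minimal illustration, not the code from DaPars2_Two_Samples_Multi_Chr_LR.py: the function names are hypothetical, the per-position Short_UTR_abun values are taken as a precomputed input because their computation is not given here, the coverage array is assumed to be ordered from the UTR start in transcript direction, and including last_cov itself in the search region is an assumption.

```python
from typing import Optional, Sequence


def last_covered_index(coverage: Sequence[float], min_cov: float = 10) -> Optional[int]:
    """Return the index of the last position with coverage >= min_cov, or None."""
    for i in range(len(coverage) - 1, -1, -1):
        if coverage[i] >= min_cov:
            return i
    return None


def pick_breakpoint(
    coverage: Sequence[float],
    short_utr_abun: Sequence[float],
    min_cov: float = 10,
) -> Optional[int]:
    """Pick the index with the largest squared Short_UTR_abun inside the search region.

    short_utr_abun[x] is assumed to be precomputed for every position x
    (hypothetical input; how it is derived is not shown in this sketch).
    """
    last_cov = last_covered_index(coverage, min_cov)
    if last_cov is None:
        return None  # no position reaches min_cov, so there is nothing to search
    # Search region: UTR start (index 0) up to and including last_cov (assumption).
    return max(range(last_cov + 1), key=lambda x: short_utr_abun[x] ** 2)


# Toy example: coverage drops below min_cov after index 4.
coverage = [50, 48, 45, 20, 12, 3, 0]
short_utr_abun = [1.0, -6.0, 2.0, 4.0, 3.0, 9.0, 9.0]
breakpoint_index = pick_breakpoint(coverage, short_utr_abun)
```

In the toy example the large values at indices 5 and 6 are ignored because they lie past last_cov, which is the point of restricting the search region for long reads.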

It can probably be improved, and I'd be very happy to hear any suggestions, but it already gives convincing results (sadly, I can't share all of them yet):

The original DaPars found (see image below):

Gene                              fit_value  Predicted_Proximal_APA  Loci                     Red   Green
ENST00000361866.8|COL6A1|chr21|+  1298.1     46003542                chr21:46003391-46005048  1.00  1.00

And the LR method:

Gene                              fit_value  Predicted_Proximal_APA  Loci                     Red   Green
ENST00000361866.8|COL6A1|chr21|+  1764.4     46004101                chr21:46003391-46005048  0.90  0.71

The LR prediction is much closer to the data.

Currently, DaPars2_Two_Samples_Multi_Chr_LR.py only works for two samples because I wanted to compute a Fisher's exact test P-value, but it can easily be changed into a multi-sample version without the Fisher test.
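For readers unfamiliar with the test, here is a dependency-free sketch of a two-sided Fisher's exact test on a 2x2 table. Which counts go into the table is an assumption on my side for illustration (integer read counts supporting the short and long 3'UTR in each of the two samples); the function name is hypothetical and this is not the code used in the script.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]].

    Assumed layout (illustrative): rows are the two samples, columns are
    read counts supporting the short and the long 3'UTR.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables that are at most as likely as the observed one.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)  # small tolerance for floating-point ties
    )
    return min(1.0, p_value)


# Toy counts: sample 1 has 3 short / 1 long reads, sample 2 has 1 short / 3 long.
p = fisher_exact_two_sided(3, 1, 1, 3)
```

Because the table needs exactly two rows, this is what ties the script to two samples; a multi-sample version would have to drop the test or replace it with a different one.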

I also added a merge_Dapars.py script that merges the results from all chromosomes, performs BH FDR correction, computes the DPUI and keeps the site with the highest DPUI per gene, but that script is only an accessory.
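The BH correction and the per-gene selection can be sketched in plain Python as below. This is an illustration only, not the contents of merge_Dapars.py: the function names and the dictionary keys ("gene", "pvalue", "dpui") are hypothetical, the DPUI is assumed to be already computed for each row, and applying the correction before the per-gene selection is an assumption based on the order in which the steps are listed.

```python
from typing import Dict, List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that the adjusted
    # values stay monotone in the rank.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def best_site_per_gene(rows: List[Dict]) -> List[Dict]:
    """Keep, for each gene, the row with the highest 'dpui' value."""
    best: Dict[str, Dict] = {}
    for row in rows:
        gene = row["gene"]
        if gene not in best or row["dpui"] > best[gene]["dpui"]:
            best[gene] = row
    return list(best.values())


# Toy rows standing in for merged per-chromosome results (made-up values).
rows = [
    {"gene": "GENE_A", "pvalue": 0.01, "dpui": 0.10},
    {"gene": "GENE_A", "pvalue": 0.04, "dpui": 0.35},
    {"gene": "GENE_B", "pvalue": 0.03, "dpui": 0.20},
    {"gene": "GENE_B", "pvalue": 0.005, "dpui": 0.05},
]
for row, fdr in zip(rows, benjamini_hochberg([r["pvalue"] for r in rows])):
    row["fdr"] = fdr
selected = best_site_per_gene(rows)
```

If the DPUI can be negative and the largest change in either direction is wanted, the comparison would need to use the absolute value instead.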

Screenshot 2023-06-30 at 12 09 54