YeoLab / FLARE

RNA edit detection (SAILOR) and peak calling (FLARE)
Other
6 stars 3 forks source link

subregion_fraction approach infinity due to divison by zero #4

Closed skchronicles closed 5 months ago

skchronicles commented 1 year ago

Hello there,

I hope you are having a great day! Thank you again for creating and maintaining this awesome pipeline. It has been extremely useful for a project I have involving A-I editing.

I just wanted to point out an issue I noticed while running FLARE. I will submit a PR shortly, but here is a brief description of the problem below.

I first noticed the issue while viewing some of the output files of SALIOR/FLARE within IGV. In the image(s) below, I have zoomed into a random region. In this figure, the top/bottom coverage tracks are colored by allele frequency.

For this specific project, we were interested in A-I editing sites, so we are looking for A>G editing on the forward strand. These sites are represented as green and yellow coverage bars at any given location.

image

As you can see here, SAILOR is doing an awesome job calling the A>G sites, and FLARE is also doing an amazing job calling A-I editing regions for the WT samples. But for my KO samples, FLARE did not call this region. After taking a closer look, I noticed something strange in my KO samples.

I tried to grep for this gene within the KO samples' FLARE results, and there were no hits (as expected by the IGV screenshot), so I started looking back further into FLARE's previous output files, and I noticed this had gene inf background_rates.

# KO_S29.bam_all_regions_with_fdr.tsv
=========================================================================
1 chrom                                                             chr12
2 start                                                          98514788
3 end                                                            98514818
4 edit_fraction                                      0.003861003861003861
5 strand                                                                -
6 target_bases                                                        259
7 edited_bases                                                          1
8 num_edited_reads                                                      1
9 total_reads_in_region                                                83
10 fraction_reads_edited                             0.012048192771084338
11 mean_depth                                          59.193548387096776
12 num_substrate_bases                                                  8
13 subregion_coverage                                               36035
14 subregion_conversions                                                3
15 region_coverage                                                    0.0
16 region_conversions                                                 0.0
17 gene_coverage                                                      0.0
18 gene_conversions                                                   0.0
19 subregion_fraction                               5.590339892665474e-05
20 background_rate                                                    inf
21 poisson_p                                                          1.0
22 p_adj                                                              1.0
23 region_id                            TMPO-AS1:exon_0:98514788-98514818
24 overall_region_id                                      TMPO-AS1:exon_0

So, then I looked into the step that created this output file, and I found where the error was occurring.

In this script on line 45, there is an edge-case where division by 0 can occur: image

For my dataset, it happened right here:

# KO_S29.bam_all_regions_with_fdr.tsv
======================================================================
1 chrom                                                           chrX
2 start                                                       85003907
3 end                                                         85003927
4 edit_fraction                                   0.045454545454545456
5 strand                                                             +
6 target_bases                                                      22
7 edited_bases                                                       1
8 num_edited_reads                                                   1
9 total_reads_in_region                                            103
10 fraction_reads_edited                          0.009708737864077669
11 mean_depth                                        93.85714285714286
12 num_substrate_bases                                               5
13 subregion_coverage                                               22
14 subregion_conversions                                             2
15 region_coverage                                                 0.0
16 region_conversions                                              0.0
17 gene_coverage                                                   0.0
18 gene_conversions                                                0.0
19 subregion_fraction                                              inf *
20 background_rate                                                 inf
21 poisson_p                                                       1.0
22 p_adj                                                           1.0
23 region_id                            APOOL:exon_0:85003907-85003927
24 overall_region_id                                      APOOL:exon_0

For this particular region, the math works out to: (2-1)/(22-22). This causes more issues later on in the script here when finding the mean (resulting in inf values) and then the background_rate calculation (resulting in inf values again).

I will submit a PR to fix the issue now. Please feel free to review it whenever you have some free time. Thank you again for your time, and thank you again for creating this awesome pipeline! Please let me know if you have any questions, and have a wonderful weekend!

Best Regards, @skchronicles

skchronicles commented 1 year ago

I just submitted a PR to fix the issue here. Thank you again for your time. I appreciate it!

skchronicles commented 1 year ago

Here is a new screenshot of the KO FLARE A>G results at the TMPO locus: image

It is calling enriched A>G regions correctly now (addressing the inf problem fixed everything, see PR for fix).

acurry-hyde commented 5 months ago

Thank you so much for posting this fix!

ekofman commented 5 months ago

Yes thanks for taking the time to look into this and making the PR @skchronicles , looks good and I will merge it in.