SYSU-zhanglab / RNA-m5C

m5C mapping and site calling pipeline.
MIT License
8 stars 7 forks source link

Question about the other scripts in "5_m5C_step-by_step-call_site" #4

Open hengjwj opened 1 month ago

hengjwj commented 1 month ago

Hi,

Thank you for providing this helpful tool. I had a question regarding some scripts in the folder "5_m5C_step-by_step-call_site" which are not mentioned in "m5C-BS-seq-step-by-step-computation-protocol-v1.1.pdf". Where should these scripts be used and under what kind of circumstances?

m5C_caller_temp_filter.py
m5C_evaluate_C_cutoff.py
m5C_site_number.py
m5C_tell_CR.py

Additionally, could you provide a bit more detail on when it would be good to use gene-specific CR cutoff or overall CR cutoff (for samples.txt, which is used as input for m5C_caller_multiple.py)?

Joel

jhfoxliu commented 1 month ago

Hi Joel,

It depends on your library quality. Basically in our m5C papers, we have very complete conversion >99%, so we just use cutoff=3.

If you are not sure about the quality. you can run the "m5C_evaluate_C_cutoff.py" script.

This script will return you two figures. In the figure, we define reads > C-cutoff are noise, and reads <= C-cutoff are signals. Then we can have a curve between signal/(signal+noise) and Gini index/Number of sites called. When there is a complete conversion, you should see small difference between different cutoffs, like the following fiugre. In this case, the difference between C-cutoff=3 and C-cutoff=5 is small.

In another case, if you see big difference between different cutoffs, it means that conversion is not so complete, and ther are a lot of badly conversion reads if you use C-cutoff=5 or more.

image

I would say it is better to redo the experiment in this case, because the conversion background can be very biasedly high in some regions. Alternatively, you can decide to use very strict cutoff like C-cutoff=2 in this case. It is possible to use strict C-cutoff but lower signal ratio (in normal situtation is >0.9) to get more sites.

image

For gene-specific cutoff and global cutoff. I think the conclusion is that the BS-seq protocol used in our papers have big heterogeneity in conversion in each gene, so I don't suggest using global conversion except for some regions without annotaions. However, if you are using other BS-seq protocols, for example UBS-seq which has low bacckground, it is should be fine for both.

Best,

Jianheng

hengjwj commented 6 days ago

Hi Jianheng,

Thank you for getting back to me on this. I tried running m5C_evaluate_C_cutoff.py. Initially, I thought my BS data was very noisy so I tried increasing the -g value just to see if it would run. It would run for a while, but would end with these errors:

[2024-06-11 13:52:42] Evaluating Gini... Traceback (most recent call last): File "/home/s190075/working/scripts/rnaseq_m5c/5_m5C_step-by_step-call_site/m5C_evaluate_C_cutoff.py", line 282, in <module> C_cutoff_analysis(results,SP_cutoffs) File "/home/s190075/working/scripts/rnaseq_m5c/5_m5C_step-by_step-call_site/m5C_evaluate_C_cutoff.py", line 56, in C_cutoff_analysis 'Gini':cal_gini(gene_counter.values()), File "/home/s190075/working/scripts/rnaseq_m5c/5_m5C_step-by_step-call_site/m5C_evaluate_C_cutoff.py", line 21, in cal_gini return (fair_area - area) / fair_area ZeroDivisionError: float division by zero

I tried to see what was happening and i think it might be because of these lines:

for sp in SP_cutoffs:
    site_count = 0
    gene_list = []
    for gene,SP in records:
        if SP >= sp:
            if gene == "NA":
                site_count += 1
            else:
                site_count += 1
                gene_list.append(gene)

When I check my m5C.pileups.formatted.txt, I noticed that there are rows with "NA" for the gene name. Might you have any insight into why this might be so? I followed the PDF using Ensembl GRCh38 v112 annotation files.

1       11329   -       NA      NA      NA      NA      NA      1       1       0       0       1       0       1;0,0,0 2;0,0,0 3;0,0,0 4;0,0,0 5;0,0,0 6;0,0,0 7
;0,0,0 8;0,0,0 9;0,0,0 10;0,0,0        15;0,0,0        20;1,1,1
1       11334   -       NA      NA      NA      NA      NA      1       1       0       1       0       0       1;0,0,0 2;0,0,0 3;0,0,0 4;0,0,0 5;0,0,0 6;0,0,0 7
;0,0,0 8;0,0,0 9;0,0,0 10;0,0,0        15;0,0,0        20;1,1,0
1       11336   -       NA      NA      NA      NA      NA      1       1       0       1       0       0       1;0,0,0 2;0,0,0 3;0,0,0 4;0,0,0 5;0,0,0 6;0,0,0 7
;0,0,0 8;0,0,0 9;0,0,0 10;0,0,0        15;0,0,0        20;1,1,0
1       11337   -       NA      NA      NA      NA      NA      1       1       0       1       0       0       1;0,0,0 2;0,0,0 3;0,0,0 4;0,0,0 5;0,0,0 6;0,0,0 7
;0,0,0 8;0,0,0 9;0,0,0 10;0,0,0        15;0,0,0        20;1,1,0
1       11339   -       NA      NA      NA      NA      NA      1       1       0       0       1       0       1;0,0,0 2;0,0,0 3;0,0,0 4;0,0,0 5;0,0,0 6;0,0,0 7
;0,0,0 8;0,0,0 9;0,0,0 10;0,0,0        15;0,0,0        20;1,1,1

Joel

hengjwj commented 6 days ago

Separately, if not too inconvenient, would it be possible to share the metadata files your team used? It'll be helpful. Maybe I did something wrong in the metadata set up stage. Thank you!