a-slide / pycoMeth

DNA methylation analysis downstream to Nanopolish for Oxford Nanopore DNA sequencing datasets
GNU General Public License v3.0
34 stars 2 forks source link

[Question] For proper usage on differential methylation #31

Closed jolespin closed 2 years ago

jolespin commented 4 years ago

Data Description:

I have nanopolish call-methylation output for each of the samples.

My goal is to identify methylation events that are present in t=1 and t=2 but absent from t=0. In other words, methylation events that are potentially caused by a treatment. I'm also interested in the opposite, methylation events that are exclusive to t=0.

I've already used nanopolish on each of the samples and I'm wondering the best way to utilize pycoMeth for this analysis.

My plan was to use the following pycoMeth pipeline:

  1. CpG_Aggregate
    usage: pycoMeth CpG_Aggregate [-h] -i [NANOPOLISH_FN [NANOPOLISH_FN ...]] -f
                              REF_FASTA_FN [-b OUTPUT_BED_FN]
                              [-t OUTPUT_TSV_FN] [-d MIN_DEPTH] [-s SAMPLE_ID]
                              [-l MIN_LLR] [-v] [-q] [-p]

  2. Interval_Aggregate
    usage: pycoMeth Interval_Aggregate [-h] -i CPG_AGGREGATE_FN -f REF_FASTA_FN
                                   [-a INTERVAL_BED_FN] [-b OUTPUT_BED_FN]
                                   [-t OUTPUT_TSV_FN] [-n INTERVAL_SIZE]
                                   [-m MIN_CPG_PER_INTERVAL] [-s SAMPLE_ID]
                                   [-l MIN_LLR] [-v] [-q] [-p]

  3. Meth_Comp
    usage: pycoMeth Meth_Comp [-h] -i [AGGREGATE_FN_LIST [AGGREGATE_FN_LIST ...]]
                          -f REF_FASTA_FN [-b OUTPUT_BED_FN]
                          [-t OUTPUT_TSV_FN] [-m MAX_MISSING]
                          [-l MIN_DIFF_LLR]
                          [-s [SAMPLE_ID_LIST [SAMPLE_ID_LIST ...]]]
                          [--pvalue_adj_method PVALUE_ADJ_METHOD]
                          [--pvalue_threshold PVALUE_THRESHOLD]
                          [--only_tested_sites] [-v] [-q] [-p]

  4. Comp_Report
    usage: pycoMeth Comp_Report [-h] -i METHCOMP_FN -g GFF3_FN -f REF_FASTA_FN
                            [-o OUTDIR] [-n N_TOP] [-d MAX_TSS_DISTANCE]
                            [--pvalue_threshold PVALUE_THRESHOLD]
                            [--min_diff_llr MIN_DIFF_LLR]
                            [--n_len_bin N_LEN_BIN] [--export_static_plots]
                            [-v] [-q] [-p]

My questions:

  1. For the NANOPOLISH_FN argument in CpG_Aggregate, do you recommend that I group all of t=1 and t=2 together for all specimens if I am interested in methylation events that are persistent? Or do you recommend that I treat t=1 and t=2 separately based on the type of output that results from CpG_Aggregate?

  2. How does SAMPLE_ID work? If nothing is provided for this argument, how are records named in the bed file?

  3. For the sliding window in Interval_Aggregate, is there overlap in the window or is it getting mutually exclusive segments?

  4. Let's say I use Meth_Comp with 2 files: t=0 and t > 0 (i.e. t=1 & t=2). Is there a way to know which methylation events (or intervals) are more prominent in either condition or is it more showing just that there is a difference?

Thank you in advance. Looking forward to finally using your tool! It's exactly what I've been looking for.

a-slide commented 4 years ago

Hi @jolespin,

1.) Ideally I would like to implement a test that takes replicates into account to model the biological/technical variability, but at the moment it is not the case. So I guess you should probably group you replicates together provided that you have a relatively homogeneous coverage. As for the time groups, I would run all 3 groups separately. From the output of Meth_comp you can find out how conditions were grouped based on the labels and med_llr_list columns which are always in the same order.

2.) If nothing is provided it leave the name field in the bedgraph file blank. It's definitely better to pass it.

3.) It is strictly non_overlapping. Except if it doesn't make sense I would recommend to use CpG Island annotations generated by CGI_Finder or providing external CpG annotations.

4.) Yes, I think I more or less explained in question 1

jolespin commented 4 years ago

Thank you for the helpful response!

I will definitely group each timepoint separately (t=0, t=1, t=2).

Just so I understand, if I did this approach I would do the following:

  1. Run CpG_Aggregate for all samples of t=0, then repeat for t=1, and so on for t=2 using the t=x as the -s SAMPLE_ID on each of those runs
  2. For each of the 3 CpG_Aggregate output files, I would then run Interval_Aggregate using the -s SAMPLE_ID identifiers as I did in Step 1
  3. Run Meth_Comp using all 3 files generated from Interval_Aggregate
  4. One I have Meth_Comp output, I would be able to identify methylation events that are in the following conditions: 4a. both t=1 and t=2 but absent in t=0 4b. Exclusively in t=0

For example, if I use t=1, t=2, and t=3 for the -s SAMPLE_ID of each group, would the output be something along the lines of the following (note, I copy/pasted from the docs and changed the labels):

chromosome start  end    n_samples pvalue              adj_pvalue          neg_med pos_med ambiguous_med labels                med_llr_list                raw_llr_list                                       comment                
I          118799 118800 4         0.29978058859571194 0.4279805197751711  3       1       0             ["t=1","t=2"] [2.005,-2.645,-6.64,-5.51]  [[-1.82,5.83],[-0.09,-5.2],[-2.67,-10.61],[-6.7...Non-significant pvalue 
I          141415 141416 4         0.11800901597381579 0.3742064977444858  1       1       2             ["t=0"]
I          151819 151820 4         0.1315327225513005  0.3742064977444858  1       1       2             ["t=0", "t=1"] [-2.98,-1.95,2.93,-1.48]    [[-3.68,-4.4,0.54,-2.28],[-3.45,-1.95,-1.21],[4...Non-significant pvalue 

Where the first line would be condition 4a and the second would be condition 4b? The third line wouldn't be anything (this is ignoring p/adj_pvalue)

Is this the appropriate usage? I'm working on a pipeline for the institute so we will be running this quite a bit. I want to make sure I understand what I'm running.

Thanks again!