Computomics / MethylScore

Identification of differentially methylated regions between multiple epigenomes from BS-treated read mappings via methylated region calling
Other
13 stars 3 forks source link

Unable to find simulated DMRs #4

Open mkpython3 opened 1 year ago

mkpython3 commented 1 year ago

Hi, I wanted to use MethylScore in my analysis for finding DMRs between ~20 different Genotypes. Therefore I first wanted to test if MethylScore is able to find simulated DMRs. For this I used metilenes background simulation script to simulate 20 samples of A. thaliana chromosome 1 drawing methylation levels from a beta distribution with parameters a=10, b=1 and then implementing DMRs with a custom script changing the methylation levels for 5 consecutive CG sites to values drawn from a beta distribution with inverted parameters a=1, b=10 in 50% of the samples. With this method I implemented 100 DMRs in the CG context. This is of course a very simple and non perfect way of simulating DMRs but this should suffice for the intend of just getting the pipeline up and running.

However I was not able to find any DMRs with MethylScore. I have tested to call DMRs with HOME alternatively, which found almost all of them with correct boundaries.

Interestingly, if I simulate the samples with a low methylation background (using beta distribution parameters a=1, b=10) and DMRs with a high level of methylation in 50% of the samples (beta distribution parameters a=10, b=1) the last step of the pipeline (DMRS:MERGE_DMRS) is skipped without visible error:

[1e/74d241] process > CONSENSUS:SPLIT_BEDGRAPH (S19:1)         [100%] 20 of 20 ✔
[36/004412] process > CONSENSUS:MATRIX:BUILD (1)               [100%] 1 of 1 ✔
[cc/1f5a2d] process > SAMPLESHEET:GENERATE (genome_matrix.tsv) [100%] 1 of 1 ✔
[85/f57be8] process > MRS:CALL_MRS (S12:genome_matrix.tsv)     [100%] 20 of 20 ✔
[17/6383bc] process > MRS:MR_STATISTICS (S12)                  [100%] 20 of 20 ✔
[2d/3f172c] process > MRS:SPLIT_MRS (all:batchsize:500)        [100%] 1 of 1 ✔
[3d/b2d032] process > DMRS:INDEX (genome_matrix.tsv)           [100%] 1 of 1 ✔
[d5/8a609b] process > DMRS:CALL_DMRS (CG:all.MRbatch.0)        [100%] 2 of 2 ✔
[-        ] process > DMRS:MERGE_DMRS                          -
WARN: Graphviz is required to render the execution DAG in the given format -- See http://www.graphviz.org for more info.
Completed at: 27-Apr-2023 11:05:26
Duration    : 2m 43s
CPU hours   : 0.4
Succeeded   : 66

I am using the following command to run MethylScore: nextflow run Computomics/MethylScore --BEDGRAPH --SAMPLE_SHEET=/scratch/mariusk/methylscore_test/samplesheet.tsv --GENOME=/scratch/mariusk/methylscore_test/Arabidopsis_thaliana.TAIR10.dna.chromosome.1.fa -profile docker --DMR_CONTEXTS CG

Im sadly unable to attach the input files directly to this issue due to size limitations, however I have uploaded the first case of a high methylation background here: https://drive.google.com/file/d/1_LKNfjrVI5ylzXdBqGid6QrVquSxY_WT/view?usp=sharing Introduced DMRs in this dataset are at the following positions:

393835  393969
418517  418788
2003984 2004147
2109083 2109208
2569172 2569379
2967632 2967668
3315692 3315884
3503494 3503627
3583380 3583597
4078186 4078481
4227003 4227188
4334846 4334917
5743592 5743667
6240901 6240984
6548515 6548647
6641201 6641376
7471468 7471750
7547517 7548102
7760154 7760226
8035196 8035398
8402500 8402530
8685467 8685561
8790599 8790671
9819252 9819402
9852252 9852383
9916057 9916196
10070847    10070893
10219783    10219954
10224750    10224846
10422856    10423016
10462585    10462610
10526601    10526668
10814851    10814964
10899133    10899173
11167080    11167416
11194751    11194825
11596910    11597130
11836754    11836926
12313861    12313946
12818518    12818602
12984426    12984699
13250013    13250216
13528312    13528351
13947334    13947477
14868353    14868410
15638549    15638724
16329536    16329571
16833679    16833827
16866955    16867104
16903114    16903307
16924986    16925045
17052473    17052576
17256103    17256127
18976429    18976550
19014836    19014913
19088534    19088658
19286339    19287092
19354498    19354582
19475855    19475923
19505494    19505638
20099991    20100058
20186034    20186241
20926924    20927016
21167056    21167244
21180101    21180168
21311272    21311441
21531193    21531285
22172847    22172894
22180002    22180550
22408689    22409039
22938037    22938100
22947486    22947735
23668779    23668984
24130230    24130331
24189255    24189288
24742545    24742621
24934153    24934265
25545726    25545889
25669868    25669925
25673210    25673405
25881091    25881140
26453262    26453534
26739762    26739816
26805084    26805459
27574493    27574732
28073147    28073481
28090913    28091170
28308716    28308795
28351441    28351691
28691911    28692227
28706160    28706295
28958464    28958674
29159617    29160088
29245316    29245354
29458094    29458153
30018157    30018178
30157251    30157318
30246555    30246732
30348834    30348880
30382826    30383265

Here is a visual example of a simulated DMR in the uploaded data: Screenshot from 2023-04-27 15-57-25 The X axis represents the genomic coordinates, the methylation levels of the 20 samples is on the Y axis. One DMR is shown with a red background as an example, the other DMRs are out of the X axis boundaries in this exerpt.

I have given each sample its own ID in the samplesheet.tsv. The .nextflow.log files for the run of the high methylation level background where no DMRs were found and the run with the low background where the last step is skipped are attached to this issue.

Am I missing something obvious? I have tried playing with the MR and DMR parameters a bit but had no luck. If you need anything else from me please just let me know. I would be very grateful if you could help me out. Thanks in advance.

nextflow_high_bg.log nextflow_low_bg.log

joehagmann commented 1 year ago

Hi @mkpython3

MethylScore was designed to target larger DMRs. It could well be that your rather small simulated DMRs fall below the thresholds, or that they for that reason fail the statistical test. Could you try either to 1) enlarge the simulated DMRs, or 2) to adjust some parameters, e.g. some default values seem not suitable for your experiment:

However, I bet that enlarging the simulated DMRs is more promising. To assess that, actually you could check if your MR files show a single MR in the visualized example above, i.e. if you find MR breakpoints at the simulated DMR breakpoints. If not, the MRs likely span the whole region and the 5 variable Cs are not enough to be detected.

Let us know if that helps.

mkpython3 commented 1 year ago

Hi,

thank you for your suggestions. Sadly adjusting the parameters did not help finding any of the small DMRs. Therefore I increased the size of the simulated DMRs to 50 CG sites. Now I am able to find my DMRs with a F1-Score of around 0.7. However I have noticed that it only works reliably when the methylation background is low (beta distribution a=1, b=10) and the methylation level in DMRs is high (beta dist a=10, b=1). When the background is high and the DMRs are low the F1 Score decreases to 0.5 or lower, in some cases 0. This does not seem to happen in the CHG and CHH context, at least not from my testing. Now I am a bit worried because in Barley, the organism I work with, the average CG methylation level is quite high (almost 90%). Is it possible that this observation is linked to the design of Methylscore? Is the computation method generally different for the contexts? Do you have any recommendations for adjusting the parameters again?

Btw I use this formula for the F1 Score: (2 TP) / (2 TP + FP + FN)

Best Regards and thank you in advance Marius