DMRs have multiple entries, methylation rate missing values

mkpython3 commented 1 year ago

Hi,

I used Methylscore with default parameters to call DMRs between 23 samples separately for each context. I have noticed that some DMRs show up multiple times with one of their "methylation rate" columns (Nr. 6+) missing a value. This might be a side effect of me changing the pipeline from using tbi index to csi. The change was necessary to handle larger genomes which are not supported by tbi, however only a small proportion of the DMRs shows this weird behavior, therefore I am unsure if it is actually caused by the different index.

Here is an example:

chr1H   83540   84016   477 -3-2-.3-.21--223333-212 1:  2:28    3:64    1:HvBS013,HvBS024   2:HvBS005,HvBS012,HvBS016,HvBS017,HvBS023,HvBS048   3:HvBS002,HvBS008,HvBS018,HvBS019,HvBS020,HvBS021   #:26    CG  -
chr1H   83540   84016   477 -3-2-.3-.21--223333-212 1:89    2:  3:64    1:HvBS013,HvBS024   2:HvBS005,HvBS012,HvBS016,HvBS017,HvBS023,HvBS048   3:HvBS002,HvBS008,HvBS018,HvBS019,HvBS020,HvBS021   #:26    CG  -
chr1H   83540   84016   477 -3-2-.3-.21--223333-212 1:89    2:28    3:  1:HvBS013,HvBS024   2:HvBS005,HvBS012,HvBS016,HvBS017,HvBS023,HvBS048   3:HvBS002,HvBS008,HvBS018,HvBS019,HvBS020,HvBS021   #:26    CG  CG

The information in these lines is identical except for the methylation rate columns and the last column.

Do you have an idea of the exact cause or could you help me create a patch to use csi index correctly as it is necessary to handle sequences longer than 512 Mbp.

Thanks in advance, Marius

harmn commented 1 year ago

I ran into the same issue, which seems like the result of incomplete merging after differential methylation testing.

I think the cause is that one of the intermediate files is not guaranteed to be sorted on chromosome and position, while one of the subsequent steps assumes it is.

The differential methylation testing is done in a pairwise fashion between clusters by the SEGMENTS-contexts.pl script, and each comparison produces a line in the intermediate segments.dif file. So for one MR with three clusters that gives three lines.

Some steps later, the merge_DMRs-contexts.pl script combines these lines into one line per DMR to produce the final BED files. This merge_DMRs-contexts.pl script assumes that these lines are together, but for me that was not always the case. My guess is that this is due to the fact that the pairwise cluster comparisons run in parallel on four threads and sometimes a thread for the next MR already finishes before all threads of the previous MR have finished.

The SEGMENTS-contexts.pl script actually contains code to sort the segments.dif file, so I assume the issue surfaced during development, but that code is only run when the post_process flag is set (line 475), which is disabled in the call_DMRs.nf nextflow file with the "--no-post-process" option. The post-process step seems to involve more that just sorting, so enabling it might be tricky.

Forcing the pipeline to only use a single CPU might help to avoid the issue.

mkpython3 commented 1 year ago

Thank you for your insight, this might prove very useful for me.

Yes, meanwhile I also figured that it has nothing to do with my code edit. I have resorted to merging those lines manually. The only columns that may be different in those lines are the mean methylation per cluster columns, the number of sites columns (starting with #:) and the last column. The mean methylation columns never show different values, one is just always missing. The number of sites columns on the other hand can show different values. During merging, I am going with the maximum number there.

But I also found some special cases where the number of rows is neither equal to the number of clusters nor the number of possible pairwise combinations:

chr1H   190573  191057  485 41342411414121113412241 1:  2:  3:64    4:90    1:HvBS002,HvBS008,HvBS009,HvBS012,HvBS014,HvBS016,HvBS017,HvBS018,HvBS021,HvBS048   2:HvBS006,HvBS015,HvBS022,HvBS023   3:HvBS004,HvBS019   4:HvBS001,HvBS005,HvBS007,HvBS011,HvBS013,HvBS020,HvBS024   #:40    CG  -
chr1H   190573  191057  485 41342411414121113412241 1:  2:27    3:  4:90    1:HvBS002,HvBS008,HvBS009,HvBS012,HvBS014,HvBS016,HvBS017,HvBS018,HvBS021,HvBS048   2:HvBS006,HvBS015,HvBS022,HvBS023   3:HvBS004,HvBS019   4:HvBS001,HvBS005,HvBS007,HvBS011,HvBS013,HvBS020,HvBS024   #:40    CG  CG
chr1H   190573  191057  485 41342411414121113412241 1:  2:27    3:64    4:  1:HvBS002,HvBS008,HvBS009,HvBS012,HvBS014,HvBS016,HvBS017,HvBS018,HvBS021,HvBS048   2:HvBS006,HvBS015,HvBS022,HvBS023   3:HvBS004,HvBS019   4:HvBS001,HvBS005,HvBS007,HvBS011,HvBS013,HvBS020,HvBS024   #:40    CG  -
chr1H   190573  191057  485 41342411414121113412241 1:7 2:  3:  4:90    1:HvBS002,HvBS008,HvBS009,HvBS012,HvBS014,HvBS016,HvBS017,HvBS018,HvBS021,HvBS048   2:HvBS006,HvBS015,HvBS022,HvBS023   3:HvBS004,HvBS019   4:HvBS001,HvBS005,HvBS007,HvBS011,HvBS013,HvBS020,HvBS024   #:40    CG  CG
chr1H   190573  191057  485 41342411414121113412241 1:7 2:27    3:64    4:  1:HvBS002,HvBS008,HvBS009,HvBS012,HvBS014,HvBS016,HvBS017,HvBS018,HvBS021,HvBS048   2:HvBS006,HvBS015,HvBS022,HvBS023   3:HvBS004,HvBS019   4:HvBS001,HvBS005,HvBS007,HvBS011,HvBS013,HvBS020,HvBS024   #:40    CG  CG

Is there any way to tell Nextflow to run everything in single core? I will now try running Methylscore on a computation node with only one core and see if this helps.

Best Regards Marius

harmn commented 1 year ago

The code it will merge lines for the same MR that are next to each other in the intermediate segments.sif file (check the "work" directory) and in your example it seems that only the 1vs2 and 1vs3 pairwise comparisons where merged, so the lines for the other comparisons were separated by lines from other MRs (multithreading makes the order unpredictable).

The cores seem to be configured in the base,.config file, on my system in: ~/.nextflow/assets/Computomics/MethylScore/conf/base.config

The call_DMR step has label "resource_medium", so you could change the configuration block below and set cpus to 1

withLabel: resource_medium {
    cpus   = 4
    memory = { 16.GB * task.attempt }
    time   = { 6.h   * task.attempt }
  }

mkpython3 commented 1 year ago

Thank you for your help. Running Methylscore with cpu = 1 in the resource_medium label fixed the problem for me.

Computomics / MethylScore

DMRs have multiple entries, methylation rate missing values #5