Closed ievarau closed 10 months ago
Just to clarify my confusion: the output file rec_pos.txt has the motif and motif name. From the motif name I can tell which matrix it is, but is the motif ID (Partner 1, ...) the same as the hocomocoID used in the sites files (named, for example, real_hocomoco393_thr5)? Or how do I get the names of the sequences where the partner motif was found?
The motifs come from Hocomoco 2018 and Plant Cistrome (http://neomorph.salk.edu/dap_web/pages/index.php). All motifs are listed in the pfm_list.h file. For human, for example, the 0th motif is the Anchor, which corresponds to the homotypic Anchor-Anchor CE; the next 402 motifs are partners, which correspond to the heterotypic Anchor-Partner CEs. Note that a small portion of the partner motifs is excluded from the analysis because the motifs are too degenerate; e.g., for human, partners 170, 183, 190, 253, 280 and 324 are missing from the output. This case was described in (Levitsky et al., 2019). Also, the file out_hist contains the histogram of the distribution of composite element (CE) occurrences by motif orientation and spacer/overlap length; there you can find both the numbers of the partner motifs and their names.
I didn't quite understand your second comment. But: 1) the file rec_pos.txt contains common statistics for five thresholds: (a) the motif frequencies over all sequences, and (b) the portions of sequences with predicted sites. The numbers and names of the motifs are provided.
2) Recognition profiles are in the real_*[N]_thr5 files ([N] = 0, 1, 2, ...), a separate file for each motif: homotypic CEs (N = 0) or heterotypic CEs (N = 1, 2, ...). There you can see the listing of sequences, positions, site scores -Log10(ERR) and strands. Seq1, Seq2, etc. mean the first, second, etc. sequences. For example,
Anchor
Seq 1 Thr 0.882475 Nsites 1 612 3.028453 -
Partner
Seq 1 Thr 0.963132 Nsites 3 586 3.450005 - 822 3.588545 - 823 3.414270 +
3) Files real_*[N]_thr55.best are the lists of predicted CEs, a separate file for each pair: homotypic CEs (N = 0) or heterotypic CEs (N = 1, 2, ...). The format is described in detail in the manual; it again contains the sequential numbers of the sequences, e.g.
Anchor-Partner CEs
Seq A Start A End P Start P End Mutual Loc Loc Type Strands Mutual Ori A Score P Score A Seq P Seq
Seq 1 612 619 586 600 11S Spacer -- DirectAP 3.028453 3.450005 ttgtctct ctattaatcatgatc
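A row of the CE list above could be read roughly as follows. This is only a sketch: it assumes the columns are whitespace-separated exactly as in the example row quoted in this thread, and that a row may start with the literal word "Seq" before the sequence number. The names `CompositeElement` and `parse_best_row` are hypothetical helpers, not part of MCOT.

```python
from typing import NamedTuple


class CompositeElement(NamedTuple):
    """One predicted CE; field names follow the header line quoted in the thread."""
    seq: int
    a_start: int
    a_end: int
    p_start: int
    p_end: int
    mutual_loc: str
    loc_type: str
    strands: str
    mutual_ori: str
    a_score: float
    p_score: float
    a_seq: str
    p_seq: str


def parse_best_row(line: str) -> CompositeElement:
    """Parse one data row of a .best file (assumed whitespace-separated)."""
    tokens = line.split()
    # Assumption: a row may begin with the literal word "Seq"; drop it if present.
    if tokens and tokens[0] == "Seq":
        tokens = tokens[1:]
    if len(tokens) != 13:
        raise ValueError(f"expected 13 fields, got {len(tokens)}: {line!r}")
    return CompositeElement(
        int(tokens[0]), int(tokens[1]), int(tokens[2]),
        int(tokens[3]), int(tokens[4]),
        tokens[5], tokens[6], tokens[7], tokens[8],
        float(tokens[9]), float(tokens[10]),
        tokens[11], tokens[12],
    )


# The example row from this thread.
ce = parse_best_row(
    "Seq 1 612 619 586 600 11S Spacer -- DirectAP "
    "3.028453 3.450005 ttgtctct ctattaatcatgatc"
)
```

In this example the coordinates agree with the site sequences: the anchor spans 612-619 (8 bp) and the partner spans 586-600 (15 bp).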
Thank you for your answer. It is still quite complex, but I will try to go through the output and understand where is what one more time. But I can also try to explain what my analysis goal is one more time. I want to process the output of MCOT to retrieve the following information:
I am just struggling to navigate the high number of output files.
Could you also clarify what you mean by manual? Is it the README? Because I cannot seem to spot the explanation of format for real_*[N]_thr55.best files, but I guess you meant <*_thr5>?
Does Nsites 0 mean there was nothing found in that sequence?
And to also note - I am running the anchor_vs_many_partners command.
The lists of ignored matrices are in mcot.cpp, lines 163-164:
int bad_matrix_mm_core[] = { 172, 186, 192, 261, 287, -1 };
int bad_matrix_hs_core[] = { 170, 183, 190, 253, 280, 324, -1 };
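To see which partner numbers should actually appear in the output, the ignored-matrix lists can be subtracted from the full range of partners. The sketch below copies the array values quoted from mcot.cpp and assumes that -1 is only an end-of-list marker and that human partners are numbered 1 to 402, as stated earlier in this thread. The function names are hypothetical.

```python
# Values copied from the mcot.cpp lines quoted in this thread.
BAD_MATRIX_MM_CORE = [172, 186, 192, 261, 287, -1]
BAD_MATRIX_HS_CORE = [170, 183, 190, 253, 280, 324, -1]


def excluded_partners(bad_list):
    """Return the partner numbers before the -1 sentinel (assumed end marker)."""
    excluded = set()
    for number in bad_list:
        if number == -1:
            break
        excluded.add(number)
    return excluded


def analyzed_partners(n_partners, bad_list):
    """Partner numbers 1..n_partners that are not in the ignored list."""
    skip = excluded_partners(bad_list)
    return [n for n in range(1, n_partners + 1) if n not in skip]


# Human: 402 partners in the library, 6 of them ignored.
human_partners = analyzed_partners(402, BAD_MATRIX_HS_CORE)
```

Under these assumptions, 396 human partners remain after the six ignored matrices are removed.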
Also, Hocomoco was recently updated (https://hocomoco12.autosome.org/), so I hope to update the partner motifs too over the next couple of months.
The command line for many-partner option:
<7 pvalue_thr> = recognition threshold of motifs, transformed to the logarithmic -log10(ERR) scale of the Expected Recognition Rate (ERR); ERR is computed as the recognition rate for the whole-genome set of promoters of protein-coding genes (default value 0.0005)
<8 -log10[p-value]_thr> = threshold for displaying the enrichment significance of CEs in the output data (default value 10)
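Since both thresholds are tied to a -log10 scale, a small conversion helper can make the numbers easier to compare with the site scores in the output files. This is plain arithmetic, not MCOT code; the function names are hypothetical.

```python
import math


def to_minus_log10(value: float) -> float:
    """Convert an ERR or p-value to the -log10 scale."""
    return -math.log10(value)


def from_minus_log10(score: float) -> float:
    """Convert a -log10 score back to the raw ERR or p-value."""
    return 10.0 ** (-score)


# Default recognition threshold (argument 7) on the -log10(ERR) scale.
default_err_score = to_minus_log10(0.0005)

# Default display threshold (argument 8) as a raw p-value.
default_display_pvalue = from_minus_log10(10)
```

With the defaults above, an ERR of 0.0005 is about 3.30 on the -log10 scale, and a display threshold of 10 corresponds to a p-value of 1e-10.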
The file out_pval holds the common statistics for enrichment significance (see https://github.com/AcaDemIQ/mcot-kernel/tree/master#output-data): the summary of statistical significances for all pairs of anchor-partner motifs...
You can sort these data manually, or use our web server https://webmcot.sysbio.cytogen.ru/,
MCOT has 5 main types of CEs; they correspond to the 5 main columns for CE significance:
- Full overlap, -Log10[P-value]
- Partial overlap, -Log10[P-value]
- Overlap, -Log10[P-value]
- Spacer, -Log10[P-value]
- Any, -Log10[P-value]
These columns are for Anchor-Partner similarity:
- Similarity to Anchor, -Log10[P-value]
- Similarity to Anchor, SSD
- Similarity to Anchor, PCC
see https://doi.org/10.1093/nar/gkz800
Is it the README? Because I cannot seem to spot the explanation of format for real_*[N]_thr55.best files, but I guess you meant <*_thr5>?
No! See https://github.com/AcaDemIQ/mcot-kernel/tree/master#output-data:
- Files <*_thr5>, recognition profiles of motifs
- Files <*.best>, the list of predicted CEs.
There is another README on the web server, https://webmcot.sysbio.cytogen.ru/help. Its "Additional output data" part also covers the _thr5 and .best files and is possibly more understandable.
Does Nsites 0 mean there was nothing found in that sequence?
Yes. In the recognition profiles this number is given on the line marked with '>'.
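To list the sequences in which a motif was found, one could parse each record of a recognition profile and keep those with Nsites > 0. The sketch below is based only on the example records quoted in this thread. It assumes that a record reads "Seq <number> Thr <threshold> Nsites <count>" followed by one (position, score, strand) triple per site, and that an optional leading '>' can be ignored. `parse_profile_record` is a hypothetical helper, not part of MCOT.

```python
def parse_profile_record(text: str) -> dict:
    """Parse one recognition-profile record (format assumed from the thread)."""
    tokens = text.replace(">", " ").split()
    if (len(tokens) < 6 or tokens[0] != "Seq"
            or tokens[2] != "Thr" or tokens[4] != "Nsites"):
        raise ValueError(f"unexpected record header: {text!r}")
    n_sites = int(tokens[5])
    rest = tokens[6:]
    if len(rest) != 3 * n_sites:
        raise ValueError(f"expected {n_sites} sites, got {len(rest)} fields")
    # Each site is a (position, -log10(ERR) score, strand) triple.
    sites = [
        (int(rest[i]), float(rest[i + 1]), rest[i + 2])
        for i in range(0, len(rest), 3)
    ]
    return {
        "seq": int(tokens[1]),
        "thr": float(tokens[3]),
        "nsites": n_sites,
        "sites": sites,
    }


# The partner record quoted earlier in this thread, plus an empty record.
records = [
    parse_profile_record(
        "Seq 1 Thr 0.963132 Nsites 3 586 3.450005 - 822 3.588545 - 823 3.414270 +"
    ),
    parse_profile_record("Seq 2 Thr 0.963132 Nsites 0"),
]

# Sequence numbers with at least one predicted site.
sequences_with_sites = [r["seq"] for r in records if r["nsites"] > 0]
```

Because Seq1, Seq2, etc. denote the first, second, etc. input sequences, the resulting numbers can be mapped back to sequence names by their order in the input file. The second record here is made up for illustration.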
And to also note - I am running the anchor_vs_many_partners command.
Yes, this is the right version for testing a whole library of motifs at once.
I am just struggling to navigate the high number of output files.
If it is necessary to speed up the process, we can arrange a consultation over Zoom, this weekend or later.
The output data are really diverse since CEs have many attributes; e.g., other programs do not consider overlaps of motifs or the differing conservation of motifs within CEs, but MCOT does.
The MCOT algorithm and its basic novelty were described in the NAR 2019 paper, https://doi.org/10.1093/nar/gkz800. They were extended to the analysis of heterotypic asymmetric CEs in the IJMS 2020 paper, https://doi.org/10.3390/ijms21176023. The basic concept of MCOT was explained again in the 2022 web server paper, https://doi.org/10.3390/ijms23168981. In 2023 the MCOT algorithm was further updated toward the analysis of homotypic asymmetric CEs. I hope the paper will be published in 2024; see the preprint https://www.preprints.org/manuscript/202311.1617/v1, section 4.3, "Composite elements analysis".
Thanks so much for your detailed answer. I think I managed to find a way to extract the information I want :) Thanks a lot for uploading the motif collection. It was very helpful! :)
Dear @ievarau , can you close this issue?
Best regards, Aleksey.
Dear authors of MCOT,
As far as I understand, your motif libraries are encoded in includes "*.h". The motifs there are just numbered. I understand the motifs are coming from HOCOMOCO, but is any kind of annotation available to connect your motif numbers with the actual names of the matrices?
Thanks in advance for your answer.
Best, Ieva