HS core motif library - GitHub issues

ievarau commented 11 months ago

Dear authors of MCOT,

As far as I understand, your motif libraries are encoded in the "*.h" include files, where the motifs are just numbered. I understand the motifs come from HOCOMOCO, but is any annotation available that connects your motif numbers to the actual names of the matrices?

Thanks in advance for your answer.

Best, Ieva

ievarau commented 11 months ago

Just to clarify my confusion: the output file rec_pos.txt has the motif number and the motif name. From the motif name I can tell which matrix it is, but is the motif ID (Partner 1, ...) the same as the hocomoco ID used in the site files (named, for example, real_hocomoco393_thr5)? Or how do I get the names of the sequences where the partner motif was found?

parthian-sterlet commented 11 months ago

The motifs come from Hocomoco 2018 and Plant Cistrome (http://neomorph.salk.edu/dap_web/pages/index.php). All motifs are listed in the pfm_list.h file. For human, for example, the 0th motif is the Anchor, which corresponds to the homotypic Anchor-Anchor CE; the next 402 motifs are partners, which correspond to the heterotypic Anchor-Partner CEs. Note that a small portion of the partner motifs is excluded from the analysis because the motifs are too degenerate; for human, partners 170, 183, 190, 253, 280 and 324 are missing from the output. This case was described in (Levitsky et al., 2019). Also, the file out_hist contains the histogram of composite element (CE) occurrences by motif orientation and spacer/overlap length; there you can find both the numbers of the partner motifs and their names.

parthian-sterlet commented 11 months ago

I didn't quite understand your second comment, but: 1) The file rec_pos.txt contains common statistics for five thresholds: (a) the motif frequencies over all sequences, and (b) the portions of sequences with predicted sites. The numbers and names of the motifs are provided.

2) Recognition profiles are in the real_*[N]_thr5 files ([N] = 0, 1, 2, ...), a separate file for each motif, for homotypic CEs (N = 0) or heterotypic CEs (N = 1, 2, ...). They list the sequences, positions, site scores -Log10(ERR) and strands. Seq1, Seq2, etc. mean the first, second, etc. sequences. For example,

Anchor

Seq 1 Thr 0.882475 Nsites 1 612 3.028453 -

Partner

Seq 1 Thr 0.963132 Nsites 3 586 3.450005 - 822 3.588545 - 823 3.414270 +

3) The real_*[N]_thr55.best files are lists of predicted CEs, a separate file for each pair, for homotypic CEs (N = 0) or heterotypic CEs (N = 1, 2, ...). The format is described in detail in the manual; it again contains the sequential numbers of the sequences, e.g.

Anchor-Partner CEs

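| Seq | A Start | A End | P Start | P End | Mutual Loc | Loc Type | Strands | Mutual Ori | A Score | P Score | A Seq | P Seq |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Seq 1 | 612 | 619 | 586 | 600 | 11S | Spacer | -- | DirectAP | 3.028453 | 3.450005 | ttgtctct | ctattaatcatgatc |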
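For anyone post-processing these lists, here is a minimal sketch of how the example row above could be split into named fields. It assumes the whitespace-separated layout shown in this comment (a literal `Seq` token, the sequence number, then twelve more fields); the function name `parse_best_row` and the dictionary keys are invented for illustration and are not part of MCOT. Real files may be delimited differently, so check against the README before relying on it.

```python
def parse_best_row(line):
    """Parse one CE row, assuming the whitespace-separated layout of the example."""
    tok = line.split()
    if len(tok) != 14 or tok[0] != "Seq":
        raise ValueError("unexpected row layout: %r" % line)
    return {
        "seq": int(tok[1]),            # sequential number of the input sequence
        "a_start": int(tok[2]), "a_end": int(tok[3]),   # anchor site
        "p_start": int(tok[4]), "p_end": int(tok[5]),   # partner site
        "mutual_loc": tok[6], "loc_type": tok[7],
        "strands": tok[8], "mutual_ori": tok[9],
        "a_score": float(tok[10]), "p_score": float(tok[11]),
        "a_seq": tok[12], "p_seq": tok[13],
    }


row = parse_best_row(
    "Seq 1 612 619 586 600 11S Spacer -- DirectAP "
    "3.028453 3.450005 ttgtctct ctattaatcatgatc"
)
# Sanity check: site length from the coordinates equals the printed site length.
print(row["a_end"] - row["a_start"] + 1 == len(row["a_seq"]))
```

In the example row, the gap between the partner end (600) and the anchor start (612) is 11 bp, which appears consistent with the `11S` value in the Mutual Loc column.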

ievarau commented 11 months ago

Thank you for your answer. It is still quite complex, but I will go through the output once more and try to understand what is where. Let me also explain my analysis goal again. I want to process the output of MCOT to retrieve the following information:

partner motif (the actual frequency matrix)
the p-value or some metric that would say if a partner motif is significant (I want only significant)
The names of the original input sequences, where that motif was found.

I am just struggling to navigate the high number of output files.

ievarau commented 11 months ago

Could you also clarify what you mean by manual? Is it the README? Because I cannot seem to spot the explanation of format for real_*[N]_thr55.best files, but I guess you meant <*_thr5>? Does Nsites 0 mean there was nothing found in that sequence?

And to also note - I am running anchor_vs_many_partners command.

parthian-sterlet commented 11 months ago

partner motif (the actual frequency matrix)

The anchor_vs_one version (mcot_anchor.cpp) takes two frequency matrices as input (anchor and partner). The anchor_vs_many version (mcot.cpp) takes no partner frequency matrices as input; to speed up the calculation they were all transformed to weight matrices, see the files dapseq_pwm.h, hocomoco_pwm_hs_core.h, hocomoco_pwm_hs_full.h, hocomoco_pwm_mm_core.h and hocomoco_pwm_mm_full.h.

The original hocomoco frequency matrices were downloaded from https://hocomoco11.autosome.org/downloads_v11 (CORE COLLECTION, Human Mononucleotide and Mouse Mononucleotide):

https://hocomoco11.autosome.org/final_bundle/hocomoco11/core/HUMAN/mono/HOCOMOCOv11_core_pcm_HUMAN_mono.tar.gz
https://hocomoco11.autosome.org/final_bundle/hocomoco11/core/MOUSE/mono/HOCOMOCOv11_core_pcm_MOUSE_mono.tar.gz

However, mcot currently uses the frozen hocomoco versions from 2018 (a few matrices have since been changed or deleted by the hocomoco team, and I haven't kept up with all these updates). So I put the exact frequency matrices for the hocomoco human and mouse collections and dapseq (402/358/528 matrices) here: https://github.com/parthian-sterlet/mcot-kernel/blob/master/examples/many/matrices.7z

The lists of ignored matrices are in mcot.cpp, lines 163-164:

int bad_matrix_mm_core[] = { 172, 186, 192, 261, 287, -1 };
int bad_matrix_hs_core[] = { 170, 183, 190, 253, 280, 324, -1 };
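As a small illustration of what these lists mean for post-processing, the sketch below enumerates the partner numbers that should still appear in the output. It assumes that partners are numbered 1 to 402 for the human core library and that the values in the arrays are those same partner numbers, as the earlier comment about missing partners suggests; the trailing -1 is presumably just an end marker and is left out. The names `BAD_MATRIX_HS_CORE`, `BAD_MATRIX_MM_CORE` and `reported_partners` are made up for this example.

```python
# Values copied from the two arrays quoted above, without the trailing -1.
BAD_MATRIX_MM_CORE = {172, 186, 192, 261, 287}
BAD_MATRIX_HS_CORE = {170, 183, 190, 253, 280, 324}


def reported_partners(n_partners, excluded):
    """Partner numbers 1..n_partners that are not in the excluded set."""
    return [n for n in range(1, n_partners + 1) if n not in excluded]


hs_partners = reported_partners(402, BAD_MATRIX_HS_CORE)
print(len(hs_partners))  # 396
```

A script that loops over the per-partner output files can use such a list to avoid treating the excluded numbers as unexpected gaps.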

Also, Hocomoco was recently updated (https://hocomoco12.autosome.org/), so I hope to update the partner motifs too over the next couple of months.

parthian-sterlet commented 11 months ago

the p-value or some metric that would say if a partner motif is significant (I want only significant)

https://github.com/AcaDemIQ/mcot-kernel/tree/master#command-line-arguments

The command line arguments for the many-partner option:
<7 pvalue_thr> = recognition threshold of motifs, transformed to the logarithmic -log10(ERR) scale of the Expected Recognition Rate (ERR); ERR is computed as the recognition rate for the whole-genome set of promoters of protein-coding genes (default value 0.0005)
<8 -log10[p-value]_thr> = threshold for displaying the enrichment significance of CEs in the output data (default value 10)
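To make the two scales concrete, here is a short sketch of the conversion between a raw rate or p-value and its -log10 value. It only restates the arithmetic implied by the argument descriptions; the helper names `to_minus_log10` and `from_minus_log10` are invented for this example and are not MCOT functions.

```python
import math


def to_minus_log10(value):
    """Convert a rate or p-value in (0, 1] to the -log10 scale."""
    return -math.log10(value)


def from_minus_log10(score):
    """Convert a -log10 score back to the raw rate or p-value."""
    return 10.0 ** (-score)


# Default recognition threshold: ERR = 0.0005
print(round(to_minus_log10(0.0005), 3))  # 3.301
# Default display threshold for CE enrichment: -log10[p-value] = 10
print(from_minus_log10(10))
```

On this scale a larger value means a stricter threshold or a more significant result.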

File out_pval holds the common statistics for enrichment significance (https://github.com/AcaDemIQ/mcot-kernel/tree/master#output-data); the README describes it as the summary of statistical significances for all pairs of anchor-partner motifs.

You can sort these data manually or use our web server, https://webmcot.sysbio.cytogen.ru/. MCOT has 5 main types of CEs, which correspond to the 5 main columns for CE significance:

Full overlap, -Log10[P-value]
Partial overlap, -Log10[P-value]
Overlap, -Log10[P-value]
Spacer, -Log10[P-value]
Any, -Log10[P-value]

These columns describe the Anchor-Partner similarity:

Similarity to Anchor, -Log10[P-value]
Similarity to Anchor, SSD
Similarity to Anchor, PCC

See https://doi.org/10.1093/nar/gkz800
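To keep only the significant partners, one option is to filter the rows of the significance table on these columns. The sketch below is written under several assumptions: the rows have already been read into dictionaries keyed by column header, the header strings match the five names listed in this comment exactly, and the `Partner` key and the example values are invented for illustration. The threshold of 10 is the default display threshold mentioned for the command line.

```python
CE_COLUMNS = [
    "Full overlap, -Log10[P-value]",
    "Partial overlap, -Log10[P-value]",
    "Overlap, -Log10[P-value]",
    "Spacer, -Log10[P-value]",
    "Any, -Log10[P-value]",
]


def significant_pairs(rows, threshold=10.0, columns=CE_COLUMNS):
    """Keep rows whose largest -Log10[P-value] over the columns meets the threshold."""
    kept = []
    for row in rows:
        best = max(float(row[c]) for c in columns)
        if best >= threshold:
            kept.append(row)
    return kept


# Invented rows for illustration only; "Partner" is a hypothetical key.
rows = [
    dict(zip(CE_COLUMNS, ["0.5", "2.0", "2.1", "12.3", "14.1"]), Partner="motif_A"),
    dict(zip(CE_COLUMNS, ["0.1", "0.3", "0.3", "1.2", "1.0"]), Partner="motif_B"),
]
print([r["Partner"] for r in significant_pairs(rows)])  # ['motif_A']
```

Passing a single column, for example `columns=["Spacer, -Log10[P-value]"]`, restricts the filter to one CE type.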

parthian-sterlet commented 11 months ago

The names of the original input sequences, where that motif was found.

You should extract the names of the sequences from the FASTA file on your own, since MCOT does not use names; MCOT uses the sequential number in all files (Seq 1, Seq 2, etc.). This applies to the motif recognition files real_hocomoco[N]_thr5 and the CE list files real_*[N]_thr55.best.
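A minimal sketch of that lookup, assuming the sequential numbers are 1-based and follow the order of the records in the input FASTA file (Seq 1 is the first record, and so on). The function names and the example headers are invented for illustration.

```python
def fasta_names(lines):
    """Return FASTA record names in file order (the text after '>' on header lines)."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def name_for_seq(names, seq_number):
    """Map a 1-based 'Seq N' number to the N-th FASTA header."""
    if not 1 <= seq_number <= len(names):
        raise IndexError("Seq %d is outside the FASTA file" % seq_number)
    return names[seq_number - 1]


# In practice the lines would come from the input file, e.g.
#   with open("peaks.fa") as handle:   # hypothetical file name
#       names = fasta_names(handle)
fasta_lines = [
    ">peak_1 chr1:100-300", "ACGTACGT",
    ">peak_2 chr2:500-700", "TTGACCA",
]
names = fasta_names(fasta_lines)
print(name_for_seq(names, 2))  # peak_2 chr2:500-700
```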

parthian-sterlet commented 11 months ago

Is it the README? Because I cannot seem to spot the explanation of format for real_*[N]_thr55.best files, but I guess you meant <*_thr5>?

No! See https://github.com/AcaDemIQ/mcot-kernel/tree/master#output-data

Files <*_thr5>, recognition profiles of motifs

Files <*.best>, the list of predicted CEs.

Another README is on the web server, https://webmcot.sysbio.cytogen.ru/help. The Additional output data described there also include the _thr5 and .best files; it is possibly more understandable.

parthian-sterlet commented 11 months ago

Does Nsites 0 mean there was nothing found in that sequence?

Yes. In the recognition profiles this number is given after '>'.
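To list the sequences without sites automatically, a recognition profile can be read token by token. The sketch below assumes the layout of the example earlier in this thread: a header of the form `Seq <n> Thr <t> Nsites <k>` (optionally preceded by '>'), followed by k triplets of position, score and strand. The function name `parse_profile` is invented, and the second record in the example text is made up to show the Nsites 0 case.

```python
def parse_profile(text):
    """Collect (position, score, strand) sites per sequence from profile text."""
    tok = text.replace(">", " ").split()
    records, i = [], 0
    while i + 5 < len(tok):
        # Look for the header pattern: Seq <n> Thr <t> Nsites <k>
        if (tok[i], tok[i + 2], tok[i + 4]) != ("Seq", "Thr", "Nsites"):
            i += 1
            continue
        n_sites = int(tok[i + 5])
        body = tok[i + 6 : i + 6 + 3 * n_sites]
        if len(body) != 3 * n_sites:
            raise ValueError("truncated site list for Seq " + tok[i + 1])
        sites = [(int(body[j]), float(body[j + 1]), body[j + 2])
                 for j in range(0, len(body), 3)]
        records.append({"seq": int(tok[i + 1]), "thr": float(tok[i + 3]),
                        "sites": sites})
        i += 6 + 3 * n_sites
    return records


# First record is the Partner example from this thread; the second is invented.
example = """
Seq 1 Thr 0.963132 Nsites 3 586 3.450005 - 822 3.588545 - 823 3.414270 +
Seq 2 Thr 0.963132 Nsites 0
"""
records = parse_profile(example)
without_sites = [r["seq"] for r in records if not r["sites"]]
print(without_sites)  # [2]
```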

And to also note - I am running anchor_vs_many_partners command.

Yes, this is the right version for running a whole library of motifs at once.

parthian-sterlet commented 11 months ago

I am just struggling to navigate the high number of output files.

If it is necessary to speed up the process, we can arrange a consultation over Zoom, this weekend or later.

The output data are really diverse since CEs have many attributes; e.g. other programs do not consider overlaps of motifs or the varying conservation of motifs within CEs, but MCOT does. The MCOT algorithm and its basic novelty were presented in the NAR 2019 paper (https://doi.org/10.1093/nar/gkz800). They were extended to analyze heterotypic asymmetric CEs in the IJMS 2020 paper (https://doi.org/10.3390/ijms21176023). The basic concept of MCOT was explained again in the 2022 web server paper (https://doi.org/10.3390/ijms23168981). In 2023 the MCOT algorithm was further updated toward the analysis of homotypic asymmetric CEs. I hope the paper will be published in 2024; see https://www.preprints.org/manuscript/202311.1617/v1, section 4.3, Composite elements analysis.

ievarau commented 11 months ago

Thanks so much for your detailed answer. I think I managed to find a way to extract the information I want :) Thanks a lot for uploading the motif collection. It was very helpful! :)

AcaDemIQ commented 11 months ago

Dear @ievarau, can you close this issue?

Best regards, Aleksey.

AcaDemIQ / mcot-kernel

HS core motif library #9
