gem-pasteur / macsyfinder

MacSyFinder - Detection of macromolecular systems in protein datasets using systems modelling and similarity search.
GNU General Public License v3.0
51 stars 17 forks source link

Define system with two genes that hit the same HMM #16

Closed brymerr921 closed 3 years ago

brymerr921 commented 6 years ago

Hi,

Some of the systems I want to find can be differentiated from systems I'm uninterested in because they have two genes nearby that hit the same HMM. How might I go about defining this in the XML definition file? It seems that "min_mandatory_genes_required" and "min_genes_required" refers not to the hits that come back, but the actual definition.

For example, if my system of interest has one of gene A and two of gene B, it appears that I can't include in my XML definition file just gene A and gene B and say that the min_mandatory_genes_required or min_genes_required is 3. I don't want to identify systems that have just one each of gene A and gene B. I'm sure I can think of a way to parse the outputs to eliminate these, but if I can do a better job of formatting my XML definition file, that would be best. Do you have any suggestions?

Thanks, Bryan

saphia commented 6 years ago

Dear Bryan,

First of all, thanks for your interest in our tool! To answer your question, unfortunately I cannot think of a way to deal with such a case with the current implementation of MacSyFinder. For now, it is designed to count the number of different marker genes found, and does not deal with a number of sequences found to match profiles from the definition. We will start include new features in MacSyFinder beginning of next year, we will think about this particular issue, maybe by adding a gene-level feature "number_copies_required" or something, or maybe even a system-level feature.

However, there might be a way around it... It really depends on your particular system of interest. Are the two sequences matching the same HMM profile very similar? e.g., do they have very similar scores with Hmmer? Or do you think there might be a way to distinguish them? Cause it could be possible to design a second profile in the case they are not super similar. I can give you details if you are interested. Otherwise, yes, the .summary file for instance could be parsed easily (e.g., there are actually in some columns text-version of Python dictionary with counts of the genes you can directly read and import as dictionary) to find the cases you are looking for, in that case you will define a loose definition of your system as you suggest, with only two genes required (e.g. min_genes_required = 2), and a posteriori you can check how many occurrences you got for the gene which requires several copies.

Hope this helps,

Kind regards,

Sophie

On Thu, Dec 21, 2017 at 7:30 PM, brymerr921 notifications@github.com wrote:

Hi,

Some of the systems I want to find can be differentiated from systems I'm uninterested in because they have two genes nearby that hit the same HMM. How might I go about defining this in the XML definition file? It seems that "min_mandatory_genes_required" and "min_genes_required" refers not to the hits that come back, but the actual definition.

For example, if my system of interest has one of gene A and two of gene B, it appears that I can't include in my XML definition file just gene A and gene B and say that the min_mandatory_genes_required or min_genes_required is 3. I don't want to identify systems that have just one each of gene A and gene B. I'm sure I can think of a way to parse the outputs to eliminate these, but if I can do a better job of formatting my XML definition file, that would be best. Do you have any suggestions?

Thanks, Bryan

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/gem-pasteur/macsyfinder/issues/16, or mute the thread https://github.com/notifications/unsubscribe-auth/AFpVUo4lRB5zYodIgKWmCoMwaw_BTIAzks5tCqPZgaJpZM4RKMQ7 .

-- Sophie ABBY

jeanrjc commented 6 years ago

Hello,

If I understood well, maybe you could just copy your .HMM file and give it another name, and then use it in the definition. For instance:

- HMM
    \__ geneA.hmm
    \__ geneB.hmm
    \__ geneBb.hmm # a strict copy of geneB.hmm
- def
    \__ my_system.xml

And my_system.xml contains:

<system min_genes_required="3">  
    <gene name="geneA" presence="mandatory" />
    <gene name="geneB" presence="mandatory" />
    <gene name="geneBb" presence="mandatory" />
</system>

I believe macsyfinder won't see the difference, but I may be wrong. One issue might be that there will be 2 hits of geneB (instead of geneB and geneBb), and that macsyfinder will output an incomplete system. Let us know if you try that (provided it was your actual problem). If it works, it might even work with symlinks, which would be more convenient instead of copying an entire set of HMM profiles. Anyway, the gene level feature would be better.

Best, Jean

brymerr921 commented 6 years ago

Sophie and Jean,

Thanks so much for your replies.

Sophie, I can't think of an easy way to distinguish the two genes, as they are really quite similar to each other within a genome. One HMM profile is really the best way to identify either of these genes which sit next to each other.

Jean, you described my problem exactly. I tried your suggestion, but it gave me an incomplete system because it just detected two of "geneB". If I put "geneBb" first in the XML file, it detected two of "geneBb" and zero of "geneB". It appears that when evaluating "min_mandatory_genes_required" and "min_genes_required" it's going off the number of unique genes, and not the sum of genes that have hits. Also, it looks like when two identical but differently named HMMs are requested, it reports just the first one.

I also tried to make "geneBb" an exchangeable homolog of "geneB" as well as "geneB" and exchangeable homolog of "geneBb" and switched "min_genes_required" to 3, and it still failed. I also changed "geneBb" to accessory, and it still reported an incomplete system.

Example system definition file:

<system inter_gene_max_space="0" min_mandatory_genes_required="2" min_genes_required="3">

<gene name="geneA" presence="mandatory"/>
<gene name="geneB" presence="mandatory" exchangeable="1">
  <homologs>
    <gene name="geneBb"/>
  </homologs>
</gene>
<gene name="geneBb" presence="mandatory" exchangeable="1">
  <homologs>
    <gene name="geneB"/>
  </homologs>
</gene>
</system>

I'm happy to test any other suggestions, and will look at parsing the .summary file as you suggested, Sophie. Also, the "number_copies_required" option sounds amazing! Please let me know if you have any questions or if there's anything I can help with.

Also, thanks for this wonderful program! It's going to make a huge difference in the research I'm doing.

Best, Bryan

brymerr921 commented 6 years ago

One small update: I'm also searching for systems defined by two genes that are next to each other where both match the same HMM profile. When I define a system containing just one HMM, the program fails to detect a system and I can't get a report file that I can then parse, aside from the macsyfinder.out file.

brymerr921 commented 6 years ago

A workaround for identifying two genes next to each other where both match the same profile can be done by defining the following system:

<system inter_gene_max_space="0" min_mandatory_genes_required="1" min_genes_required="1">
<gene name="geneA" presence="mandatory" loner="1"/>\
</system>

The output contains all instances of this gene whether they are alone or side-by-side and I can use this to only extract those genes that have two copies side-by-side in the same system. I'll do this for now.

If I remove loner="1" it appears that the only systems detected are two genes side-by-side but an empty report is generated. Output:

************************************
 Analyzing clusters
************************************

--- Cluster geneA ? ---
['405', '406']
['geneA', 'geneA']
[398, 399]
------- next -------

--- Cluster geneA ? ---
['1544', '1545']
['geneA', 'geneA']
[1506, 1507]
------- next -------

--- Cluster geneA ? ---
['1585', '1586']
['geneA', 'geneA']
[1547, 1548]
------- next -------

--- Cluster geneA ? ---
['2637', '2638']
['geneA', 'geneA']
[2572, 2573]
------- next -------

***************************************************
******* Report scattered/uncomplete systems *******
***************************************************

******************************************

******************************************
Building reports of detected systems

Counter()
******************************************