gem-pasteur / macsyfinder

MacSyFinder - Detection of macromolecular systems in protein datasets using systems modelling and similarity search.
GNU General Public License v3.0
51 stars 17 forks source link

Bug? Unordered db_type does not consider genes from system_ref #15

Closed lguy closed 3 years ago

lguy commented 6 years ago

Hi, First, thanks a lot for providing and maintaining macsyfinder, it's a very useful resource! I stumbled upon what I think is a bug. I was running macsyfinder with unordered and ordered_replicon modes on the same genome. Genes that were in a referred system were not picked up (marked as absent) by macsyfinder in unordered mode, whereas they were present in the ordered one. The hmmer files were identical. I'm screening Legionella pneumophila with the pT4SSi model (TXSS). Here is some code to reproduce the issue:

################################################################################
# Testing macsyfinder, db-types
################################################################################
mkdir test; cd test; mkdir fasta
TXSSPATH=/usr/local/bin/macsy/Macsyfinder_models/Data/TXSS

# Get test data
FASTA=GCA_000008485.1_ASM848v1_protein.faa
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/008/485/GCA_000008485.1_ASM848v1/$FASTA.gz
gunzip $FASTA.gz
mv $FASTA fasta/

# Create a fake conf file
echo -e "[general]\nlog_level = 10\nworker_nb = 10" > log10.conf
echo -e "[general]\nlog_level = 20\nworker_nb = 10" > log20.conf

# Run with ordererd_replicon
for TYPE in ordered_replicon unordered_replicon unordered; do 
    macsyfinder pT4SSi --db-type $TYPE -d $TXSSPATH/DEF -p $TXSSPATH/HMM --profile-suffix .hmm --sequence-db fasta/$FASTA --replicon-topology circular --coverage-profile 0.33 --i-evalue-select 0.00001 --out-dir $TYPE --config log20.conf
done

Then the log files show for the ordered_replicon mode:

====> Decision rule for putative system pT4SSi:
Mandatory genes: 
T4SS_virb4      1
Accessory genes: 
T4SS_t4cp1      1
T4SS_I_traY     0
T4SS_I_traW     0
T4SS_I_traT     0
T4SS_I_traR     0
T4SS_I_traQ     0
T4SS_I_traP     1
T4SS_I_traO     1
T4SS_I_traN     1
T4SS_I_traM     1
T4SS_I_traL     0
T4SS_I_traK     1
T4SS_I_traI     0
T4SS_I_trbA     1
T4SS_I_trbB     0
T4SS_I_traE     0
Forbidden genes: 
T4SS_MOBB       0

For the unordered/unordered_replicon one:

******************************************
pT4SSi
******************************************
Mandatory genes: 
T4SS_virb4      0
Accessory genes: 
T4SS_t4cp1      0
T4SS_I_traY     1
T4SS_I_traW     0
T4SS_I_traT     0
T4SS_I_traR     0
T4SS_I_traQ     0
T4SS_I_traP     1
T4SS_I_traO     1
T4SS_I_traN     1
T4SS_I_traM     4
T4SS_I_traL     0
T4SS_I_traK     1
T4SS_I_traI     1
T4SS_I_trbA     1
T4SS_I_trbB     0
T4SS_I_traE     0
Forbidden genes: 
T4SS_MOBB       0

My python skills are quite basic, but I could narrow the problem to the search_systems function in search_systems.py:

            if k in systems:
                # SO new: get the list of forbidden genes... Then have from hits 
                # Should better rewrite this part of the code to have a single process of the hits...
                #forbidden_genes = k.forbidden_genes # unused_var
                forbidden_hits = []
                for h in hits:
                    if h.gene.is_forbidden(k):
                        forbidden_hits.append(h)
                sub_hits = list(g) + forbidden_hits

                so = SystemOccurence(k)
                #resy=so.fill_with_hits(sub_hits) # does not return anything
                #so.fill_with_hits(sub_hits)
                so.fill_with_hits(sub_hits, True) # SO new parameter to say wether forbidden genes should be included or not. 
                _log_out.info("******************************************")
                _log_out.info(k.name)
                _log_out.info("******************************************")
                _log_out.info(so)
                systems_occurences_list.append(so)

There, it seems to me (but again, I'm a fool in python) that filling the so with sub_hits will not include the hits from the referred system (CONJ here). Using an ugly hack:

                so.fill_with_hits(hits, True) 

temporarily fixed the problem by including all hits in the so, but that's not (I think) a good solution. Hope that helps...

saphia commented 6 years ago

Dear Lionel,

Thanks a lot for your feedback, and for your interest in our tools! Yes, this helps!

We are aware of several bugs/unexpected behaviors of MacSyFinder (including the verbosity issue and the output file inconsistency you've reported in issues #13 and issues #14), we are currently working on a new version of MacSyFinder that should address the issues you have raised. We don't have a timeline yet but expect to provide soon with a new release that fixes (at least some) the issues you've reported.

As for the special issue you report here, I will look into it. Please be aware that the unordered issue can be tricky to deal with when analyzing secretion systems whose components can be from several systems, as it cannot decipher which are part of which system for sure, without genomic context. The "ordered" mode is much more powerful, but of course, one cannot always have access to the gene order... Also, if you are particularly interested in a special T4SS type, it could worth it for you to re-design a bit the XML model files so that it fits better your needs.

Thanks again,

Best wishes,

Sophie Abby

On Wed, Nov 22, 2017 at 2:08 PM, Lionel Guy notifications@github.com wrote:

Hi, First, thanks a lot for providing and maintaining macsyfinder, it's a very useful resource! I stumbled upon what I think is a bug. I was running macsyfinder with unordered and ordered_replicon modes on the same genome. Genes that were in a referred system were not picked up (marked as absent) by macsyfinder in unordered mode, whereas they were present in the ordered one. The hmmer files were identical. I'm screening Legionella pneumophila with the pT4SSi model (TXSS). Here is some code to reproduce the issue:

################################################################################

Testing macsyfinder, db-types

################################################################################ mkdir test; cd test; mkdir fasta TXSSPATH=/usr/local/bin/macsy/Macsyfinder_models/Data/TXSS

Get test data

FASTA=GCA_000008485.1_ASM848v1_protein.faa wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/008/485/GCA_000008485.1_ASM848v1/$FASTA.gz gunzip $FASTA.gz mv $FASTA fasta/

Create a fake conf file

echo -e "[general]\nlog_level = 10\nworker_nb = 10" > log10.conf echo -e "[general]\nlog_level = 20\nworker_nb = 10" > log20.conf

Run with ordererd_replicon

for TYPE in ordered_replicon unordered_replicon unordered; do macsyfinder pT4SSi --db-type $TYPE -d $TXSSPATH/DEF -p $TXSSPATH/HMM --profile-suffix .hmm --sequence-db fasta/$FASTA --replicon-topology circular --coverage-profile 0.33 --i-evalue-select 0.00001 --out-dir $TYPE --config log20.conf done

Then the log files show for the ordered_replicon mode:

====> Decision rule for putative system pT4SSi: Mandatory genes: T4SS_virb4 1 Accessory genes: T4SS_t4cp1 1 T4SS_I_traY 0 T4SS_I_traW 0 T4SS_I_traT 0 T4SS_I_traR 0 T4SS_I_traQ 0 T4SS_I_traP 1 T4SS_I_traO 1 T4SS_I_traN 1 T4SS_I_traM 1 T4SS_I_traL 0 T4SS_I_traK 1 T4SS_I_traI 0 T4SS_I_trbA 1 T4SS_I_trbB 0 T4SS_I_traE 0 Forbidden genes: T4SS_MOBB 0

For the unordered/unordered_replicon one:


pT4SSi


Mandatory genes: T4SS_virb4 0 Accessory genes: T4SS_t4cp1 0 T4SS_I_traY 1 T4SS_I_traW 0 T4SS_I_traT 0 T4SS_I_traR 0 T4SS_I_traQ 0 T4SS_I_traP 1 T4SS_I_traO 1 T4SS_I_traN 1 T4SS_I_traM 4 T4SS_I_traL 0 T4SS_I_traK 1 T4SS_I_traI 1 T4SS_I_trbA 1 T4SS_I_trbB 0 T4SS_I_traE 0 Forbidden genes: T4SS_MOBB 0

My python skills are quite basic, but I could narrow the problem to the search_systems function in search_systems.py:

        if k in systems:
            # SO new: get the list of forbidden genes... Then have from hits
            # Should better rewrite this part of the code to have a single process of the hits...
            #forbidden_genes = k.forbidden_genes # unused_var
            forbidden_hits = []
            for h in hits:
                if h.gene.is_forbidden(k):
                    forbidden_hits.append(h)
            sub_hits = list(g) + forbidden_hits

            so = SystemOccurence(k)
            #resy=so.fill_with_hits(sub_hits) # does not return anything
            #so.fill_with_hits(sub_hits)
            so.fill_with_hits(sub_hits, True) # SO new parameter to say wether forbidden genes should be included or not.
            _log_out.info("******************************************")
            _log_out.info(k.name)
            _log_out.info("******************************************")
            _log_out.info(so)
            systems_occurences_list.append(so)

There, it seems to me (but again, I'm a fool in python) that filling the so with sub_hits will not include the hits from the referred system (CONJ here). Using an ugly hack:

            so.fill_with_hits(hits, True)

temporarily fixed the problem by including all hits in the so, but that's not (I think) a good solution. Hope that helps...

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/gem-pasteur/macsyfinder/issues/15, or mute the thread https://github.com/notifications/unsubscribe-auth/AFpVUvhOVxVTlvcjHZpQ7ZtvrPemvrqoks5s5BzLgaJpZM4QnXw_ .

-- Sophie ABBY