davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

Genes from proteome/species not descendant of Nx.tsv are present in Nx.tsv. #602

Closed matrs closed 3 years ago

matrs commented 3 years ago

Hello, I'm trying to define single-copy orthogroups from the Nx.tsv files. I'm getting results that I find confusing, so I wrote a couple of lines to check whether a given Nx.tsv contains only genes from the descendant species of that node, which is what I expect. Say I take N11.tsv: in the species tree, this node has two descendant species:

['MGYG-HGUT-04532',
 'DGYMR06203__metabat2_low_PE']

Then I loop over all the Nx.tsv files and check the MGYG-HGUT-04532 column in each one. I expect to find genes only in N11.tsv and in the files of its ancestors:

[Tree node 'N7' (0x7f514471e49),
 Tree node 'N3' (0x7f5147961be),
 Tree node 'N1' (0x7f51478373a),
 Tree node 'N0' (0x7f514471e46)]
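
import pandas as pd

# root is assumed to be a pathlib.Path to the folder holding the Nx.tsv files
nodes = [f'N{n}.tsv' for n in range(194)]
for n in nodes:
    # na_filter=False keeps empty cells as '' instead of NaN
    n_df = pd.read_csv(root.joinpath(n), sep='\t', na_filter=False)
    print(n, n_df.loc[:, 'MGYG-HGUT-04532'].unique(), sep='\n')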

Which produces:

N0.tsv
['' 'GFNMCGMP_00924, GFNMCGMP_01074, GFNMCGMP_01611'
 'GFNMCGMP_00164, GFNMCGMP_00168' ... 'GFNMCGMP_00320' 'GFNMCGMP_00321'
 'GFNMCGMP_00381, GFNMCGMP_00380']
N1.tsv
['' 'GFNMCGMP_00924, GFNMCGMP_01074, GFNMCGMP_01611'
 'GFNMCGMP_00164, GFNMCGMP_00168' ... 'GFNMCGMP_00320' 'GFNMCGMP_00321'
 'GFNMCGMP_00381, GFNMCGMP_00380']
N2.tsv
['']
N3.tsv
['' 'GFNMCGMP_00924, GFNMCGMP_01074, GFNMCGMP_01611'
 'GFNMCGMP_00164, GFNMCGMP_00168' ... 'GFNMCGMP_00320' 'GFNMCGMP_00321'
 'GFNMCGMP_00381, GFNMCGMP_00380']
N4.tsv
['']
N5.tsv
['']
N6.tsv
['']
N7.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_00320'
 'GFNMCGMP_00321' 'GFNMCGMP_00381, GFNMCGMP_00380']
N8.tsv
['']
N9.tsv
['']
N10.tsv
['']
N11.tsv
['' 'GFNMCGMP_00750, GFNMCGMP_00293' 'GFNMCGMP_00570'
 'GFNMCGMP_01197, GFNMCGMP_00667' 'GFNMCGMP_00341'
...]
N12.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
N13.tsv
['']
N14.tsv
['']
... empty lists
['']
N20.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
N29.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
...
followed by empty lists

So N12.tsv, N20.tsv and N29.tsv show genes for MGYG-HGUT-04532, although none of these nodes is a descendant or ancestor of N11. I tried other species and nodes, but the result is always the same. Maybe I'm misunderstanding how this works, and I'd appreciate any help. I'm attaching the tree file and a couple of Nx.tsv files.
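For anyone who wants to repeat this check without pandas, here is a minimal sketch using only the standard library. It assumes that the first three columns of an Nx.tsv file are metadata (HOG, OG, Gene Tree Parent Clade) and that every other column is one species. The function names and the toy table are made up for illustration and are not part of OrthoFinder.

```python
import csv
import io

# Assumption: the first three columns of an Nx.tsv file are metadata and
# every remaining column is one species (proteome).
META_COLUMNS = {"HOG", "OG", "Gene Tree Parent Clade"}


def species_with_genes(handle):
    """Return the species columns that contain at least one gene."""
    found = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        for column, cell in row.items():
            if column is None or column in META_COLUMNS:
                continue  # skip metadata and any overflow fields
            if cell and cell.strip():
                found.add(column)
    return found


def unexpected_species(handle, descendants):
    """Species that have genes in the file but are not descendants of the node."""
    return species_with_genes(handle) - set(descendants)


# Toy table with made-up species and gene names (not real OrthoFinder output).
demo = (
    "HOG\tOG\tGene Tree Parent Clade\tsp_A\tsp_B\tsp_C\n"
    "N1.HOG0000000\tOG0000000\tn0\tgene1, gene2\t\tgene3\n"
    "N1.HOG0000001\tOG0000001\tn1\tgene4\t\t\n"
)
print(unexpected_species(io.StringIO(demo), {"sp_A", "sp_B"}))
```

For a real file, pass an open handle instead of the StringIO object, for example `open(path, newline="")`, together with the leaf names of the node taken from the species tree. An empty set means the file only holds genes from the expected species.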

I'm running OrthoFinder 2.5.2.

Jose Luis

SpeciesTree_rooted_node_labels.txt

Ns.zip

davidemms commented 3 years ago

Hi Jose Luis

That's very strange, these *.tsv files don't seem to correspond at all to the SpeciesTree_rooted_node_labels.txt file. E.g.

N12.tsv contains genes from 4 species: MGYG-HGUT-04532, bin3c.184.contigs, X355_Hoffmanns_Two_toed_Sloth__metabat2_high_PE.021.contigs & GCF_001683795.1_ASM168379v1_genomic, but these species are distributed quite widely across the attached species tree.

And the same for N29.tsv.

Could you describe the steps taken in OrthoFinder to produce these? Was it just a single run from the start? What commands did you use?

All the best David

matrs commented 3 years ago

Hello David, thank you very much for your prompt answer. The previous files come from a run that builds on earlier OrthoFinder runs (I tested a few options). To help find the problem, I'm attaching another related run that has exactly the same problem but builds directly on the "original run". The original run here is Jul21, which doesn't appear to have this problem. That run was:

orthofinder -f faas -t 28 -a 8

Then, using those results I ran:

orthofinder -b Results_Jul21 -f  extra_faas -M msa -y -t 28 -a 8

This created files that have the problem (I'm attaching them with the log, Jul29). This last run added 3 genomes and removed one, genome 36 in the log file. (The files attached in the original post come from this run, but specifying a tree with -ft -s.)

For example, when looking at the N3 node in this Jul29 tree:

tree.search_nodes(name='N3')[0].get_leaf_names()
[ ]: ['MGYG-HGUT-04532', 'DGYMR06203__metabat2_low_PE.047.contigs']
tree.search_nodes(name='N3')[0].get_ancestors()
[ ]: [Tree node 'N1' (0x7f07b1822cd), Tree node 'N0' (0x7f07b25387f)]

Then, looking at the Nx.tsv files for MGYG-HGUT-04532, I also get genes in N4, N7 and N11:

N4.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
N5.tsv
['']
N6.tsv
['']
N7.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
N8.tsv
['']
N9.tsv
['']
N10.tsv
['']
N11.tsv
['' 'GFNMCGMP_00924' 'GFNMCGMP_01074' ... 'GFNMCGMP_01376'
 'GFNMCGMP_01174' 'GFNMCGMP_00331']
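The expectation used here (a species should have genes only in the files of nodes that have it as a descendant) can also be derived straight from the labelled species tree. Below is a rough sketch with a small hand-written Newick reader, so it needs no extra packages. The reader ignores branch lengths and does not handle quoted labels, and the toy tree and species names are invented, not taken from the attached files.

```python
def parse_newick(text):
    """Parse a Newick string into nested (label, children) tuples.

    Simplified reader: branch lengths are dropped, quoted labels and
    comments are not supported.
    """
    text = text.strip().rstrip(";")
    pos = 0

    def read_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children.append(read_node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(read_node())
            pos += 1  # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        label = text[start:pos].split(":")[0].strip()
        return label, children

    return read_node()


def leaf_sets(tree):
    """Map each internal node label to the set of leaf names below it."""
    result = {}

    def walk(node):
        label, children = node
        if not children:
            return {label}
        leaves = set()
        for child in children:
            leaves |= walk(child)
        result[label] = leaves
        return leaves

    walk(tree)
    return result


def nodes_expected_for(species, sets):
    """Node labels whose Nx.tsv file may contain genes of this species."""
    return sorted(label for label, leaves in sets.items() if species in leaves)


# Toy tree with made-up species names.
toy = "((sp_A:0.1,sp_B:0.2)N1:0.3,(sp_C:0.1,(sp_D:0.1,sp_E:0.2)N3:0.1)N2:0.2)N0;"
sets = leaf_sets(parse_newick(toy))
print(nodes_expected_for("sp_D", sets))
```

Any Nx.tsv file outside the returned list that still has genes in that species' column points to the mismatch described above.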

I'm attaching a few files, and some of the results for both runs are in this Drive folder: https://drive.google.com/drive/folders/1CELoUvE1w87FmFNXzos1__GFHpNDN_f1?usp=sharing

I hope this helps. Let me know if any other file or information is needed.

Log_jul29.txt SpeciesTree_rooted_node_labels_jul29.txt Log_jul21.txt

davidemms commented 3 years ago

Hi Jose Luis

This should now be fixed. You can regenerate the correct results just by running with the 'from trees' option on the final results directory, the one with the added species: "-ft Results_Jul29/". Thanks again for reporting this.

All the best David

matrs commented 3 years ago

Hi David, I tried the latest code and it seems to work as expected. Thanks!