arzwa / wgd

Python package and CLI for whole-genome duplication related analyses. This package is deprecated in favor of https://github.com/heche-psb/wgd.
http://wgd.readthedocs.io/en/latest/
GNU General Public License v3.0

Problem with gene families containing 2 sequences ? #9

Closed BiodivGenomic closed 5 years ago

BiodivGenomic commented 5 years ago

Hi, I would like to know whether the smallest multi-copy families can be an issue for wgd... I can imagine phyml may run into trouble building a tree with only 2 sequences, but does wgd handle these families? I ask because my analyses seem to stall at the phyml stage of the second step (ksd), each time at a gene family containing 2 members... I would like to rule this out before troubleshooting further :-)

arzwa commented 5 years ago

Hi,

wgd ksd should already handle this case. If you run wgd with the -v debug flag, you'll see the following when a family with only two members is encountered:

2019-02-04 17:46:55: DEBUG  PhyML breaks with only two genes, do ALC instead.
2019-02-04 17:46:55: DEBUG  Distance will be in Ks units!
2019-02-04 17:46:55: DEBUG  Performing average linkage clustering on Ks values.
2019-02-04 17:46:55: DEBUG  Clustering used for weighting: 
[[0. 1. 0. 2.]]
2019-02-04 17:46:55: DEBUG                               Paralog1     Paralog2     Family        ...          AlignmentLength  AlignmentLengthStripped  AlignmentCoverage
AT5G13590.1__AT5G13590.2  AT5G13590.1  AT5G13590.2  GF_000003        ...                     3504                     3504                1.0

In other words, analyses should not stall when encountering only two family members.

Here is an example data set as a minimal working example (MWE). sample.mcl:

AT5G24210.1 AT5G24230.3 AT4G10955.1
AT1G01080.3 AT5G53680.1
AT5G13590.1 AT5G13590.2
AT1G02610.1

Make sure the gene IDs are tab-separated (sed -i 's/ /\t/g' sample.mcl is your friend after copying).
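For anyone without GNU sed at hand, the same normalization can be sketched in Python. This is only an illustration: `tab_separate` is a hypothetical helper, not part of wgd, and it assumes one family per line with gene IDs separated by arbitrary whitespace.

```python
def tab_separate(mcl_text: str) -> str:
    """Rewrite each family line so gene IDs are separated by single tabs."""
    lines = []
    for line in mcl_text.splitlines():
        genes = line.split()  # split on any run of spaces or tabs
        if genes:             # drop blank lines
            lines.append("\t".join(genes))
    return "\n".join(lines) + "\n"


# Two of the families from sample.mcl, as they look after copy-pasting.
sample = "AT5G24210.1 AT5G24230.3 AT4G10955.1\nAT1G01080.3 AT5G53680.1\n"
fixed = tab_separate(sample)
```

Writing `fixed` back to sample.mcl gives a file in the tab-separated layout described above.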

sample.fasta:

>AT1G01080.3
ATGGCGGCCTCCTGCTTCGCAATTCCCTTATCTTCTTCTTCTCGATCGTCTCACAATGCAATTCCCAAATACAAAACCCTAATCTCTTCTTCTTCTTACTCTTACTTAGAATCTCTGAAACTTCAATTCTCTTCTTCCAATTCTTTTCATCACTCTTCTCTTTCTCGTCCCTTTGTAGCTCAACCACTTCAAATCAAGGTCTCTTCTTCAGAATTATCAGTTCTCGATGAAGAAAAAGAAGAAGAAGTAGTTAAAGGAGAAGCAGAACCCAATAAAGATAGTGTCGTCTCCAAAGCAGAACCAGTAAAGAAACCGAGACCTTGCGAGCTCTACGTGTGTAATATCCCTAGAAGCTACGACATTGCTCAGCTTCTTGACATGTTTCAGCCTTTTGGAACTGTAATCTCTGTAGAGGTATCGCGAAATCCTCAGACGGGAGAGAGCCGTGGAAGCGGGTACGTGACAATGGGTTCTATAAACTCTGCCAAAATCGCCATTGCTTCTCTTGATGGAACAGAAGTAGGTGGTCGGGAAATGCGGGTTAGGTACTCTGTTGACATGAATCCAGGAACAAGAAGAAACCCTGAAGTCTTGAACTCAACTCCAAAGAAGATTCTGATGTACGAAAGCCAACACAAGGTCTATGTCGGAAATCTCCCTTGGTTCACACAGCCTGATGGTTTGAGAAACCACTTTAGCAAGTTTGGCACAATCGTAAGCACGAGAGTGTTACATGATCGTAAGACCGGGAGAAACAGAGTCTTTGCCTTTCTTTCTTTTACAAGCGGTGAAGAACGTGATGCGGCTTTATCATTCAATGGAACAGTTAAGTTCGTTATCCATAAAAAGAATCTTGCTTGA
>AT1G02610.1
ATGGGAGATGTAGTTTTGTTCATAGATGAAACATATTTGAAATCGAGTTTTAATCGCTGTAGAATCTGTCACGAAGAAGAAGCTGAGAGCTACTTTGAAGCTCCTTGTTCTTGTTCAGGAACCATCAAGTTCGCTCACAGAGATTGCATACAACGATGGTGTGATGAGAAAGGAAACACAATTTGTGAAATTTGTCTCCAGGAGTATAAACCTGGATACACCACAACTTCAAAACCATCTCGATTTATTGAAACAGCAGTCACAATCAGAGATAATTTACACATAATGAGAAGAGAAAATGGAAGAAGAAGAAGAAATAGAAGATTAGTGAATAGAGAAGAATCAGATTTTCAAGAATGCAACTCTGGTGTTGATAGAGGCGCCTCTTGTTGTAGATACTTGGCTCTCATTTTTTCGGTTATTTTGTTGATAAAGCATGCATTTGATGCGGTTTATGGAACTGAAGAGTATCCCTATACAATATTCACGGTCTTAACATTGAAGGCTATAGGCATACTATTACCAATGCTCGTTATCATTCGAACCATCACAGCTATTCAAAGGAGTCTCCGATATCAAATTCTCGAATCAGAAGAAGATACATTGAGCTCTGAAGAAGAAGATCATGGTTTGGAGGAGGAAGAGCAACAACAACATATAGCTTGA
>AT5G24210.1
ATGGGAAACTTAAAAAAATCTACACGCAGTGACGAGTTAAGCCGTTCTGGTCCCCCTCAAATTCCAAATCCTGACTGGAACAATTTGTATCACCGAACCACAGTGGCCTCATGTTTGGTGCAAGGAGTTTACGCAAAGGAAAGAGACAGGGAAAACAACCGAAATGGTTCCGAGTCATTAGCCACACCTTGGTGGAAGAGTTTCAACTTCACTTTAGATGAAAGTGAAATCCTATATGACGCATTTGACGGCTCCATATACGGTGCTGTCTTCCAAAACATGATCAATTATGAGAATACCCCGAACTCGATAGTAGTACCTCCGCGTTACGTGATTGCGTTACGGGGAACTGTCCCAAGTGATGTGAGTGATTGGATACATAACAGCCGTATTGTACTCGAGAAACTCCATGGCGGGGGTAAGCATATGCATGTCATTAGAAAAATCTATTCTTTGGTGGCCAAACACGGAAACACAGCTGTCTGGATCGCTGGACACTCTTTAGGAGCTGGCCTGGCACTACTCGCGGGAAAGGACATGGCCATGTCTGGACTCCCTGTTGAGGCTTACATCTTCAACCCACCTATCTCCTTGATTCCTCTAGAGCAGTGCGGTTACAATCACGAACTTAATTTTGTGTATCGACTCACCAGGGATCTCTTCAAAGCTGGCATAGCCAAAGTCGTAGACCTTGATGAGGGTCAAGAGGGTCCACGATATAAGAACTTAGCTTCTTGGAGACCTCATTTGTTTGTGAACCAATCTGATGTAATATGCTCAGAATATATTGGTTATTTCAATCACGTAGTCACTATGACGGAGGCGGGACTCGGTGAGATTTCGAGGTTGGCTAGTGGATACTCAGTTAGGCGTATGTTATTCGGAGACGGAGAAAATTGGTCCTCGTCTTCTACACCAGATCATCTTCATTTTCTTCCGTCGGCCTTTATGATTGTAAACAAGACTGAAGCGTCGGAGTTTTATAATAAACATGGGATTCATCAATGGTGGAATCATATGCTTAAACAATCTACAACGTTTAGTTCATACTAG
>AT5G53680.1
ATGTCTCACCACCACCAAAACTTCGATACAACATTCACAAAGATATACGTGGGGGGTTTGCCTTGGACAACAAGAAAGGAAGGCTTGATAAACTTCTTCAAACGTTTTGGTGAAATCATCCATGTGAACGTCGTTTGTGATAGAGAAACAGATCGATCACAAGGATATGGCTTCGTCACGTTTAAAGACGCTGAATCTGCAACAAGAGCTTGCAAGGATCCGAATCCGACTATTGAAGGACGAATAACTAATTGCAAACTCGCTTTCGTTGGTGCTAAAGTTAAACCTAACCAATCCCAACCTTCAAATTTGCCTCAATTATTACCTAGGTATGATCCGCAATATAATCCGCGGTATGATCCGATGTCTTACCAACAAAACCGTATGGCTAACAACACCAACAATGGTATTGGCAGCATCCAACTACAAACGTTGTTAACGGAGAATCGAGCAGCTCACAGGCTACGGGAACGCAGCCAGAGTTTCTTTCGACACCGGGATCTTCGCTAA
>AT4G10955.1
ATGATGATTAGTGAAAGAGATGATTTTAGTCTCACTGGACCATTACACTTAACATCTATAGATTGGGCTAATGAACATCATCGACGATCCGTAGCTGGATCTTTGGTTCAAGGAATCTATGTAGCTGAGCGTGACCGTCAGCTACAAAGAGAAGGTCCTGAGTTAGCTTTATCTCCAATATGGTCTGAGTTTTTCCATTTCCGCCTCATTCGTAAGTTTGTCGATGACGCGGATAACTCTATCTTCGGAGGAATCTATGAGTACAAACTGCCGCAACAGCTCTCTCAAACCGTCAAATCAATGGAATTTAGTCCACGTTTTGTGATTGCTTTCAGAGGAACGGTTACAAAAGTGGACTCCATTTCCCGTGACATCGAGCATGACATCCATGTTATTAGAAACGGGCTTCACACGACAACACGGTTTGAGATAGCTATCCAAGCAGTGAGAAACATTGTTGCTTCGGTTGGTGGTTCTAGTGTTTGGCTTGCTGGTCATTCTCTTGGTGCATCTATGGCATTACTTACCGGGAAAACCATCGCTAGAACCGGGTTTTTTCCTGAGTGTTTCGCATTCAATCCGCCGTTTTTGTCTGCCCCTATCGAAAAAATTAAGGATAAGAGGATTAAACATGGGATACGCATTGCAGGCAGTGTGATCACAGCTGGACTTGCTCTAGCCAAAAAAGCCACCCAACACTACAGCCAAAACGACCGTGCATTACCCGCACCTCCTGATCCATTTGAAGCTTTATCCGACTGGTTCCCGCGGCTGTATGTCAACCCTGGTGACCACTTATGCTCAGAGTATGTTGGTTACTTTGAGCACCGAAACAAGATGGAAGAAATCGGGATTGGGTTTGTAGAGCGGGTAGCGACGCAGCACTCGTTGGGCGGTATGCTGTTAGGAGGACAAGAGCCGGTACATCTGATTCCATCTTCGGTTTTGACGGTGAACTTAAGCTCCTCGAGAGATTTCAAACAAGCTCATGGGATTCATCAGTGGTGGAGGGAAGATAACAAGTTTGAGACTAAAGTTTACCAGTACAAATGA
>AT5G13590.1
ATGTCTGGAAGCCAAGAGCCTAGGATCAGACCATCTACATGGAGCTGCAGTGATATTCCAATCAAGAAGAGGAAGTACCTTGTTCAGCCGCAAATGGAAGAAGCTGTCTCCACTCAGATTCCACAACCTAATGAGCAAGGTGATACTAGGAGTGCTCATGCTGACGAAACTCAGAAAATGACTGGTCGAGAACCAACCTCTTCATTACCATCTGTTCCTGTGGGAATTTCTGGTAAAGGGAAGAGCATTGGGAACATAGTTTTTGACCAAACTAGAGTGAAATTTGAGAAGCCAAGTTCTCCAATTCACTCCAGTCCATTGGCAGGCTTTGACATCCCTTCTAGTTCTAACGTACTTGGCAGTTCAATCCATTTTCCTATGGGAAAGCTTCCTGTTGGTGCTGAACATGCTGGTCTTGTTGTCCCCTCAAATCAAACTCGGATGAAAGTAGAAAAAACTGTTCTTAAGACTCATGATATAGTCCGGAAGACAGGTGACAAGGAAACTCTCAGAGGAGAGTGTCAAACAGAAGCATCTTCTGGTGCTAAGACTGTTTCCTTACAGCTAAGTTGTAACACTAAAAACAATTCTCCATATTGGAAGAATGAAGAGCCTACAGAACTGAATTTGTCATTAAGCAAGGGAGTTTGTCCCGCTCATAACACAGATTCTACTTCTACCAAATCTGGCAACAGTGGCCTGAACAGAGAAAATTGGGATTTGAATACTACCATGGATGTTTGGGAAGATGCTCTAGATCGCACAAGTGGTGCATTCTTAAACAGTAACAGAAGTCTTCGTGACATAGAGAGATCAAGTTGTCGTGATACGACTGCTATTACAAAGTCTGTTTCTGAAAGACAGAAGGAAAGTGTAGGATTTAGTTCTCCTAAGGTGACGTTGATGCAGTTTGATAATCATGTTAATCCCACATGCTCACTTAGTCTAGGCCTCAGTTCATATCCTCCTATTGAGAAATCTCCTTCTCTACCAGCTACCACATCAGAGGCAAGAGCTGGGAATGTGTGTTCAGTGAACCTTAGGACTGTGAAGTCAGAAATCATTGAAGAGAGTGTTAGGCAGGCAACAGAGAGTACTCAAGTTTCTCCAATCGGGCTATCTATTAAAGGACTGAAACATGAGGGTATTGGCAGATTCAGCCAAGGAAATAGTCCCTCATTTGGCATTTTGAAGACAGTGGTTCCTATATCAATAAAGGCTGAGCCAAATACCTTCTCTCAATCAGAAGTTTTCAATAGGAAAGATGGAATGTTGAATCATCCCCATACCCCAATAATGCAATCAAATGAGATCCCTGATTTACCTACAAGTTCTACGCCATATCAGAAGGATAAATATTTACCTTGTTCAAATGGTATCAGCAATGCACCAATGCCCTTGAGTGGAATGACAATAATTCCAGGCGTTCAGAGTGATCCTGACTGTACATCAAAAGAAAATTCGGGCCAGAGTAGCAGTTTAGCTAATGGTAAATTACGCGAAGTGCTGAAACATGGTGGAGTTTACACGACTTATTCTGGTCATGGAGATCATAACCTCAATGCTTCAGGTGTGAATGTTACTTCCTTGACTGAAGAGAAAATACTAGATGATTGCAAGCCTTGTATATCGAAAGAACTCCCTTGTAATTCTCGTGGAACTGATGAACTTTCCAGAAATGATGAAGAGAAGATTACTTTACCTGGTAAGGAGCTAGAGGAACAGTTATACAGTTATGGGTTTGAATCAGATCGTGGTTATGATCTATCTAGAGTAATAAAGGAGCAAGTTGGCAAAAGAAATTTGTGCGATGACGGGAAGGTCCAAGGACCAGCTGCCGTTTTCACGGAAAGTAATGAGGTTGCACATCCTGAGTGTGGTGGTTCTGAAACTGAACAAAGGAACATTAATGTTCCATGCCATGTCCACTTTCATAATTCTAACCATGTGGAAGAAAAAGGGAGTCAACCTGCACTTCTTGGTTATACAGGTGAAACTGA
AGGCCGGATAGTTCAGGATGGTGAAGGAACGTCAGGTGTCTCTACAGTGTCAGGCGGCATTGAAAACCCTGAAATAGTAGATAACAGTAGTCCAGTTTCACTCAAGGCAGAAATGTCTACTATTGACAATGATTCTCCTATGGAGTGCAGTGACGGTAGTCAGAGTCGAATTATAAACTTAACTCAGGTTAAATCTCCAGTTAAGGCACTAGATGCTTCAGGCAGCTTTGTGCCACCCCGAATGGAAAGAGATAGATTTCATGATTTCCCACTCGAACCGCGGGAATATACTTTCAGAGGGAGTGATGAATCCTGCAAATTCTCGCGTGAGAGGTACCATGGCAGAATTATGAGAAGCCCAAGGTTAAATTTCATACCTGACAGAAGGAGATTACCTGATAACACAGAAAGCAATCTGCATGACCAGGACACAAAAAAATTTGAGTTTGATAATCATGGAAACACTCGTCGGGGTGGTGCTTTTATGAGTAATTTTCAGAGAGGGAGACGGCCTGCAAATGATGGAGTTACACCATATGCTCACTCCTTTCCGAGAAGATCCCCTAGCTTTTCATATAATAGAGGACCAACAAATAAAGAGGATACATCTGCATTTCACGGATTTAGAGATGGTGAAAAATTCACAAGGGGATTACAATGCAACAACACAGAACCACTGTTTATGAATCACCAACGTCCATATCGAGGTCGGAGTGGTTTTGCTCGAGGACGAACAAAGTTTGTAAACAACCCCAAACGAGATTTTCCTGGATTTCGTTCACGATCTCCAGTTAGATCAAGAGAAAGATCAGATGGTTCATCCTCGTCTTTCAGGAATAGATCACAGGAAGAGTTCAGTGGGCATACAGACTTTTCTCATCGAAGATCACCCTCAGGTTACAAAGTGGAGAGGATGAGCTCGCCTGACCATTCTGGTTATTCAAGAGAAATGGTTGTCAGAAGACACAATTCTCCACCTTTCTCGCATAGACCATCGAATGCTGGAAGGGGCCGGGGTTATGCAAGGGGCCGAGGTTATGTAAGGGGTCGAGGTTATGGAAGAGATGGCAACTCATTTAGGAAACCATCTGATCATGTTGTACATAGAAACCATGGAAACATGAATAACTTGGATCCTCGAGAAAGGGTTGACTATAGTGATGATTTCTTTGAAGGTCAAATTCATTCTGAACGATTTGGTGTTGATGTTAATGCTGAGAGAAGACGATTTGGTTATAGACATGATGGTACCAGCAGCTCTTTTAGACCATCTTTTAACAATGATGGTTGTGCACCTACTAATGTAGAGAATGACCCTGATGCTGTGAGGTTCCAACAAGACCCTCGTATTAAAATTGAAGAACAAGGGAGTTTAATGGAAATTGATGGAGAAAATAAGAACTCAACTGAGAATGCATCTGGAAGAACTAAGAATATGGAAGAGGAAGAAACTTCAAAGAACAGTAAAATTTGGCAACCGGATGAGCTCGGTGGTGATGGTTTTTAA
>AT5G13590.2
ATGTCTGGAAGCCAAGAGCCTAGGATCAGACCATCTACATGGAGCTGCAGTGATATTCCAATCAAGAAGAGGAAGTACCTTGTTCAGCCGCAAATGGAAGAAGCTGTCTCCACTCAGATTCCACAACCTAATGAGCAAGGTGATACTAGGAGTGCTCATGCTGACGAAACTCAGAAAATGACTGGTCGAGAACCAACCTCTTCATTACCATCTGTTCCTGTGGGAATTTCTGGTAAAGGGAAGAGCATTGGGAACATAGTTTTTGACCAAACTAGAGTGAAATTTGAGAAGCCAAGTTCTCCAATTCACTCCAGTCCATTGGCAGGCTTTGACATCCCTTCTAGTTCTAACGTACTTGGCAGTTCAATCCATTTTCCTATGGGAAAGCTTCCTGTTGGTGCTGAACATGCTGGTCTTGTTGTCCCCTCAAATCAAACTCGGATGAAAGTAGAAAAAACTGTTCTTAAGACTCATGATATAGTCCGGAAGACAGGTGACAAGGAAACTCTCAGAGGAGAGTGTCAAACAGAAGCATCTTCTGGTGCTAAGACTGTTTCCTTACAGCTAAGTTGTAACACTAAAAACAATTCTCCATATTGGAAGAATGAAGAGCCTACAGAACTGAATTTGTCATTAAGCAAGGGAGTTTGTCCCGCTCATAACACAGATTCTACTTCTACCAAATCTGGCAACAGTGGCCTGAACAGAGAAAATTGGGATTTGAATACTACCATGGATGTTTGGGAAGATGCTCTAGATCGCACAAGTGGTGCATTCTTAAACAGTAACAGAAGTCTTCGTGACATAGAGAGATCAAGTTGTCGTGATACGACTGCTATTACAAAGTCTGTTTCTGAAAGACAGAAGGAAAGTGTAGGATTTAGTTCTCCTAAGGTGACGTTGATGCAGTTTGATAATCATGTTAATCCCACATGCTCACTTAGTCTAGGCCTCAGTTCATATCCTCCTATTGAGAAATCTCCTTCTCTACCAGCTACCACATCAGAGGCAAGAGCTGGGAATGTGTGTTCAGTGAACCTTAGGACTGTGAAGTCAGAAATCATTGAAGAGAGTGTTAGGCAGGCAACAGAGAGTACTCAAGTTTCTCCAATCGGGCTATCTATTAAAGGACTGAAACATGAGGGTATTGGCAGATTCAGCCAAGGAAATAGTCCCTCATTTGGCATTTTGAAGACAGTGGTTCCTATATCAATAAAGGCTGAGCCAAATACCTTCTCTCAATCAGAAGTTTTCAATAGGAAAGATGGAATGTTGAATCATCCCCATACCCCAATAATGCAATCAAATGAGATCCCTGATTTACCTACAAGTTCTACGCCATATCAGAAGGATAAATATTTACCTTGTTCAAATGGTATCAGCAATGCACCAATGCCCTTGAGTGGAATGACAATAATTCCAGGCGTTCAGAGTGATCCTGACTGTACATCAAAAGAAAATTCGGGCCAGAGTAGCAGTTTAGCTAATGGTAAATTACGCGAAGTGCTGAAACATGGTGGAGTTTACACGACTTATTCTGGTCATGGAGATCATAACCTCAATGCTTCAGGTGTGAATGTTACTTCCTTGACTGAAGAGAAAATACTAGATGATTGCAAGCCTTGTATATCGAAAGAACTCCCTTGTAATTCTCGTGGAACTGATGAACTTTCCAGAAATGATGAAGAGAAGATTACTTTACCTGGTAAGGAGCTAGAGGAACAGTTATACAGTTATGGGTTTGAATCAGATCGTGGTTATGATCTATCTAGAGTAATAAAGGAGCAAGTTGGCAAAAGAAATTTGTGCGATGACGGGAAGGTCCAAGGACCAGCTGCCGTTTTCACGGAAAGTAATGAGGTTGCACATCCTGAGTGTGGTGGTTCTGAAACTGAACAAAGGAACATTAATGTTCCATGCCATGTCCACTTTCATAATTCTAACCATGTGGAAGAAAAAGGGAGTCAACCTGCACTTCTTGGTTATACAGGTGAAACTGA
AGGCCGGATAGTTCAGGATGGTGAAGGAACGTCAGGTGTCTCTACAGTGTCAGGCGGCATTGAAAACCCTGAAATAGTAGATAACAGTAGTCCAGTTTCACTCAAGGCAGAAATGTCTACTATTGACAATGATTCTCCTATGGAGTGCAGTGACGGTAGTCAGAGTCGAATTATAAACTTAACTCAGGTTAAATCTCCAGTTAAGGCACTAGATGCTTCAGGCAGCTTTGTGCCACCCCGAATGGAAAGAGATAGATTTCATGATTTCCCACTCGAACCGCGGGAATATACTTTCAGAGGGAGTGATGAATCCTGCAAATTCTCGCGTGAGAGGTACCATGGCAGAATTATGAGAAGCCCAAGGTTAAATTTCATACCTGACAGAAGGAGATTACCTGATAACACAGAAAGCAATCTGCATGACCAGGACACAAAAAAATTTGAGTTTGATAATCATGGAAACACTCGTCGGGGTGGTGCTTTTATGAGTAATTTTCAGAGAGGGAGACGGCCTGCAAATGATGGAGTTACACCATATGCTCACTCCTTTCCGAGAAGATCCCCTAGCTTTTCATATAATAGAGGACCAACAAATAAAGAGGATACATCTGCATTTCACGGATTTAGAGATGGTGAAAAATTCACAAGGGGATTACAATGCAACAACACAGAACCACTGTTTATGAATCACCAACGTCCATATCGAGGTCGGAGTGGTTTTGCTCGAGGACGAACAAAGTTTGTAAACAACCCCAAACGAGATTTTCCTGGATTTCGTTCACGATCTCCAGTTAGATCAAGAGAAAGATCAGATGGTTCATCCTCGTCTTTCAGGAATAGATCACAGGAAGAGTTCAGTGGGCATACAGACTTTTCTCATCGAAGATCACCCTCAGGTTACAAAGTGGAGAGGATGAGCTCGCCTGACCATTCTGGTTATTCAAGAGAAATGGTTGTCAGAAGACACAATTCTCCACCTTTCTCGCATAGACCATCGAATGCTGGAAGGGGCCGGGGTTATGCAAGGGGCCGAGGTTATGTAAGGGGTCGAGGTTATGGAAGAGATGGCAACTCATTTAGGAAACCATCTGATCATGTTGTACATAGAAACCATGGAAACATGAATAACTTGGATCCTCGAGAAAGGGTTGACTATAGTGATGATTTCTTTGAAGGTCAAATTCATTCTGAACGATTTGGTGTTGATGTTAATGCTGAGAGAAGACGATTTGGTTATAGACATGATGGTACCAGCAGCTCTTTTAGACCATCTTTTAACAATGATGGTTGTGCACCTACTAATGTAGAGAATGACCCTGATGCTGTGAGGTTCCAACAAGACCCTCGTATTAAAATTGAAGAACAAGGGAGTTTAATGGAAATTGATGGAGAAAATAAGAACTCAACTGAGAATGCATCTGGAAGAACTAAGAATATGGAAGAGGAAGAAACTTCAAAGAACAGTAAAATTTGGCAACCGGATGAGCTCGGTGGTGATGGTTTTTAA
>AT5G24230.3
ATGGAAGAAGAAGATGATGAGGTTATGGTCAGAGAGGGGCTAATGGCCTCTCAAAGAGAAATCTTCAGCATTTCTGGTCCAATCCATTTAACTTCCATTGATTGGAATAATTCTTATCATAGAACCTCGGTGGCATCATGTTTGGTACAAGCAGTGTACACATTGGAACGAGACAGACAACAAAACAGGATTGGCCTAAAGTCACAAGCCAATCATTGGTGGGAGTTTTTCAACTTCACTTTAGCCGAAACCCTAATCGACGACTCAGACGGATCTATATACGGCGCCGTTTTCGAATACAAACATTTCTTCTCCTACAATTACCATCACACCCCTCATTCGAAACCACCTCCTCGTCACGTGATTGCTTTCCGTGGCACGATCTTGAAACCGCACTCTCGGTCACGTGACCTTAAGCTCGACCTACGTTGCATCCGAGACTCTCTCCATGATAGCACTCGGTTCGTGCATGCCATTCAGGTTATTCAAAGTGCGGTGGCTAAAACTGGTAATGCAGCCGTGTGGCTCGCCGGACATTCTCTTGGAGCAGCCGTGGCTTTGCTTGCCGGGAAGATTATGACAAGGTCTGGTTTTCCTCTTGAGAGTTACTTATTCAATCCTCCTTTCTCGTCTATTCCGATAGAGAAGCTAGTGAAGAGTGAGAAGCTTAAACATGGGGTTCGATTCGCCGGAAGTCTTGTTAAAGCCGGAGTTGCCATCGCCGTTAAGGGTCGCCACCATAATAAGGGTCAAGAAGACGATTCGTTCATGAAGTTAGCATCATGGATACCATATTTGTATTTGAATCCGTTAGATACAATATGCTCAGAATACATTGGTTACTTCAAGCACAGAAACAAAATGTTTGAGATCGGAGCCGGTAAAATCGAAAGAATTGCTACGAGGAACTCACTTAGGAGTCTGTTGTCAGGAGGAGGAGGAGGAGGTTCATCTTCAGATTCTTCTTCAGAGCCTCTTCATCTTTTACCATCGGCATATATGACGATAAACGCTAGCAAATCGCCGAATTTTAAGAGAGCTCATGGGATTCATCAATGGTGGGATCCCATGTTTAATGGTGAATATGTTTTGCATCAGTTTAATAACTAA

command: wgd -v debug ksd sample.mcl sample.fasta --wm phyml -o sample.out

output:

    AlignmentCoverage   AlignmentIdentity   AlignmentLength AlignmentLengthStripped Distance    Family  Ka  Ks  Node    Omega   Paralog1    Paralog2    WeightOutliersIncluded  WeightOutliersExcluded
AT5G13590.1__AT5G13590.2    1.0 1.0 3504.0  3504.0  0.0 GF_000003   0.0 0.0 2.0 0.001   AT5G13590.1 AT5G13590.2 1.0 0.0
AT1G01080.3__AT5G53680.1    0.58537 0.38889 861.0   504.0   190.80626   GF_000002   0.9774  67.4602 2.0 0.0145  AT1G01080.3 AT5G53680.1 1.0 0.0
AT4G10955.1__AT5G24230.3    0.90237 0.577   1137.0  1026.0  1.10504 GF_000001   0.4267  2.1848  3.0 0.1953  AT5G24230.3 AT4G10955.1 1.0 1.0
AT5G24210.1__AT5G24230.3    0.91557 0.6196  1137.0  1041.0  1.31704 GF_000001   0.422   1.1156  4.0 0.3782  AT5G24210.1 AT5G24230.3 0.5 0.5
AT4G10955.1__AT5G24210.1    0.88127 0.48004 1137.0  1002.0  2.04701 GF_000001   0.6653  4.4183  4.0 0.1506  AT5G24210.1 AT4G10955.1 0.5 0.5

Can you verify if this works for you?

Note: the ALC (average linkage clustering) is of course not necessary for a family of two genes, but it's a small hack to get the results into the same data structure with no overhead.
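One way to read the WeightOutliersIncluded column in the output above is that each pair is weighted by one over the number of pairs that share its node within the same family. The sketch below reproduces that column from the Family and Node columns under that assumption; `node_weights` is a hypothetical helper written for illustration and is not taken from the wgd source.

```python
from collections import Counter

# (pair, family, node) copied from the example output above.
rows = [
    ("AT5G13590.1__AT5G13590.2", "GF_000003", 2),
    ("AT1G01080.3__AT5G53680.1", "GF_000002", 2),
    ("AT4G10955.1__AT5G24230.3", "GF_000001", 3),
    ("AT5G24210.1__AT5G24230.3", "GF_000001", 4),
    ("AT4G10955.1__AT5G24210.1", "GF_000001", 4),
]


def node_weights(rows):
    """Assumed rule: weight = 1 / (number of pairs at the same node in the family)."""
    counts = Counter((family, node) for _, family, node in rows)
    return {pair: 1.0 / counts[(family, node)] for pair, family, node in rows}


weights = node_weights(rows)
```

Under this reading, a two-member family always has a single pair at a single node, so its weight is 1.0 regardless of how the tree or clustering was obtained.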

BiodivGenomic commented 5 years ago

Hi, I got an output, but shorter than what you showed...

sample.fasta.ks.txt

I continued to explore my issue, and it seems to be related more to families with 3 members where one is filtered out, leaving 2 sequences to be handled by PhyML (and thus after the ALC check would have been applied)...

For example, below is the part of the debug output concerning a family with 3 members: results_3genes_family.txt

As far as I remember, it was not this family that failed in the previous analysis (maybe the first step does not use exactly the same order each time?), so it seems difficult to just remove the problematic families...

arzwa commented 5 years ago

Whoops, you are correct; my output was from a different test. I edited it.

I see you are running the analysis with the --pairwise flag. That's not really necessary and I would suggest not using it (it's slower and does not matter for the results; it's there for compatibility with earlier analyses). I think the bug you found might be avoided when using the family-wise approach (without specifying the --pairwise flag). Can you test whether it works without the --pairwise flag?

BiodivGenomic commented 5 years ago

OK, I'm testing without the --pairwise flag.

BiodivGenomic commented 5 years ago

Hi, dropping the --pairwise flag doesn't seem to solve the issue... I ran the ksd step on different genomes with different characteristics (number of genes, the way the genes are named, etc.), and all 3 runs stalled, each at a family with 3 genes (using multiprocessing, so maybe the same applies as in the other issue...). Two are on my computer and the third is on a remote cluster, so I don't think it's computer-specific. All 3 stalled at the phyml step ("DEBUG Running PhyML: phyml -i ..."). The analysis gets faster as it reads through the mcl file (which makes sense, as the families get smaller and smaller), so I don't think it's "normal" for the phyml step for one family to run for more than 2 days... I stopped one and ran it again, and it finished (I have already restarted the others several times, without improvement).
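A stall like this can be turned into an explicit failure by running the external program under a time limit. The following is a minimal sketch using Python's standard `subprocess.run` with its `timeout` argument; `run_with_timeout` is a hypothetical helper, the commands are stand-ins rather than real PhyML invocations, and wgd itself may launch PhyML differently.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, returncode); finished is False on timeout."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child process is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False, None
    return True, proc.returncode


# Stand-ins for an external tool: one that hangs and one that exits at once.
stalled_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
quick_cmd = [sys.executable, "-c", "pass"]

stalled_result = run_with_timeout(stalled_cmd, timeout_s=1)
quick_result = run_with_timeout(quick_cmd, timeout_s=30)
```

Logging the family ID whenever `finished` is False would give a list of families to rerun or to share as a reproducer, instead of a job that hangs for days.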

arzwa commented 5 years ago

No, that definitely does not sound good, and I haven't had these problems myself before. I will need a small example data set to try to reproduce this issue and do the debugging. Could you provide me with a small set of families + sequences where you have this issue? Thanks
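A small reproducer of this kind could be assembled by keeping only the families of a given size and the sequences they reference. The sketch below assumes the mcl file has one whitespace-separated family per line and that FASTA headers start with the gene ID; `read_fasta` and `subset` are hypothetical helpers, and the demo data are toy placeholders, not real sequences.

```python
def read_fasta(text):
    """Parse FASTA text into {id: sequence}, joining wrapped sequence lines."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first token of the header
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {gene: "".join(parts) for gene, parts in seqs.items()}


def subset(mcl_text, fasta_text, sizes=(2, 3)):
    """Return (mcl, fasta) text restricted to families whose size is in sizes."""
    seqs = read_fasta(fasta_text)
    families = [line.split() for line in mcl_text.splitlines() if line.strip()]
    kept = [f for f in families if len(f) in sizes and all(g in seqs for g in f)]
    mcl_out = "".join("\t".join(f) + "\n" for f in kept)
    fasta_out = "".join(f">{g}\n{seqs[g]}\n" for f in kept for g in f)
    return mcl_out, fasta_out


# Toy input: one family of 4, one of 3, one of 2 and a singleton.
demo_mcl = "g1\tg2\tg3\tg4\ng5\tg6\tg7\ng8\tg9\ng10\n"
demo_fasta = (
    ">g5\nATGAAA\nTAG\n>g6\nATGCCCTAG\n>g7\nATGGGGTAG\n"
    ">g8\nATGTTTTAG\n>g9\nATGACGTAG\n"
)
small_mcl, small_fasta = subset(demo_mcl, demo_fasta)
```

The two returned strings can be written to files and passed to wgd ksd in the same way as sample.mcl and sample.fasta above.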

BiodivGenomic commented 5 years ago

I checked the analysis that is running on my computer, and the last family fully processed (with fasta, fasta.msa and .ks files) is the last family with 3 members in the mcl file. Also, I'm now rerunning them using fasttree instead of phyml, and the first one completed...

arzwa commented 5 years ago

Hi, may I ask if you still have issues related to the above? If not I'd like to close this, if you do have issues I'd like to solve them.

arzwa commented 5 years ago

Closed due to no follow-up.