arzwa / wgd

Python package and CLI for whole-genome duplication related analyses. This package is deprecated in favor of https://github.com/heche-psb/wgd.
http://wgd.readthedocs.io/en/latest/
GNU General Public License v3.0

Issue during MSA (second step) #8

Closed. BiodivGenomic closed this issue 5 years ago

BiodivGenomic commented 5 years ago

Hi, I ran the first step (getting the paranome) without trouble, but the second step failed on the cluster I use. With this command:

wgd --verbosity debug ksd -o ./ -n 12 --pairwise --wm fasttree peptides_cleaned.fa.blast.tsv.mcl ../peptides_cleaned.fa

I got this error:

2019-01-31 19:38:30: INFO   Performing analysis on gene family GF_003161
2019-01-31 19:38:30: DEBUG  Performing MSA (muscle) for GF_003161
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/python-3.6.5/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'evm.model.scaf_173.199'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/python-3.6.5/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 350, in __call__
    return self.func(*args, **kwargs)
  File "/usr/local/python-3.6.5/lib/python3.6/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/usr/local/python-3.6.5/lib/python3.6/site-packages/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/usr/local/wgd-1.0/venv/lib/python3.6/site-packages/wgd/ks_distribution.py", line 425, in analyse_family_pairwise
    'Ks': results_dict['Ks'][g1][g2],
  File "/usr/local/python-3.6.5/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/python-3.6.5/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/python-3.6.5/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/python-3.6.5/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
    loc = self.items.get_loc(item)
  File "/usr/local/python-3.6.5/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'evm.model.scaf_173.199'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/python-3.6.5.1/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/python-3.6.5/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 359, in __call__
    raise TransportableException(text, e_type)
joblib.my_exceptions.TransportableException: TransportableException
___________________________________________________________________________
KeyError                                           Thu Jan 31 19:23:15 2019
PID: 31068              Python 3.6.5: /usr/local/wgd-1.0/venv/bin/python3.6
...........................................................................
/usr/local/python-3.6.5/lib/python3.6/site-packages/joblib/parallel.py in __call__(self=<joblib.parallel.BatchedCalls object>)
    126     def __init__(self, iterator_slice):
    127         self.items = list(iterator_slice)
    128         self._size = len(self.items)
    129 
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
        self.items = [(<function analyse_family_pairwise>, ('GF_000136', {'evm.model.scaf_117.352': 'MAAASLIFSPCLLLLLFLVSSPSLSARVALLSEEQRQHKQPPLFTHVC...SFSINACKALSMVTESAYVVLPWGTHSISVGDGEKAVTFPVHVSYEFSA', 'evm.model.scaf_147.358': 'MVPAALFSTQLRSALTPSPACLIPGTTNNGKMTSFAFVLLLLLFCIAP...VISPCEHLSKTEEDGSKVLEGGSHFLVVGDEEYQVNIVSSKRNEWSSLV', 'evm.model.scaf_173.199': 'MGGRLPITWYYNDYVKHIPMTSIQLRPDLANKYPARTYKFFDGSVVYP...APTAKFSLRSLRGIEPCHRRCLHLSASTLSLNDGALTSPFSLCFKRLKK', 'evm.model.scaf_173.201': 'MQITGPPKIFRGKSIPCRYSKPVDACGKYTKVNFMRGCNGAKCPHPSWVANAVKTSQTLDATISHVGLDLSTEAEGLDRTDLLLPGF', 'evm.model.scaf_173.202': 'MVNSHRFLGNAPEDAVRQVLSAGLDLDCGDYYRNYALITTNKGNVDNA...DSLGKEDFCSDEHMELATEAARQGTVLLKNDHNTLPLDACNLKSVAAVS', 'evm.model.scaf_173.203': 'MSKTSQLYGLRERGPGLDLDCGDFYPKYLKSAVEQGKVREGDIDKALI...FSINACEALGLVTETAYKVLPWGRHTISIGDGSGAITFPLQVNFKFFSN', 'evm.model.scaf_212.169': 'MAALDLLLLVCISLLIISTSSRTIQPVRRSYPRRGIQTLGMNATNFNH...SKTIPYDLNICESLKVVTGSAYTVVPYGQHTITVGDGDGSISFSFEVKF', 'evm.model.scaf_226.59': 'MAAAVRLISLVLLFSLLSILFSQAQSRPAFACGGGSARTFPFCQTSLP...ARVTVGLDVCKHLSFVDEQGIRRIPIGDHSLHVGDLTHSLSLHVEGTGI', 'evm.model.scaf_244.127': 'MSKTPLLSVLLLLFLLVSPSASHPHRPFACLGPESSLPFCNAALSIPD...VSNVGPGTRFGGKFPAATSFPQVILTAAAFNASLWEEIGRVRSLLSSRR', 'evm.model.scaf_244.128': 'MYNKGWGGLTYWSPNVNIFRDPRWGRGQETPGEDPVVAGKYAASYVRG...QTRVAVNIHVCKHLSVVDTSGIRRIPIGDHSLQVGDLTHSISLLGETLP', ...}, {'evm.model.scaf_1.1': 'ATGTCTTTGAATTTTAATAATTCCTCCAGCACAAAGGATCACTTCCAG...ATATGGATTGCAAGCTGCAGTGGGTGCCATGCTGTCTCCATTATTGTGA', 'evm.model.scaf_1.14': 'ATGCTAGCGAACACCTCGATAAGGGTGCTGTACAAATCTCCCTTCAGA...AGAAGAAGATGAAGAAGGTGAGGGAGATAGAGTAGAAACAAACAAGTAG', 'evm.model.scaf_1.15': 'ATGGACGCTGCTGCGTCTCTGCTTCGCCCATATTCCATCCTCCGGCTG...CTACTGCTTCTTCCAGGGTGCTGGCGACTCGCTAAAATATTCTCGTTGA', 'evm.model.scaf_1.16': 'ATGGTGGATGGCAAAAGAATCGCCCTCGATTTTTGGGGGTTCATCTGG...AGCAGAGGATGAAGCATTGAAAGAAGACTGGCAGAGGATGAAGCATTGA', 'evm.model.scaf_1.19.1': 
'ATGTCCGTCAGCGAAATCGCCTGCACCTACGCCGCCCTGCTTCTATAC...TCCTGGTTTGTTGTATTGTCGTGCAATTAACGATGAAACTTGTCGTTGA', 'evm.model.scaf_1.2': 'ATGATACCAAATATCTTGAGCACATGCCTTCTGTTGGTTCATCGACAA...CACAACAGCAGAGGATGAAGCATTGCAAGCTGTGGATGGGGAGCGCTAG', 'evm.model.scaf_1.22': 'ATGAGAGCCGGGACTAGGTCAATTCGGCTTCAATCTTCCATTCAGGGA...AGGAGAAGAATACTTATTTCCAATGGAAAACCATTTTACAGAAGTATGA', 'evm.model.scaf_1.23': 'ATGGATTTGCTGCAAAATTACTCTGCAAAGAGCGATTCCTCTGATGGC...GAGCAAAGTTGCAACGTGTGGTTGGGACGGTTTGATCAAATACTGGTAA', 'evm.model.scaf_1.27': 'ATGGCAGCTGTAACTGCCTCCCTCTTTGTATCAAGAAGCAACAATTTG...TGGACATGAACTTGCTCCACTTTCAGTGGATAATGTGGCTAATCCTTGA', 'evm.model.scaf_1.28': 'ATGGCAACTGTAAGCGCCTCCCACTTTGTATCAAGAAGTTTCCATTTC...TCATTTTCTCCTTTTCTGTCAAAAATGTCGTGATATTTTTCTTGTATAA', ...}, '/tmp/hinsinger/data_comparative_genomics/paranome/ks_tmp.3707af5a12e8b8', 'codeml', False, 1, 100, 'fasttree', 'muscle', '/tmp/hinsinger/data_comparative_genomics/paranome'), {})]
    132 
    133     def __len__(self):
    134         return self._size
    135 

Other partial alignments and more messages followed. How can I fix this? It doesn't seem to be related to the cluster settings, but rather to the multiprocessing in wgd. Thanks a lot in advance!

arzwa commented 5 years ago

Hi,

I (and others) have never had problems in the MSA step before, so this is a bit puzzling, and it's hard to tell from this output exactly where the problem is. Based on the KeyError: 'evm.model.scaf_173.199', my first guess would be that the gene evm.model.scaf_173.199 is missing from your FASTA file. That's probably not the case, though, since I assume you used the same file as the one you used for wgd mcl. Another possibility is that the translation failed for some reason. What does the sequence of evm.model.scaf_173.199 look like? Could you post all the sequences for GF_003161 so I can have a look (just the sequences for all genes on line 3161 of the mcl output file)?
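The first guess above (a gene listed in the MCL output but absent from the FASTA file) can be checked independently of wgd. The following is only a minimal sketch, not part of wgd: the function names are hypothetical, and it assumes that each line of the MCL file is one gene family with whitespace-separated gene IDs and that a FASTA ID is the first token after `>`.

```python
def read_fasta_ids(lines):
    """Collect sequence IDs: the first whitespace-delimited token after '>'."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def missing_genes(mcl_lines, fasta_ids):
    """Map 1-based MCL line number -> gene IDs that are absent from the FASTA IDs."""
    missing = {}
    for number, line in enumerate(mcl_lines, start=1):
        absent = [gene for gene in line.split() if gene not in fasta_ids]
        if absent:
            missing[number] = absent
    return missing


def check_files(mcl_path, fasta_path):
    """Hypothetical convenience wrapper: run the check on two files on disk."""
    with open(fasta_path) as fasta:
        fasta_ids = read_fasta_ids(fasta)
    with open(mcl_path) as mcl:
        return missing_genes(mcl, fasta_ids)
```

Calling `check_files("peptides_cleaned.fa.blast.tsv.mcl", "../peptides_cleaned.fa")` would return an empty dictionary if every gene in the MCL file has a FASTA record under these assumptions. Note that the file passed to wgd ksd may differ from the one used for the MCL step, so the check is only meaningful against the file actually given to ksd.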

Best, Arthur

BiodivGenomic commented 5 years ago

Hi, I attached the sequences of gene family GF_003161. By the way, it was on line 122 of the MCL output (peptides_cleaned.fa.blast.tsv.mcl). I don't know if that matters?

seqs_GF_003161.txt

arzwa commented 5 years ago

Hi, there seems to be no issue with this family. The traceback when wgd crashes in a parallel part can be very long, and it can be hard to find the relevant section. Could you send me the full traceback? If I still cannot find the bug or data issue, I will probably need (a part of) your data set to have a closer look.

arzwa commented 5 years ago

BTW, I got confused: since wgd runs in parallel, the line 2019-01-31 19:38:30: DEBUG Performing MSA (muscle) for GF_003161 probably has nothing to do with the actual error that caused wgd to crash. The error probably happened in a gene family that wgd started to analyse a while before it came to GF_003161. I guess the actual problem is with the family on line 122, which should be GF_000122.
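Because the last DEBUG line does not identify the failing family, it may be easier to look up which MCL line contains the gene named in the KeyError. The sketch below is not wgd code: `families_containing` is a hypothetical helper, and the `GF_` label with a six-digit, 1-based line number is an assumption based on the family names that appear in this thread. The call arguments printed in the traceback name GF_000136 and include evm.model.scaf_173.199, so the lookup may also help reconcile that with line 122.

```python
def families_containing(mcl_lines, gene_id):
    """Return (line number, assumed family label) for each MCL line listing gene_id.

    Line numbers are 1-based; the GF_ label format is an assumption.
    """
    hits = []
    for number, line in enumerate(mcl_lines, start=1):
        # Split on whitespace so a gene ID is matched exactly, not as a substring.
        if gene_id in line.split():
            hits.append((number, "GF_{:06d}".format(number)))
    return hits


def families_in_file(mcl_path, gene_id):
    """Hypothetical wrapper: run the lookup on an MCL output file on disk."""
    with open(mcl_path) as handle:
        return families_containing(handle, gene_id)
```

For this issue the call would be `families_in_file("peptides_cleaned.fa.blast.tsv.mcl", "evm.model.scaf_173.199")`. Exact token matching matters here because IDs such as evm.model.scaf_1.1 are prefixes of others such as evm.model.scaf_1.14.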