TheBrownLab / PhyloFisher

PhyloFisher is a software package written in Python3 that can be used for the creation, analysis, and visualization of phylogenomic datasets that consist of eukaryotic protein sequences.
MIT License

metadata not defined in forest_local.py #120

Closed sipesk closed 4 months ago

sipesk commented 5 months ago

Hello,

PhyloFisher is great and was running smoothly until I brought the sgt_construct_out.tar.gz to my local machine.

I downloaded forest_local.py and now get an error involving metadata. I've checked the .tar.gz to make sure that metadata.tsv is there and contains text.

$ python3 forest_local.py -i sgt_constructor_out_Apr.24.2024-local.tar.gz -t 10

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.10/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Cellar/python@3.10/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/au706677/Documents/AU/DeepPurple/Cryobio/Leftovers/EUKBINS/phylofisher/forest_local.py", line 197, in suspicious_clades
    groups.add(metadata[org]['Higher Taxonomy'])
NameError: name 'metadata' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/au706677/Documents/AU/DeepPurple/Cryobio/Leftovers/EUKBINS/phylofisher/forest_local.py", line 684, in <module>
    suspicious = parallel_susp_clades(trees)
  File "/Users/au706677/Documents/AU/DeepPurple/Cryobio/Leftovers/EUKBINS/phylofisher/forest_local.py", line 503, in parallel_susp_clades
    suspicious = list(pool.map(suspicious_clades, trees))
  File "/usr/local/Cellar/python@3.10/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/Cellar/python@3.10/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
NameError: name 'metadata' is not defined

robert-ervin-jones commented 5 months ago

Hi @sipesk,

Can you try re-running without the -t option? There may be an issue with the parallelization.

Let me know if that works or not.

Thanks for using PhyloFisher!

Best, Robert

sipesk commented 5 months ago

No dice. I tried both python and python3 as well.

(fisher) au706677@d46989 phylofisher % python3 forest_local.py -i sgt_constructor_out_Apr.24.2024-local.tar.gz
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.10/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Cellar/python@3.10/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/au706677/Documents/AU/DeepPurple/Cryobio/Leftovers/EUKBINS/phylofisher/forest_local.py", line 197, in suspicious_clades
    groups.add(metadata[org]['Higher Taxonomy'])
NameError: name 'metadata' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/au706677/Documents/AU/DeepPurple/Cryobio/Leftovers/EUKBINS/phylofisher/forest_local.py", line 684, in <module>
    suspicious = parallel_susp_clades(trees)
  File "/Users/au706677/Documents/AU/DeepPurple/Cryobio/Leftovers/EUKBINS/phylofisher/forest_local.py", line 503, in parallel_susp_clades
    suspicious = list(pool.map(suspicious_clades, trees))
  File "/usr/local/Cellar/python@3.10/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/Cellar/python@3.10/3.10.14/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
NameError: name 'metadata' is not defined

The directory contains metadata.tsv, and it is a non-empty file.

[Screenshot 2024-05-01 at 12 16 02]

shuiyujinlan commented 5 months ago

Hi @robert-ervin-jones. I have the same problem in my test with the original "metadata.tsv". How can I bypass multiprocessing? Also, my test on a remote server produced no results (except for the empty directory "forest_out_M.D.Y" itself).

The error output from the local test is below:

python forest_local.py -i sgt_constructor_out_Apr.28.2024-local.tar.gz
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Program Files\Python38\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Program Files\Python38\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "D:\2024\05\forest_local.py", line 197, in suspicious_clades
    groups.add(metadata[org]['Higher Taxonomy'])
NameError: name 'metadata' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "forest_local.py", line 684, in <module>
    suspicious = parallel_susp_clades(trees)
  File "forest_local.py", line 503, in parallel_susp_clades
    suspicious = list(pool.map(suspicious_clades, trees))
  File "C:\Program Files\Python38\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Program Files\Python38\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
NameError: name 'metadata' is not defined

After I added "metadata = {}" at line 20, the error changed to:

>python forest_local.py -i sgt_constructor_out_Apr.28.2024-local.tar.gz
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Program Files\Python38\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Program Files\Python38\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "D:\2024\05\forest_local.py", line 198, in suspicious_clades
    groups.add(metadata[org]['Higher Taxonomy'])
KeyError: 'Tisolute'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "forest_local.py", line 671, in <module>
    suspicious = parallel_susp_clades(trees)
  File "forest_local.py", line 504, in parallel_susp_clades
    suspicious = list(pool.map(suspicious_clades, trees))
  File "C:\Program Files\Python38\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Program Files\Python38\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
KeyError: 'Tisolute'

Then I commented out the function "def parallel_susp_clades(trees)" and changed "suspicious = parallel_susp_clades(trees)" to "suspicious = suspicious_clades(trees)". The error changed to:

>python forest_local.py -i sgt_constructor_out_Apr.28.2024-local.tar.gz
Traceback (most recent call last):
  File "forest_local.py", line 672, in <module>
    suspicious = suspicious_clades(trees)
  File "forest_local.py", line 175, in suspicious_clades
    t = Tree(tree)
  File "C:\Program Files\Python38\lib\site-packages\ete3\coretype\tree.py", line 212, in __init__
    read_newick(newick, root_node = self, format=format,
  File "C:\Program Files\Python38\lib\site-packages\ete3\parser\newick.py", line 269, in read_newick
    raise NewickError("'newick' argument must be either a filename or a newick string.")
ete3.parser.newick.NewickError: 'newick' argument must be either a filename or a newick string.
You may want to check other newick loading flags like 'format' or 'quoted_node_names'.

Solved in a silly way: I replaced the multiprocessing call with a simple "for loop", and the results came out smoothly within 1 min. To do this, change:

    if not args.backpropagate:
        suspicious = parallel_susp_clades(trees)

to

    suspicious = []

    if not args.backpropagate:
#        suspicious = parallel_susp_clades(trees)
        for tree in trees:
            suspicious.append(suspicious_clades(tree))
        print(suspicious)

You will see the list of suspicious genes for the corresponding tree printed in your terminal. The result files are shown below:

[image: result files]

Hope this helps! @robert-ervin-jones @sipesk

robert-ervin-jones commented 5 months ago

Hi @shuiyujinlan,

Would it be possible for you to open a PR with your proposed code changes?

Best, Robert

maggielawton commented 5 months ago

Had the same issue, and the fix by shuiyujinlan worked for me as well. Thanks!

shuiyujinlan commented 5 months ago

> Hi @shuiyujinlan,
>
> Would it be possible for you to open a PR with your proposed code changes?
>
> Best, Robert

Sure. I just opened a PR. I hope you'll find some clues in my reply and description to fix it more gracefully (e.g. while retaining the multiprocessing function).
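
For anyone who wants to keep the parallel step, here is a minimal sketch of one common pattern: pass `metadata` to the workers explicitly instead of relying on a global. It rests on an assumption this thread does not confirm, namely that `metadata` is only assigned in the main script block. With the `spawn` start method (the default on Windows, and on macOS since Python 3.8), worker processes do not run that block, which would explain the NameError. The function names are taken from the traceback, but the bodies, the `threads` parameter, and the toy data are hypothetical, and a "tree" here is just a list of organism names rather than a Newick file.

```python
import multiprocessing
from functools import partial


def suspicious_clades(tree, metadata):
    """Hypothetical stand-in: collect the higher-taxonomy groups seen in one tree.

    In this sketch `tree` is a plain list of organism names, not a Newick file.
    """
    groups = set()
    for org in tree:
        # Same lookup as in the traceback, but on an explicit argument
        # rather than a module-level global.
        groups.add(metadata[org]['Higher Taxonomy'])
    return sorted(groups)


def parallel_susp_clades(trees, metadata, threads=2):
    """Map suspicious_clades over trees, shipping metadata to every worker."""
    # partial() binds metadata so that pool.map only has to supply the tree.
    worker = partial(suspicious_clades, metadata=metadata)
    with multiprocessing.Pool(threads) as pool:
        return pool.map(worker, trees)


if __name__ == '__main__':
    # Toy data; the organism and group names are made up.
    metadata = {
        'OrgA': {'Higher Taxonomy': 'GroupX'},
        'OrgB': {'Higher Taxonomy': 'GroupY'},
    }
    trees = [['OrgA', 'OrgB'], ['OrgB']]
    print(parallel_susp_clades(trees, metadata))
```

A `functools.partial` object can be pickled as long as the wrapped function is defined at module level, so this pattern should behave the same under `fork` and `spawn`. An alternative would be to give `multiprocessing.Pool` an `initializer` that sets the global inside each worker.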