Robaina / MetaTag

metaTag: functional and taxonomical annotation of metagenomes through phylogenetic tree placement
https://robaina.github.io/MetaTag/
Apache License 2.0

labelplacement error in gappa #66

Closed gecko1990 closed 2 years ago

gecko1990 commented 2 years ago

Hi, I am trying to run labelplacements.py on my Amt tree, but I keep getting an error related to gappa:

(traits) rlaso@elbrus:/data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/results/Arctic$ python /data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py   --jplace epa_result_modified.jplace   --labels /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/reference_data_Amt_TIGR00836/ref_amtref_database_id_dict.pickle /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/McDonald_2016/mcdonald2016_prokaryotes_processed_id_dict.pickle /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/McDonald_2016/mcdonald2016_plant_processed_id_dict.pickle /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/McDonald_2016/mcdonald2016_rh_processed_id_dict.pickle /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/reviewed_sequences/SwissProt_Amt_prokaryotes_processed_id_dict.pickle   --ref_clusters ../cluster_id.tsv   --ref_cluster_scores ../cluster_score.tsv   --prefix arctic
                                              ....      ....
                                             '' '||.   .||'
                                                  ||  ||
                                                  '|.|'
     ...'   ....   ... ...  ... ...   ....        .|'|.
    |  ||  '' .||   ||'  ||  ||'  || '' .||      .|'  ||
     |''   .|' ||   ||    |  ||    | .|' ||     .|'|.  ||
    '....  '|..'|'. ||...'   ||...'  '|..'|.    '||'    ||:.
    '....'          ||       ||
                   ''''     ''''   v0.7.1 (c) 2017-2021
                                   by Lucas Czech and Pierre Barbera

Invocation:                        gappa examine assign --jplace-path epa_result_modified.jplace --taxon-file temp_oxulnwpxnw --out-dir
                                   /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/results/Arctic --file-prefix arctic --allow-file-overwriting
                                   --per-query-results --best-hit
Command:                           gappa examine assign

Input:
  --jplace-path                    epa_result_modified.jplace
  --taxon-file                     temp_oxulnwpxnw
  --root-outgroup
  --taxonomy
  --ranks-string                   superkingdom|phylum|class|order|family|genus|species

Settings:
  --sub-taxopath
  --max-level                      0
  --distribution-ratio             -1
  --consensus-thresh               1
  --resolve-missing-paths          false

Output:
  --out-dir                        /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/results/Arctic
  --file-prefix                    arctic
  --file-suffix
  --cami                           false
  --sample-id
  --krona                          false
  --sativa                         false
  --per-query-results              true
  --best-hit                       true

Global Options:
  --allow-file-overwriting         true
  --verbose                        false
  --threads                        52
  --log-file

Run the following command to get the references that need to be cited:
`gappa tools citation Czech2020-genesis-and-gappa`

Started 2022-03-29 21:09:28

Found 1 jplace file
Running the assignment
Not all leafs in the reference tree were taxonomically labelled!(1000 / 1164)
Please check tree leaf label and taxon file taxa name congruency!
Segmentation fault (core dumped)
Traceback (most recent call last):
  File "/data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py", line 115, in <module>
    main()
  File "/data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py", line 100, in main
    assignLabelsToPlacements(
  File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/placement.py", line 413, in assignLabelsToPlacements
    parseGappaAssignTable(
  File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/placement.py", line 298, in parseGappaAssignTable
    table = pd.read_csv(input_table, sep='\t')
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in __init__
    self._reader = parsers.TextReader(self.handles.handle, **kwds)
  File "pandas/_libs/parsers.pyx", line 549, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

The corresponding cluster files are here: cluster_id.txt cluster_score.txt

I thought the mistake might be that I hadn't provided numbers in cluster_score, but after replacing the strings with numbers the same error appears. Reading the error more carefully, I think it is related to pandas, but I am not sure what exactly the issue is.
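In the log above, gappa ends with a segmentation fault before pandas raises `EmptyDataError`, so the pandas error looks like a downstream symptom of a missing or empty gappa output table rather than a problem in pandas itself. A minimal sketch of a guard that would surface this earlier is shown here; `ensure_nonempty` is a hypothetical helper and is not part of MetaTag.

```python
import os


def ensure_nonempty(path):
    """Raise a descriptive error if `path` is missing or has zero bytes.

    Hypothetical helper (not part of MetaTag): meant to be called on the
    table written by an external tool before handing it to a parser.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{path} does not exist; did the upstream command run?"
        )
    if os.path.getsize(path) == 0:
        raise ValueError(
            f"{path} is empty; the upstream command probably crashed "
            "before writing any results"
        )
    return path
```

Calling something like `pd.read_csv(ensure_nonempty(input_table), sep='\t')` would then fail with a message that points at the external tool instead of at the parser.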

Robaina commented 2 years ago

It seems like your files don't have the correct format. I see empty "cluster IDs" in some rows. Also, some rows contain long reference names with assigned taxonomy, whereas the majority contain the short name (ref_). In cluster_score, the "score" is meant to be a number, not a string. Also, the cluster score file is optional.
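The format problems described above (empty fields, non-numeric scores) can be caught before running the placement labelling. The following is a rough sketch only: it assumes each file has exactly two tab-separated columns and no header row, which may not match the exact layout MetaTag expects, and `check_cluster_rows` is a hypothetical name.

```python
def check_cluster_rows(rows, numeric_second_column=False):
    """Return (row_number, problem) tuples for two-column TSV rows.

    Assumed layout (not confirmed by MetaTag docs): two tab-separated
    fields per row, no header. Set numeric_second_column=True for a
    score file whose second field should parse as a number.
    """
    problems = []
    for number, line in enumerate(rows, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            problems.append((number, "expected 2 tab-separated fields"))
            continue
        key, value = (field.strip() for field in fields)
        if not key or not value:
            problems.append((number, "empty field"))
        elif numeric_second_column:
            try:
                float(value)
            except ValueError:
                problems.append((number, f"non-numeric score: {value!r}"))
    return problems
```

For a file on disk, passing the open file handle works, since iterating over it yields lines: `check_cluster_rows(open("cluster_score.tsv"), numeric_second_column=True)`.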

Robaina commented 2 years ago

For the record:

It seems like the error comes from some reference sequences being labelled with integers: gappa examine assign cannot tell the difference between these sequences and the bootstrap values of internal nodes (which act as a default label in newick).
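One way to check a reference tree for this situation is to list leaf labels that consist only of digits. The sketch below uses a simple regular expression on the newick string rather than a tree library; it assumes unquoted labels and no bracketed comments, and `integer_leaf_labels` is a hypothetical helper, not a MetaTag or gappa function.

```python
import re

# In newick, a leaf label directly follows "(" or ",", whereas an
# internal node label (e.g. a bootstrap value) follows ")".
LEAF_LABEL = re.compile(r"[(,]\s*([^(),:;\s]+)")


def integer_leaf_labels(newick):
    """Return leaf labels made up only of ASCII digits.

    Assumes unquoted labels and no [comments] in the newick string.
    """
    return [
        label
        for label in LEAF_LABEL.findall(newick)
        if re.fullmatch(r"[0-9]+", label)
    ]
```

For example, in `((1:0.1,ref_2:0.2)95:0.3,ref_3:0.4);` the leaf `1` is reported, while the internal support value `95` is not. Prefixing such leaves with a non-numeric string in both the tree and the label files is one possible workaround.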

Opening issue #68 to solve this...

Robaina commented 2 years ago

Hi, closing this issue since the source of the error was identified and fixed in issue #68.