KoslickiLab / YACHT

A mathematically characterized hypothesis test for organism presence/absence in a metagenome
MIT License

Error in yacht convert #114

Closed: OliverBryan closed this issue 6 months ago

OliverBryan commented 6 months ago

Working with data from https://frl.publisso.de/data/frl:6425521/marine/short_read/marmgCAMI2_sample_0_reads.tar.gz and a trained version of the GTDB database, I ran into an error when using yacht convert to convert the YACHT output into the CAMI format. I ran yacht convert --yacht_output 'result.xlsx' --sheet_name 'min_coverage1.0' --genome_to_taxid 'genome_to_taxid.tsv' --mode 'cami' --sample_name 'MySample' --outfile_prefix 'cami_result' --outdir ./ and got the following error:

(yacht_env) oliverbryan@DESKTOP-7KPRH50:~/YACHT/testing$ yacht convert --yacht_output 'result.xlsx' --sheet_name 'min_coverage1.0' --genome_to_taxid 'genome_to_taxid.tsv' --mode 'cami' --sample_name 'MySample' --outfile_prefix 'cami_result' --outdir ./
Traceback (most recent call last):
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/scope.py", line 231, in resolve
    return self.resolvers[key]
           ~~~~~~~~~~~~~~^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/collections/__init__.py", line 1014, in __getitem__
    return self.__missing__(key)            # support subclasses that define __missing__
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/collections/__init__.py", line 1006, in __missing__
    raise KeyError(key)
KeyError: 'RANK'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/scope.py", line 242, in resolve
    return self.temps[key]
           ~~~~~~~~~~^^^^^
KeyError: 'RANK'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/oliverbryan/miniconda3/envs/yacht_env/bin/yacht", line 33, in <module>
    sys.exit(load_entry_point('yacht', 'console_scripts', 'yacht')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/YACHT/yacht/__init__.py", line 89, in main
    args.func(args)
  File "/home/oliverbryan/YACHT/yacht/standardize_yacht_output.py", line 135, in main
    standardize_yacht_output.run(
  File "/home/oliverbryan/YACHT/yacht/standardize_yacht_output.py", line 483, in run
    result = self.__to_cami(sample_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/YACHT/yacht/standardize_yacht_output.py", line 307, in __to_cami
    res_df = [summary_df.query(f'RANK == "{rank}"') for rank in self.allowable_rank]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/frame.py", line 4811, in query
    res = self.eval(expr, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/frame.py", line 4937, in eval
    return _eval(expr, inplace=inplace, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/eval.py", line 336, in eval
    parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/expr.py", line 809, in __init__
    self.terms = self.parse()
                 ^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/expr.py", line 828, in parse
    return self._visitor.visit(self.expr)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/expr.py", line 412, in visit
    return visitor(node, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/expr.py", line 418, in visit_Module
    return self.visit(expr, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/expr.py", line 412, in visit
    return visitor(node, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/expr.py", line 421, in visit_Expr
    return self.visit(node.value, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/expr.py", line 412, in visit
    return visitor(node, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/expr.py", line 719, in visit_Compare
    return self.visit(binop)
           ^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/expr.py", line 412, in visit
    return visitor(node, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/expr.py", line 532, in visit_BinOp
    op, op_class, left, right = self._maybe_transform_eq_ne(node)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/expr.py", line 452, in _maybe_transform_eq_ne
    left = self.visit(node.left, side="left")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/expr.py", line 412, in visit
    return visitor(node, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/expr.py", line 545, in visit_Name
    return self.term_type(node.id, self.env, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/ops.py", line 91, in __init__
    self._value = self._resolve_name()
                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/ops.py", line 115, in _resolve_name
    res = self.env.resolve(local_name, is_local=is_local)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oliverbryan/miniconda3/envs/yacht_env/lib/python3.12/site-packages/pandas/core/computation/scope.py", line 244, in resolve
    raise UndefinedVariableError(key, is_local) from err
pandas.errors.UndefinedVariableError: name 'RANK' is not defined

I have attached my result.xlsx file and my genome_to_taxid.tsv file for reference. The .tsv file is renamed to .txt because GitHub does not accept .tsv attachments; nothing else about it is changed. Attachments: result.xlsx, genome_to_taxid.txt

chunyuma commented 6 months ago

Hi @OliverBryan, after checking the files you attached, I found the issue.

The organism_name column in the result.xlsx file doesn't match the genome_id column in the genome_to_taxid.tsv file. Here is an example:

GCF_000364605.1_genomic is in the genome_to_taxid.tsv file, but the corresponding genome name in result.xlsx is GCF_000364605.1 Nocardioides sp. Iso805N strain=Iso805N, ASM36460v1. You should use one form or the other in both files.
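
A quick way to confirm this kind of mismatch is to compare the two sets of identifiers directly. The sketch below is only an illustration: `find_unmatched` is a hypothetical helper, not part of YACHT, and the values are copied from the example above.

```python
# Hypothetical helper, not part of YACHT: list the names from the YACHT result
# that have no identical entry in the genome-to-taxid mapping.
def find_unmatched(organism_names, genome_ids):
    known = set(genome_ids)
    return [name for name in organism_names if name not in known]


# Values taken from the example in this thread.
organism_names = [
    "GCF_000364605.1 Nocardioides sp. Iso805N strain=Iso805N, ASM36460v1"
]
genome_ids = ["GCF_000364605.1_genomic"]

unmatched = find_unmatched(organism_names, genome_ids)
print(f"{len(unmatched)} of {len(organism_names)} names have no taxid mapping")
```

If every name comes back as unmatched, the two files are probably using different naming schemes rather than missing a few entries.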

OliverBryan commented 6 months ago

I have updated the genome_to_taxid.tsv file to fix this, but I am still getting the same error, so perhaps I did it incorrectly. I have attached my updated genome_to_taxid.tsv file. I am also now running yacht convert --yacht_output 'result.xlsx' --sheet_name 'raw_result' --genome_to_taxid 'genome_to_taxid.tsv' --mode 'cami' --sample_name 'MySample' --outfile_prefix 'cami_result' --outdir ./ with the raw_result sheet instead of the min_coverage1.0 sheet, but both commands give the same error.

chunyuma commented 6 months ago

Hi @OliverBryan, thanks for trying out my suggestion. I can't find the attached files in your last message.

OliverBryan commented 6 months ago

My apologies @chunyuma, I forgot to attach it. I have attached it to this message. genome_to_taxid.txt

chunyuma commented 6 months ago

Hi @OliverBryan, sorry for the late response.

I figured out the issue. When I wrote the script, for some reason I didn't expect the genome_id column in the genome_to_taxid.tsv file to contain spaces, as in GCF_000364605.1 Nocardioides sp. Iso805N strain=Iso805N, ASM36460v1. I removed the genome name from each entry in the genome_id column, leaving only the accession (GCF_000364605.1 in this example), and it works now. I have attached the updated genome_to_taxid.txt file (see below) for you to try.

genome_to_taxid_test.txt

Thanks for bringing this to my attention. This is a bug, and I have fixed it in the code in this PR.
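
The manual clean-up described above (keeping only the accession in the genome_id column) can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `to_accession` and `clean_mapping` are hypothetical names, the `taxid` column name is assumed, and the taxid value in the sample is a placeholder.

```python
import csv
import io


def to_accession(genome_id):
    """Return the first whitespace-delimited token, e.g. 'GCF_000364605.1'."""
    parts = genome_id.split()
    return parts[0] if parts else genome_id


def clean_mapping(tsv_text, id_column="genome_id"):
    """Rewrite a tab-separated mapping so id_column holds bare accessions."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row[id_column] = to_accession(row[id_column])
        writer.writerow(row)
    return out.getvalue()


# Assumed layout: a 'genome_id' column and a 'taxid' column.
# 12345 is a placeholder, not the real taxid for this genome.
sample = (
    "genome_id\ttaxid\n"
    "GCF_000364605.1 Nocardioides sp. Iso805N strain=Iso805N, ASM36460v1\t12345\n"
)
cleaned = clean_mapping(sample)
print(cleaned)
```

Splitting on the first whitespace is a deliberate simplification. It relies on the accession always coming first in the name, as it does in the example from this thread.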

OliverBryan commented 6 months ago

Hi @chunyuma,

Everything is working on my end with the updated genome_to_taxid_test.tsv file, thank you for your help.