KEGG-to-anvio TypeError: StringMethods.split() takes from 1 to 2 positional arguments but 3 were given

Hi,

When I run the command below, I get a TypeError raised by this line:

y =pd.DataFrame(x[3].str.split(' ',1).tolist(),columns=['accession','description'])

The contents of both input files are shown below. I'm not sure why, but the table looks slightly different from the one shown here.

$ python3 KEGG-to-anvio --KeggDB ../KO_Orthology_ko00001.txt -i ../user_ko.txt -o KeggAnnotations-AnviImportable.txt

Traceback (most recent call last):
  File "/mnt/f/Mayur/GhostKOALA_results/GhostKoalaParser/KEGG-to-anvio", line 24, in <module>
    y =pd.DataFrame(x[3].str.split(' ',1).tolist(),columns=['accession','description'])
  File "/home/tgangar/miniconda3/envs/anvio-8/lib/python3.10/site-packages/pandas/core/strings/accessor.py", line 137, in wrapper
    return func(self, *args, **kwargs)
TypeError: StringMethods.split() takes from 1 to 2 positional arguments but 3 were given
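The message is consistent with a pandas release (2.0 or later, as far as I know) in which Series.str.split takes only the pattern positionally, so the split limit has to be passed as the keyword n. Below is a minimal sketch of the failing line rewritten that way. The two-row Series named col is a made-up stand-in for x[3], and the pandas version in the anvio-8 environment is an assumption, since it is not shown in the report.

```python
import pandas as pd

# Hypothetical stand-in for column 3 of the KO table (rows copied from the excerpt in this issue)
col = pd.Series([
    "K00844 HK; hexokinase [EC:2.7.1.1]",
    "K12407 GCK; glucokinase [EC:2.7.1.2]",
])

# Pass the split limit as the keyword n instead of a second positional argument
parts = col.str.split(" ", n=1)

# Same DataFrame construction as the original line 24
y = pd.DataFrame(parts.tolist(), columns=["accession", "description"])
print(y)
```

The keyword form should also work on older pandas releases, so it does not require pinning a version.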

$ cat ../KO_Orthology_ko00001.txt | head

09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]	K00844 HK; hexokinase [EC:2.7.1.1]
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]	K12407 GCK; glucokinase [EC:2.7.1.2]
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]	K00845 glk; glucokinase [EC:2.7.1.2]
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]	K25026 glk; glucokinase [EC:2.7.1.2]
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]	K01810 GPI, pgi; glucose-6-phosphate isomerase [EC:5.3.1.9]
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]	K06859 pgi1; glucose-6-phosphate isomerase, archaeal [EC:5.3.1.9]
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]	K13810 tal-pgi; transaldolase / glucose-6-phosphate isomerase [EC:2.2.1.2 5.3.1.9]
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]	K15916 pgi-pmi; glucose/mannose-6-phosphate isomerase [EC:5.3.1.9 5.3.1.8]
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]	K24182 PFK9; 6-phosphofructokinase [EC:2.7.1.11]
09100 Metabolism	09101 Carbohydrate metabolism	00010 Glycolysis / Gluconeogenesis [PATH:ko00010]	K00850 pfkA, PFK; 6-phosphofructokinase 1 [EC:2.7.1.11]

$ cat ../user_ko.txt | head

genecall_0	K03569
genecall_1	K03570
genecall_2	K03571
genecall_3	K05515
genecall_4	K05837
genecall_5	K08305
genecall_6	K03642
genecall_7	K07258
genecall_8	K00824
genecall_9	K13771

Thanks!

#!/usr/bin/env python
import pandas as pd
import argparse

"""
Parses annotation results from KEGG and optionally will pull in results from interproscan.
Assumes interproscan was run using the following flags: -f tsv --goterms --iprlookup --pathways.
"""

parser = argparse.ArgumentParser(description='Combines annotation Data for input to anvio')
parser.add_argument('--KeggDB', help='Identify the Kegg Orthology file (modified from htext using given bash script)')
parser.add_argument('-i', help='Specify the file containing GhostKoala Results')
parser.add_argument('--interproscan', help='Interproscan results')
parser.add_argument('-o', help='Specify an output file')
args = parser.parse_args()
arg_dict = vars(args)

keggortho_database = arg_dict['KeggDB']
output = arg_dict['o']
GK_results = arg_dict['i']

# Read in KO_Orthology file and format for downstream analysis
x = pd.read_table(keggortho_database, header=None, sep='\t')

# Split the last column into accession and description manually
accessions = []
descriptions = []
for item in x[3]:
    parts = item.split(' ', 1)
    accessions.append(parts[0])
    descriptions.append(parts[1] if len(parts) > 1 else '')

x['accession'] = accessions
x['description'] = descriptions

# Drop the original column and set index
xy = x.drop(3, axis=1).set_index('accession')
xy.columns = ["Category1", "Category2", "Category3", "description"]
xy.to_csv("KeggOrthology_Table1.txt", encoding='utf-8')

# Process GhostKoala results
keggAnnotation = pd.read_table(GK_results, header=None, names=["gene_callers_id", "accession"], index_col=None)
keggAnnotation = keggAnnotation.replace({'genecall_': ''}, regex=True)
keggAnnotation = keggAnnotation.dropna().set_index("accession")

merged = keggAnnotation.join(xy)
merged_reduced = merged.drop_duplicates(subset='gene_callers_id', keep="last")

# Extract relevant information and format for output
extracted = merged_reduced.filter(['gene_callers_id', 'description', 'accession']).reset_index().set_index('gene_callers_id')
e_value = [0] * len(extracted['accession'].tolist())
source = ['KeggGhostKoala'] * len(extracted['accession'].tolist())
extracted.insert(0, 'source', source)
extracted.insert(3, 'e_value', e_value)
extracted = extracted.rename(columns={'description': 'function'}, index=str)
print(extracted.head())

if arg_dict["interproscan"] is not None:
    interpro = pd.read_table(arg_dict["interproscan"], header=None, names=[
        "gene_callers_id", "MD5", "Length", "source", "accession", "function",
        "start_loc", "stop_loc", "e_value", "status", "date",
        "InterProAccession", "InterProDescription", "GOAnnotations", "Pathway"])
    InterProExtracted = interpro.filter(["gene_callers_id", "source", "accession", "function", "e_value"])
    InterProExtracted = InterProExtracted.replace({'genecall_': ''}, regex=True)
    InterProExtracted['e_value'] = InterProExtracted['e_value'].replace('-', 0)
    InterProExtracted = InterProExtracted.set_index("gene_callers_id")
    KEGG_InterPro_Combined = pd.concat([extracted, InterProExtracted])
    KEGG_InterPro_Combined.to_csv(output, sep='\t')
else:
    extracted.to_csv(output, sep='\t')
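To check the split and join steps of the script without touching the real files, a small in-memory sketch like the one below may help. It rests on several assumptions: the two KO rows are copied from the excerpt in this issue and are assumed to be tab-separated, the GhostKOALA pairings are invented so that one K number matches and one does not, and the manual loop is replaced by str.split with the keyword n.

```python
import io
import pandas as pd

# Two rows from the KO_Orthology_ko00001.txt excerpt, assumed to be tab-separated
ko_text = (
    "09100 Metabolism\t09101 Carbohydrate metabolism\t"
    "00010 Glycolysis / Gluconeogenesis [PATH:ko00010]\tK00844 HK; hexokinase [EC:2.7.1.1]\n"
    "09100 Metabolism\t09101 Carbohydrate metabolism\t"
    "00010 Glycolysis / Gluconeogenesis [PATH:ko00010]\tK12407 GCK; glucokinase [EC:2.7.1.2]\n"
)
# Hypothetical GhostKOALA rows: K00844 is in the table above, K03570 is not
gk_text = "genecall_0\tK00844\ngenecall_1\tK03570\n"

x = pd.read_table(io.StringIO(ko_text), header=None)

# Split column 3 once, on the first space, into accession and description
split = x[3].str.split(" ", n=1, expand=True)
x["accession"] = split[0]
x["description"] = split[1]

xy = x.drop(3, axis=1).set_index("accession")
xy.columns = ["Category1", "Category2", "Category3", "description"]

gk = pd.read_table(io.StringIO(gk_text), header=None,
                   names=["gene_callers_id", "accession"])
gk = gk.replace({"genecall_": ""}, regex=True).dropna().set_index("accession")

# Left join on the accession index; unmatched K numbers keep NaN descriptions
merged = gk.join(xy)
print(merged[["gene_callers_id", "description"]])
```

A gene call whose K number is absent from the orthology table keeps a missing description after the join, which is worth checking if the output looks different from what is expected.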

edgraham / GhostKoalaParser, issue #11