MrTomRod / scoary-2

Calculate associations between genes and traits
MIT License

AssertionError: traits='traits.csv': index not unique #5

Closed davidmadariaga closed 1 year ago

davidmadariaga commented 1 year ago

Hi sir, hope you are doing great. Could you please help me with this:

(scoary-env) d@dpc:~/Documents/IMSAR/scoary$ scoary2 --genes gene_presence_absence.csv --traits traits.csv --outdir ./caca
Welcome to Scoary2! (0.0.11)
Loading traits...
Traceback (most recent call last):
  File "/home/d/scoary-env/bin/scoary2", line 8, in <module>
    sys.exit(main())
  File "/home/d/scoary-env/lib/python3.10/site-packages/scoary/scoary.py", line 289, in main
    fire.Fire(scoary)
  File "/home/d/scoary-env/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/d/scoary-env/lib/python3.10/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/d/scoary-env/lib/python3.10/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/d/scoary-env/lib/python3.10/site-packages/scoary/scoary.py", line 88, in scoary
    numeric_df, traits_df = load_traits(
  File "/home/d/scoary-env/lib/python3.10/site-packages/scoary/load_traits.py", line 410, in load_traits
    traits_df = load_binary(
  File "/home/d/scoary-env/lib/python3.10/site-packages/scoary/load_traits.py", line 58, in load_binary
    assert traits_df.index.is_unique, f'{traits=}: index not unique'
AssertionError: traits='traits.csv': index not unique

Thanks, cheers!

Greetings from Chile, South America

By the way, these are my input files: gene_presence_absence.csv, traits.csv

MrTomRod commented 1 year ago

This is what traits.csv looks like:

,IMSAR,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
IsolateA,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
IsolateB,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
IsolateC,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
(...)
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,NDEGEJ_04245,NDEGEJ_04250,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
(...)

I cleaned it up to look like this:

,IMSAR
IsolateA,1
IsolateB,0
IsolateC,0
IsolateD,0
IsolateE,0
IsolateF,1
IsolateG,0
IsolateH,1
IsolateI,1
IsolateJ,0
IsolateK,1
IsolateL,0
IsolateM,0
IsolateN,1
IsolateO,0
IsolateP,1
IsolateQ,0
IsolateR,1
IsolateS,1
IsolateT,1
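For anyone hitting the same assertion: the raw file above contains many rows with an empty first column, and those rows presumably all end up with the same empty index label, which would explain the `index not unique` message. Below is a minimal sketch of the cleanup using only the Python standard library. `clean_traits` and `duplicated_names` are hypothetical helper names, not part of Scoary2, and the sample text is a shortened stand-in for the real file.

```python
import csv
import io
from collections import Counter


def clean_traits(text: str) -> list[list[str]]:
    """Keep rows that have an isolate name and columns that have a header."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0]
    # Column 0 holds the isolate names; keep every other column with a non-empty header.
    keep = [0] + [i for i, name in enumerate(header) if i > 0 and name.strip()]
    cleaned = [[header[i] for i in keep]]
    for row in rows[1:]:
        if not row or not row[0].strip():
            continue  # blank row, or stray cells without an isolate name
        cleaned.append([row[i] if i < len(row) else "" for i in keep])
    return cleaned


def duplicated_names(rows: list[list[str]]) -> list[str]:
    """Return the first-column values that occur more than once (header excluded)."""
    counts = Counter(row[0] if row else "" for row in rows[1:])
    return sorted(name for name, n in counts.items() if n > 1)


# Shortened stand-in for the spreadsheet export: padding commas and stray cells.
raw = ",IMSAR,,\nIsolateA,1,,\nIsolateB,0,,\n,,,\n,,,\n,,X,Y\n"

raw_rows = list(csv.reader(io.StringIO(raw)))
cleaned = clean_traits(raw)

print(duplicated_names(raw_rows))  # [''] : the empty name repeats
print(duplicated_names(cleaned))   # []
print(cleaned)
```

Writing `cleaned` back out with `csv.writer` gives a file in the same shape as the cleaned-up version shown above.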

Moreover, I discovered a minor bug in Scoary2 that prevented it from loading your gene_presence_absence.csv. The bug is fixed in Scoary2 version 0.0.12, so make sure to use the newest version!

Use Scoary2 like this and it should work:

scoary2 --genes gene_presence_absence.csv --traits traits.csv --outdir ./caca --gene-data-type gene-list:,

davidmadariaga commented 1 year ago

Thank you very much! That worked, you are amazing!

MrTomRod commented 1 year ago

No worries! 😄👍

davidmadariaga commented 1 year ago

Hi again, hope you are doing great! I have another question. Feel free to answer whenever you have time, or not at all if you can't; I'll try to track down the error myself. The question: can I use numbers in the genome names? I'm having trouble with the .csv file. I'm also working with a lot more genomes now.

this is the error:

(myenv) d@dpc:~/Documents/IMSAR/scoary$ scoary2 --genes gene_presence_absence.csv --traits traits.csv --outdir ./caca2 --gene-data-type gene-list:,
Welcome to Scoary2! (0.0.12)
Loading traits...
Loading genes...
Traceback (most recent call last):
  File "/home/d/Documents/IMSAR/scoary/myenv/bin/scoary2", line 8, in <module>
    sys.exit(main())
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/scoary/scoary.py", line 289, in main
    fire.Fire(scoary)
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/scoary/scoary.py", line 112, in scoary
    genes_orig_df, genes_bool_df = load_genes(
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/scoary/load_genes.py", line 145, in load_genes
    genes_orig_df, genes_bool_df = load_gene_list_file(genes, delimiter, restrict_to, ignore)
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/scoary/load_genes.py", line 86, in load_gene_list_file
    list_df = filter_df(list_df, restrict_to, ignore)
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/scoary/load_genes.py", line 19, in filter_df
    assert len(cols_missing) == 0, f'Some strains in restrict_to were not found:' \
AssertionError: Some strains in restrict_to were not found:
cols_missing={62984, 62986, 62987, 62988, 62989, 62991, 45114, 45115, 45119, 45120, 45121, 45122, 45125, 45127, 45128, 45133, 45138, 45139, 31318, 36448, 36450, 36451, 31332, 45157, 45158, 36452, 36453, 45161, 36461, 45167, 45168, 36463, 45172, 45173, 49271, 49272, 82041, 82043, 82044, 82045, 45181, 45184, 45185, 82049, 47235, 45188, 45189, 45190, 82050, 82051, 45187, 82052, 95885, 95886, 95887, 95888, 95889, 95893, 95894, 95895, 95896, 85163, 698, 97476, 39628, 39629, 39630, 39632, 39633, 39634, 39638, 39641, 39645, 39647, 39649, 39650, 39651, 57573, 57574, 57575, 57577, 57582, 57583, 57584, 2290, 82218, 82219, 82220, 82224, 45893, 39775, 42848, 39781, 39791, 39795, 39797, 39798, 58230, 46977, 46979, 39812, 46981, 46980, 46982, 46984, 46983, 46985, 46986, 46987, 46988, 46989, 58253, 46991, 46992, 46993, 46994, 39827, 46996, 47003, 39837, 47006, 47007, 47008, 47009, 47010, 47011, 47012, 39845, 58271, 39847, 39850, 39855, 39861, 39866, 39867, 58300, 84413, 39870, 39871, 39880, 39882, 39884, 47005, 58322, 39892, 39896, 39898, 39899, 39900, 39901, 39903, 81899, 81900, 81901, 81902, 81907, 81908, 81909, 81910, 62972}
restrict_to={62984, 62986, 62987, 62988, 62989, 62991, 45114, 45115, 45119, 45120, 45121, 45122, 45125, 45127, 45128, 45133, 45138, 45139, 31318, 36448, 36450, 36451, 31332, 45157, 45158, 36452, 36453, 45161, 36461, 45167, 45168, 36463, 45172, 45173, 49271, 49272, 82041, 82043, 82044, 45181, 82045, 45184, 45185, 82049, 47235, 45188, 45189, 45190, 82050, 82051, 82052, 45187, 95885, 95886, 95887, 95888, 95889, 95893, 95894, 95895, 95896, 85163, 698, 97476, 39628, 39629, 39630, 39632, 39633, 39634, 39638, 39641, 39645, 39647, 39649, 39650, 39651, 57573, 57574, 57575, 57577, 57582, 57583, 57584, 2290, 82218, 82219, 82220, 82224, 45893, 39775, 42848, 39781, 46984, 39791, 39795, 39797, 39798, 58230, 46977, 46979, 39812, 46981, 46980, 46982, 46983, 46985, 46986, 46987, 46988, 46989, 58253, 46991, 46992, 46993, 46994, 39827, 46996, 47003, 39837, 47006, 47007, 47008, 47009, 47010, 47011, 47012, 39845, 58271, 39847, 39850, 39855, 39861, 39866, 39867, 58300, 84413, 39870, 39871, 39880, 39882, 39884, 47005, 58322, 39892, 39896, 39898, 39899, 39900, 39901, 39903, 81899, 81900, 81901, 81902, 81907, 81908, 81909, 81910, 62972}
have_cols={'39845', '46986', '47010', '82218', '39884', '47003', '46979', '39634', '39871', '39850', '39629', '39798', 'Genome Fragment', '39827', '39870', '47011', '46984', '81910', '82220', '58230', '39900', '81902', '36461', '39647', '45139', '39896', '45172', '82044', '81908', '97476', '57583', '57582', 'Order within Fragment', 'Avg sequences per isolate', '45128', '36448', '39855', '45133', '39633', '46985', '81909', '39837', '39880', '2290', '45161', '82043', '39630', '45114', '39866', '42848', '82041', '46988', '39882', '45157', '47006', '39903', '58300', '46992', '39628', '45167', '58322', 'Accessory Order with Fragment', 'Min group size nuc', '57573', 'Avg group size nuc', '58253', '82045', '58271', '46977', '39791', '82219', '45120', '47005', '39632', '46996', '39812', 'Max group size nuc', '46987', '47012', '81901', '82051', '45188', '36463', '45158', '46991', 'Non-unique Gene name', '39892', '57584', '82224', '57577', '57575', '45122', '39645', '46980', '47235', '39899', '82050', '39795', '39781', '39641', '46983', '45173', '84413', '85163', 'QC', '57574', '36451', '45181', '45138', '47008', '81900', '36452', '31332', '39651', '45125', '45168', '46981', '47009', '39638', '39775', '39867', '45190', '39861', '45121', '46994', 'Accessory Fragment', 'No. isolates', '39847', '45185', '36453', '47007', '45127', '46993', '45189', '46982', '39898', '45115', '46989', 'Annotation', '45187', '698', 'No. sequences', '39797', '82052', '31318', '49271', '39650', '45893', '36450', '81907', '39649', '45119', '81899', '82049', '45184', '39901'}

and these are my input files:

traits.csv gene_presence_absence.csv

Cheers, and have a great weekend!

MrTomRod commented 1 year ago

The latest version (0.0.13) makes sure the index is always str.

However, it still does not run with your dataset:

AssertionError: Some strains in restrict_to were not found:
cols_missing={'62988', '62984', '49272', '62987', '95894', '95889', '62986', '95885', '95896', '95887', '62989', '95888', '95893', '95895', '62991', '62972', '95886'}
restrict_to={'45168', '31318', '95894', '39871', '45185', '39629', '45187', '97476', '95886', '47009', '82224', '82052', '46993', '47003', '39630', '57582', '36448', '82051', '49271', '47012', '39850', '45127', '57575', '45188', '46986', '45181', '39845', '39812', '31332', '47011', '45893', '47005', '39641', '39861', '49272', '46992', '36463', '82041', '42848', '45189', '47007', '45133', '698', '39899', '39884', '39775', '39847', '45157', '81899', '81901', '39867', '39827', '45184', '39628', '62972', '62984', '39651', '45190', '95889', '39866', '45138', '58300', '82045', '82219', '58230', '46982', '57577', '82044', '46987', '58253', '45121', '39791', '95887', '95895', '57573', '45172', '39880', '39901', '82043', '81908', '81900', '39896', '39900', '46977', '47235', '39638', '45115', '46988', '39870', '58322', '62989', '39781', '82049', '47008', '36461', '46980', '58271', '36451', '39649', '39645', '45139', '39898', '47010', '84413', '46983', '46979', '39634', '62986', '46991', '45114', '81910', '95888', '62991', '39882', '47006', '46996', '46984', '82220', '39632', '39798', '81902', '39892', '36450', '36452', '39650', '46994', '2290', '45158', '85163', '46985', '82050', '95893', '45122', '39837', '36453', '82218', '45120', '45119', '57574', '81909', '39797', '45128', '95885', '45161', '81907', '39855', '62988', '46989', '62987', '39633', '39795', '95896', '46981', '57583', '45173', '39647', '45167', '39903', '45125', '57584'}
have_cols={'45168', '31318', '39871', '39629', '45185', 'Non-unique Gene name', '45187', '97476', '47009', '82224', '82052', '46993', '47003', '39630', '57582', '36448', 'Min group size nuc', '82051', '49271', '47012', '39850', '45127', '46986', '45188', '57575', '45181', '39845', '39812', '31332', '47011', '45893', '47005', '39641', '39861', '46992', '36463', '82041', '42848', '45189', '47007', '45133', '698', '39899', '39775', '39884', '39847', 'No. sequences', '45157', '81899', 'Avg group size nuc', '81901', '39867', '39827', '39628', '45184', '39651', '45190', '39866', '45138', '58300', '82045', '82219', '58230', 'Accessory Order with Fragment', '46982', '46987', '57577', '82044', '58253', '45121', '39791', '57573', '45172', '39880', '39901', '82043', '81908', '81900', '39896', '39900', '46977', '47235', '39638', '45115', '46988', '39870', '58322', '39781', '82049', 'Accessory Fragment', '47008', '36461', 'Avg sequences per isolate', '46980', '58271', '36451', '39649', '39645', '45139', '39898', '47010', '84413', 'Genome Fragment', '46979', '46983', '39634', '46991', '45114', '81910', '39882', '47006', '46996', '46984', '82220', 'Annotation', '39632', '39798', 'Max group size nuc', '81902', '36452', '36450', '39650', '39892', '46994', '2290', '45158', '85163', '46985', '82050', '45122', '39837', '36453', '82218', '39797', '45119', '45120', '57574', '81909', '45128', 'No. isolates', '45161', '81907', '39855', '46989', 'Order within Fragment', '39633', '39795', '46981', '57583', '39647', 'QC', '45173', '45167', '39903', '45125', '57584'}

17 of the strains in traits.csv, such as 62988, are not present in gene_presence_absence.csv.
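The two error messages above can be reproduced in miniature. In the first one, `cols_missing` holds integers while `have_cols` holds strings, so no name can match; once both sides are strings, only the names that are truly absent remain. The sketch below illustrates this with three IDs taken from the output above. `missing_strains` is a hypothetical helper, not a Scoary2 function.

```python
def missing_strains(trait_names, gene_columns):
    """Return trait isolates absent from the gene table, comparing names as strings."""
    have = {str(c).strip() for c in gene_columns}
    wanted = {str(n).strip() for n in trait_names}
    return sorted(wanted - have)


# Small excerpt of the IDs from the error output above.
trait_names = [62988, 45168, 698]                    # numeric names parsed as integers
gene_columns = ["45168", "698", "Annotation", "QC"]  # column names read as strings

# Naive comparison: an int never equals a str, so every isolate looks missing.
naive = set(trait_names) - set(gene_columns)
print(len(naive))  # 3

# Comparing as strings leaves only the isolate that is really absent.
print(missing_strains(trait_names, gene_columns))  # ['62988']
```

Running a check like this on the real traits index and the real gene table header before calling Scoary2 would list the isolates to look into.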

davidmadariaga commented 1 year ago

Thanks for the answer. If I have news, I'll tell you. Thanks again for taking the time to answer.

Greetings.

davidmadariaga commented 1 year ago

Hi, it's solved. The problem was those 17 strains that were not in the presence_absence file. Once I deleted them from the traits file, I could run Scoary :D Thanks once again. May Guido van Rossum bless you in all your programming.
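Removing the absent isolates from the traits file by hand works for 17 rows; for larger datasets the same step could be scripted. This is a rough sketch with the standard library only. `drop_missing_isolates` is a hypothetical helper, and the trait values in the sample are made up for illustration (only the isolate IDs come from the output above).

```python
import csv
import io


def drop_missing_isolates(traits_text, gene_columns):
    """Return (filtered CSV text, dropped names) for a traits table.

    Rows whose isolate name (first column) is not among gene_columns are removed.
    """
    have = {str(c).strip() for c in gene_columns}
    reader = csv.reader(io.StringIO(traits_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(reader))  # copy the header row unchanged
    dropped = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if row[0].strip() in have:
            writer.writerow(row)
        else:
            dropped.append(row[0])
    return out.getvalue(), dropped


# Illustrative input: the 0/1 values are invented, the IDs appear in the thread.
traits_text = ",IMSAR\n45168,1\n62988,0\n698,1\n"
gene_columns = ["45168", "698", "Annotation", "QC"]

filtered, dropped = drop_missing_isolates(traits_text, gene_columns)
print(filtered)
print(dropped)  # ['62988']
```

Reviewing `dropped` before overwriting anything is worthwhile, since a long list may point to a naming mismatch rather than genuinely missing genomes.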

MrTomRod commented 1 year ago

Haha, great! :)