davidmadariaga closed 1 year ago
This is what traits.csv looks like:
,IMSAR,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
IsolateA,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
IsolateB,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
IsolateC,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
(...)
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,NDEGEJ_04245,NDEGEJ_04250,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
(...)
I cleaned it up to look like this:
,IMSAR
IsolateA,1
IsolateB,0
IsolateC,0
IsolateD,0
IsolateE,0
IsolateF,1
IsolateG,0
IsolateH,1
IsolateI,1
IsolateJ,0
IsolateK,1
IsolateL,0
IsolateM,0
IsolateN,1
IsolateO,0
IsolateP,1
IsolateQ,0
IsolateR,1
IsolateS,1
IsolateT,1
Moreover, I discovered a minor bug in Scoary2 that prevented it from loading your gene_presence_absence.csv, which is fixed in Scoary2 version 0.0.12. Make sure to use the newest version!
Use Scoary2 like this and it should work:
scoary2 --genes gene_presence_absence.csv --traits traits.csv --outdir ./caca --gene-data-type gene-list:,
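For larger files, the traits.csv cleanup shown above can be scripted rather than done by hand. This is only a rough sketch using Python's standard `csv` module; the function name `clean_traits` is made up for illustration, and it assumes the first column holds the isolate name and that every real trait column has a non-empty header:

```python
import csv
import io


def clean_traits(raw_csv: str) -> str:
    """Drop header-less columns and rows that have no isolate name."""
    rows = list(csv.reader(io.StringIO(raw_csv)))
    header = rows[0]
    # Column 0 is the isolate name; keep any other column with a non-empty header.
    keep = [0] + [i for i, name in enumerate(header) if i > 0 and name.strip()]

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([header[i] for i in keep])
    for row in rows[1:]:
        # Skip rows without an isolate name (only commas or stray cells).
        if not row or not row[0].strip():
            continue
        writer.writerow([row[i] if i < len(row) else "" for i in keep])
    return out.getvalue()


if __name__ == "__main__":
    messy = (
        ",IMSAR,,,\n"
        "IsolateA,1,,,\n"
        "IsolateB,0,,,\n"
        ",,,,\n"
        ",,NDEGEJ_04245,NDEGEJ_04250,\n"
    )
    print(clean_traits(messy))
```

With a real file, read it with `open("traits.csv", newline="").read()` and write the returned string to a new file, so the original stays untouched.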
Thank you very much! That worked, you are amazing!
No worries! 😄👍
Hi again, hope you are doing great! I have another question. Feel free to answer it when you have time, or don't answer it at all if you are not able to; I'll try to catch the error myself. The question is: can I put numbers in the genome names? I'm having trouble with the .csv file. I'm working with a lot more genomes now, though.
this is the error:
(myenv) d@dpc:~/Documents/IMSAR/scoary$ scoary2 --genes gene_presence_absence.csv --traits traits.csv --outdir ./caca2 --gene-data-type gene-list:,
Welcome to Scoary2! (0.0.12)
Loading traits...
Loading genes...
Traceback (most recent call last):
  File "/home/d/Documents/IMSAR/scoary/myenv/bin/scoary2", line 8, in <module>
    sys.exit(main())
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/scoary/scoary.py", line 289, in main
    fire.Fire(scoary)
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/scoary/scoary.py", line 112, in scoary
    genes_orig_df, genes_bool_df = load_genes(
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/scoary/load_genes.py", line 145, in load_genes
    genes_orig_df, genes_bool_df = load_gene_list_file(genes, delimiter, restrict_to, ignore)
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/scoary/load_genes.py", line 86, in load_gene_list_file
    list_df = filter_df(list_df, restrict_to, ignore)
  File "/home/d/Documents/IMSAR/scoary/myenv/lib/python3.10/site-packages/scoary/load_genes.py", line 19, in filter_df
    assert len(cols_missing) == 0, f'Some strains in restrict_to were not found:' \
AssertionError: Some strains in restrict_to were not found:
cols_missing={62984, 62986, 62987, 62988, 62989, 62991, 45114, 45115, 45119, 45120, 45121, 45122, 45125, 45127, 45128, 45133, 45138, 45139, 31318, 36448, 36450, 36451, 31332, 45157, 45158, 36452, 36453, 45161, 36461, 45167, 45168, 36463, 45172, 45173, 49271, 49272, 82041, 82043, 82044, 82045, 45181, 45184, 45185, 82049, 47235, 45188, 45189, 45190, 82050, 82051, 45187, 82052, 95885, 95886, 95887, 95888, 95889, 95893, 95894, 95895, 95896, 85163, 698, 97476, 39628, 39629, 39630, 39632, 39633, 39634, 39638, 39641, 39645, 39647, 39649, 39650, 39651, 57573, 57574, 57575, 57577, 57582, 57583, 57584, 2290, 82218, 82219, 82220, 82224, 45893, 39775, 42848, 39781, 39791, 39795, 39797, 39798, 58230, 46977, 46979, 39812, 46981, 46980, 46982, 46984, 46983, 46985, 46986, 46987, 46988, 46989, 58253, 46991, 46992, 46993, 46994, 39827, 46996, 47003, 39837, 47006, 47007, 47008, 47009, 47010, 47011, 47012, 39845, 58271, 39847, 39850, 39855, 39861, 39866, 39867, 58300, 84413, 39870, 39871, 39880, 39882, 39884, 47005, 58322, 39892, 39896, 39898, 39899, 39900, 39901, 39903, 81899, 81900, 81901, 81902, 81907, 81908, 81909, 81910, 62972}
restrict_to={62984, 62986, 62987, 62988, 62989, 62991, 45114, 45115, 45119, 45120, 45121, 45122, 45125, 45127, 45128, 45133, 45138, 45139, 31318, 36448, 36450, 36451, 31332, 45157, 45158, 36452, 36453, 45161, 36461, 45167, 45168, 36463, 45172, 45173, 49271, 49272, 82041, 82043, 82044, 45181, 82045, 45184, 45185, 82049, 47235, 45188, 45189, 45190, 82050, 82051, 82052, 45187, 95885, 95886, 95887, 95888, 95889, 95893, 95894, 95895, 95896, 85163, 698, 97476, 39628, 39629, 39630, 39632, 39633, 39634, 39638, 39641, 39645, 39647, 39649, 39650, 39651, 57573, 57574, 57575, 57577, 57582, 57583, 57584, 2290, 82218, 82219, 82220, 82224, 45893, 39775, 42848, 39781, 46984, 39791, 39795, 39797, 39798, 58230, 46977, 46979, 39812, 46981, 46980, 46982, 46983, 46985, 46986, 46987, 46988, 46989, 58253, 46991, 46992, 46993, 46994, 39827, 46996, 47003, 39837, 47006, 47007, 47008, 47009, 47010, 47011, 47012, 39845, 58271, 39847, 39850, 39855, 39861, 39866, 39867, 58300, 84413, 39870, 39871, 39880, 39882, 39884, 47005, 58322, 39892, 39896, 39898, 39899, 39900, 39901, 39903, 81899, 81900, 81901, 81902, 81907, 81908, 81909, 81910, 62972}
have_cols={'39845', '46986', '47010', '82218', '39884', '47003', '46979', '39634', '39871', '39850', '39629', '39798', 'Genome Fragment', '39827', '39870', '47011', '46984', '81910', '82220', '58230', '39900', '81902', '36461', '39647', '45139', '39896', '45172', '82044', '81908', '97476', '57583', '57582', 'Order within Fragment', 'Avg sequences per isolate', '45128', '36448', '39855', '45133', '39633', '46985', '81909', '39837', '39880', '2290', '45161', '82043', '39630', '45114', '39866', '42848', '82041', '46988', '39882', '45157', '47006', '39903', '58300', '46992', '39628', '45167', '58322', 'Accessory Order with Fragment', 'Min group size nuc', '57573', 'Avg group size nuc', '58253', '82045', '58271', '46977', '39791', '82219', '45120', '47005', '39632', '46996', '39812', 'Max group size nuc', '46987', '47012', '81901', '82051', '45188', '36463', '45158', '46991', 'Non-unique Gene name', '39892', '57584', '82224', '57577', '57575', '45122', '39645', '46980', '47235', '39899', '82050', '39795', '39781', '39641', '46983', '45173', '84413', '85163', 'QC', '57574', '36451', '45181', '45138', '47008', '81900', '36452', '31332', '39651', '45125', '45168', '46981', '47009', '39638', '39775', '39867', '45190', '39861', '45121', '46994', 'Accessory Fragment', 'No. isolates', '39847', '45185', '36453', '47007', '45127', '46993', '45189', '46982', '39898', '45115', '46989', 'Annotation', '45187', '698', 'No. sequences', '39797', '82052', '31318', '49271', '39650', '45893', '36450', '81907', '39649', '45119', '81899', '82049', '45184', '39901'}
and these are my input files:
traits.csv gene_presence_absence.csv
Cheers! and have a great weekend!
The latest version (0.0.13) makes sure the index is always str.
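The first traceback hints at why the str index matters: there, `restrict_to` holds integers (62984, 698, ...) while `have_cols` holds strings ('45114', '698', ...), so no name can match, even for strains that are present in both files. Below is a minimal plain-Python sketch of that mismatch. It is not Scoary2's actual code, and the helper name `missing_strains` is made up for illustration:

```python
def missing_strains(restrict_to: set, have_cols: set) -> set:
    """Return the names in restrict_to that are not found in have_cols."""
    return restrict_to - have_cols


# A few names taken from the tracebacks in this thread.
have_cols = {"45114", "698", "Annotation"}   # gene-table columns, read as str
traits_as_int = {62984, 45114, 698}          # isolate names parsed as integers

# Integers never equal strings, so every strain looks missing.
print(missing_strains(traits_as_int, have_cols))

# After converting the names to str, only the truly absent strain remains.
traits_as_str = {str(name) for name in traits_as_int}
print(missing_strains(traits_as_str, have_cols))  # {'62984'}
```

So purely numeric genome names should be fine as long as both files end up with the names as strings.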
However, it still does not run with your dataset:
AssertionError: Some strains in restrict_to were not found:
cols_missing={'62988', '62984', '49272', '62987', '95894', '95889', '62986', '95885', '95896', '95887', '62989', '95888', '95893', '95895', '62991', '62972', '95886'}
restrict_to={'45168', '31318', '95894', '39871', '45185', '39629', '45187', '97476', '95886', '47009', '82224', '82052', '46993', '47003', '39630', '57582', '36448', '82051', '49271', '47012', '39850', '45127', '57575', '45188', '46986', '45181', '39845', '39812', '31332', '47011', '45893', '47005', '39641', '39861', '49272', '46992', '36463', '82041', '42848', '45189', '47007', '45133', '698', '39899', '39884', '39775', '39847', '45157', '81899', '81901', '39867', '39827', '45184', '39628', '62972', '62984', '39651', '45190', '95889', '39866', '45138', '58300', '82045', '82219', '58230', '46982', '57577', '82044', '46987', '58253', '45121', '39791', '95887', '95895', '57573', '45172', '39880', '39901', '82043', '81908', '81900', '39896', '39900', '46977', '47235', '39638', '45115', '46988', '39870', '58322', '62989', '39781', '82049', '47008', '36461', '46980', '58271', '36451', '39649', '39645', '45139', '39898', '47010', '84413', '46983', '46979', '39634', '62986', '46991', '45114', '81910', '95888', '62991', '39882', '47006', '46996', '46984', '82220', '39632', '39798', '81902', '39892', '36450', '36452', '39650', '46994', '2290', '45158', '85163', '46985', '82050', '95893', '45122', '39837', '36453', '82218', '45120', '45119', '57574', '81909', '39797', '45128', '95885', '45161', '81907', '39855', '62988', '46989', '62987', '39633', '39795', '95896', '46981', '57583', '45173', '39647', '45167', '39903', '45125', '57584'}
have_cols={'45168', '31318', '39871', '39629', '45185', 'Non-unique Gene name', '45187', '97476', '47009', '82224', '82052', '46993', '47003', '39630', '57582', '36448', 'Min group size nuc', '82051', '49271', '47012', '39850', '45127', '46986', '45188', '57575', '45181', '39845', '39812', '31332', '47011', '45893', '47005', '39641', '39861', '46992', '36463', '82041', '42848', '45189', '47007', '45133', '698', '39899', '39775', '39884', '39847', 'No. sequences', '45157', '81899', 'Avg group size nuc', '81901', '39867', '39827', '39628', '45184', '39651', '45190', '39866', '45138', '58300', '82045', '82219', '58230', 'Accessory Order with Fragment', '46982', '46987', '57577', '82044', '58253', '45121', '39791', '57573', '45172', '39880', '39901', '82043', '81908', '81900', '39896', '39900', '46977', '47235', '39638', '45115', '46988', '39870', '58322', '39781', '82049', 'Accessory Fragment', '47008', '36461', 'Avg sequences per isolate', '46980', '58271', '36451', '39649', '39645', '45139', '39898', '47010', '84413', 'Genome Fragment', '46979', '46983', '39634', '46991', '45114', '81910', '39882', '47006', '46996', '46984', '82220', 'Annotation', '39632', '39798', 'Max group size nuc', '81902', '36452', '36450', '39650', '39892', '46994', '2290', '45158', '85163', '46985', '82050', '45122', '39837', '36453', '82218', '39797', '45119', '45120', '57574', '81909', '45128', 'No. isolates', '45161', '81907', '39855', '46989', 'Order within Fragment', '39633', '39795', '46981', '57583', '39647', 'QC', '45173', '45167', '39903', '45125', '57584'}
17 of the strains in traits.csv, such as 62988, are not present in gene_presence_absence.csv.
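To find such strains before running Scoary2, the isolate names in traits.csv can be compared with the header of gene_presence_absence.csv. This is a rough sketch using only the standard library; the function name `split_traits` is hypothetical, and it assumes both files are comma-delimited with the isolate name in the first column of traits.csv:

```python
import csv
import io


def split_traits(traits_csv: str, genes_csv: str):
    """Return (traits CSV restricted to known isolates, list of missing isolates)."""
    # Isolates appear as column names in the header of the gene table.
    genes_header = next(csv.reader(io.StringIO(genes_csv)))
    have_cols = {name.strip() for name in genes_header}

    rows = list(csv.reader(io.StringIO(traits_csv)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(rows[0])  # keep the traits header unchanged

    missing = []
    for row in rows[1:]:
        if not row:
            continue
        if row[0].strip() in have_cols:
            writer.writerow(row)
        else:
            missing.append(row[0])
    return out.getvalue(), missing


if __name__ == "__main__":
    genes = "Gene,Annotation,45114,698\ngroup_1,hypothetical protein,a,b\n"
    traits = ",IMSAR\n45114,1\n62988,0\n698,0\n"
    kept, missing = split_traits(traits, genes)
    print("missing:", missing)
    print(kept)
```

Names are compared as strings here, so numeric isolate names behave the same as alphabetic ones. Writing `kept` to a new traits file drops the missing isolates; the alternative is to add them to the gene table upstream.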
Thanks for the answer. If I have news, I'll let you know. Thanks again for taking the time to answer.
Greetings.
Hi, it's solved. The problem was those 17 strains that were not in the presence_absence file. Once I deleted them from the traits file, I could run Scoary :D Thanks once again. May Guido van Rossum bless you in all your programming.
Haha, great! :)
hi sir, hope you are doing great. Could you please help me with this:
thanks, cheers!,
Greetings from Chile, South America
btw, these are my input files: gene_presence_absence.csv traits.csv