Closed bloomarun closed 1 year ago
Hi @bloomarun
Yes, Scoary2 can binarize them for you. This is described in Wiki > Inputs. Are the instructions clear enough?
I don't follow: dominant and recessive are terms normally used in the context of polyploid genomes. Scoary2 assumes clonal reproduction, so it may be the wrong tool for you!
My initial idea for encoding the color would be to create one binary "trait" per color, for example:
Trait | pink | white | yellow |
---|---|---|---|
isolate-1 | 1 | 0 | 0 |
isolate-2 | 0 | 0 | 1 |
isolate-3 | 0 | 1 | 0 |
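The table above can be generated rather than typed by hand. The following is a minimal sketch using only the Python standard library; the function names `one_hot_traits` and `to_csv_text` are made up for illustration, and the comma-separated layout with a `Trait` header cell is an assumption taken from the table, so please check Wiki > Inputs for the exact format Scoary2 expects.

```python
import csv
import io


def one_hot_traits(category_by_isolate):
    """Turn {isolate: category} into one 0/1 column per category."""
    categories = sorted(set(category_by_isolate.values()))
    header = ["Trait"] + categories
    rows = [
        [isolate] + [1 if value == category else 0 for category in categories]
        for isolate, value in category_by_isolate.items()
    ]
    return header, rows


def to_csv_text(header, rows):
    """Serialize the header and rows as comma-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical input matching the example table.
colors = {"isolate-1": "pink", "isolate-2": "yellow", "isolate-3": "white"}
header, rows = one_hot_traits(colors)
csv_text = to_csv_text(header, rows)
print(csv_text)
```

Each resulting column can then be treated as its own binary trait.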
Hello @MrTomRod, thanks for the reply. Okay, this is the scenario: I am trying to look at antibiotic resistance patterns in a dataset of bacterial genomes. My gene input is Roary's gene_presence_absence.csv, and the traits file has data of antibiotic susceptibility with Susceptible (S), Resistant (R) or Intermediate resistant (I). Can I quantize them as S=0, I=0.5, R=1?
Also, when I tried to run Scoary2 with the data described above, I got the following error while parsing the genes file:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3592, saw 13
Command run (both inputs are .csv files):
scoary2 gene_presence_absence.csv --gene-data-type 'gene-list:,' --traits traits.csv --trait-data-type 'gaussian:kmeans:,' --n-cpus 96 --outdir scoary2_out
> the traits file has data of antibiotic susceptibility with Susceptible (S), Resistant (R) or Intermediate resistant (I). Can I quantize them as S=0, I=0.5, R=1?
You can, but Scoary2 will simply binarize your data. It is better to do that manually in your case, imo.
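For illustration, manual binarization could look like the sketch below. It is not part of Scoary2: the function name `binarize_susceptibility` and its `intermediate_as` parameter are hypothetical, and how to treat intermediate (I) isolates is a judgment call. Here they are left as missing (`None`) by default, or counted as resistant on request. How missing values should be written in the traits file is not covered here.

```python
def binarize_susceptibility(calls, intermediate_as=None):
    """Map S/R/I calls to 0/1.

    intermediate_as decides what "I" becomes: None (missing), 0 or 1.
    An unknown call raises KeyError so that typos do not pass silently.
    """
    mapping = {"S": 0, "R": 1, "I": intermediate_as}
    return {
        isolate: mapping[call.strip().upper()]
        for isolate, call in calls.items()
    }


# Hypothetical susceptibility calls for one antibiotic.
calls = {"isolate-1": "S", "isolate-2": "R", "isolate-3": "I"}

strict = binarize_susceptibility(calls)                      # I -> missing
lenient = binarize_susceptibility(calls, intermediate_as=1)  # I -> resistant
print(strict)
print(lenient)
```

Deciding this explicitly keeps the cutoff under your control instead of leaving it to an automatic binarization step.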
> pandas.errors.ParserError
Can you send me the dataset?
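A quick way to narrow down this kind of ParserError before sharing the data is to count the fields on each line under the delimiter you passed. The sketch below uses only the standard library. The helper names and the toy input are made up for illustration, and the comparison against the header is a simplification rather than the exact rule pandas applies.

```python
import csv
import io


def field_counts(text, delimiter):
    """Number of fields on each row when split on the given delimiter."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [len(row) for row in reader]


def first_wider_line(text, delimiter):
    """Return (row_number, n_fields, n_header_fields) for the first row
    that has more fields than the header, or None if there is none.

    Row numbers are 1-based and equal physical line numbers as long as
    no quoted field contains a line break.
    """
    counts = field_counts(text, delimiter)
    for row_number, n_fields in enumerate(counts[1:], start=2):
        if n_fields > counts[0]:
            return row_number, n_fields, counts[0]
    return None


# Toy tab-delimited table whose cells contain comma-separated gene lists.
sample = "Orthogroup\tiso1\tiso2\nOG1\tg1\tg2\nOG2\tg3, g4\tg5, g6\n"

print(field_counts(sample, "\t"))     # consistent when split on tabs
print(field_counts(sample, ","))      # ragged when split on commas
print(first_wider_line(sample, ","))
```

If the counts are only consistent under a different delimiter than the one given on the command line, the delimiter or data-type argument is the first thing to check.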
Hello @MrTomRod
I am trying to run scoary2 using the following command:
scoary2 --genes /project/genomics/fatma/B1_vs_plant_vs_soil_vs_human_vs_aquatic/orthofinder/Orthofinder_prokka/OrthoFinder/Results_Jul25/Phylogenetic_Hierarchical_Orthogroups/N0.tsv --genes-data-type 'gene-list:\t' --gene-info N0_best_names.tsv --traits traits.tsv --trait-data-type 'binary:\t' --n-cpus 16 --outdir output
I am using the raw output file N0.tsv, and the traits file is binary: the traits are four groups representing the origin of the strain (plant, soil, human and aquatic), and each isolate is given 1 for the group to which it belongs and 0 for the other groups. But I am getting this error:
Loading traits...
Loading genes...
Welcome to Scoary2! (0.0.11)
Traceback (most recent call last):
File "/home/comi/fatma.mahmoud/venv/bin/scoary2", line 8, in <module>
sys.exit(main())
File "/home/comi/fatma.mahmoud/venv/lib/python3.10/site-packages/scoary/scoary.py", line 289, in main
fire.Fire(scoary)
File "/home/comi/fatma.mahmoud/venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/comi/fatma.mahmoud/venv/lib/python3.10/site-packages/fire/core.py", line 466, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/comi/fatma.mahmoud/venv/lib/python3.10/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/comi/fatma.mahmoud/venv/lib/python3.10/site-packages/scoary/scoary.py", line 112, in scoary
genes_orig_df, genes_bool_df = load_genes(
File "/home/comi/fatma.mahmoud/venv/lib/python3.10/site-packages/scoary/load_genes.py", line 142, in load_genes
genes_orig_df, genes_bool_df = load_gene_count_file(genes, delimiter, restrict_to, ignore)
File "/home/comi/fatma.mahmoud/venv/lib/python3.10/site-packages/scoary/load_genes.py", line 45, in load_gene_count_file
count_df = pd.read_csv(path, delimiter=delimiter, index_col=0)
File "/home/comi/fatma.mahmoud/venv/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/home/comi/fatma.mahmoud/venv/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/home/comi/fatma.mahmoud/venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/comi/fatma.mahmoud/venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 611, in _read
return parser.read(nrows)
File "/home/comi/fatma.mahmoud/venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1778, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/home/comi/fatma.mahmoud/venv/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1973, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 6, saw 4
I really appreciate your help
@Fatma116 The reason is that your argument is --genes-data-type, but it should be --gene-data-type!
@MrTomRod Sorry for this stupid mistake. Though I revised the command many times, I couldn't spot it, but it is working now. Thanks a lot for your help.
@bloomarun
Does the problem persist or can I close the issue?
Yes, you can close the issue. I will get back to you if I have any other queries. Thank you for your time and efforts.
> On Mon, 28 Aug 2023 at 7:24 PM, Thomas Roder wrote:
> @bloomarun Does the problem persist or can I close the issue?
Hello! I have Roary output data (gene_presence_absence.csv) for about 450 isolates belonging to a single species, and I have a traits file. However, the traits are not binary; they are continuous. The preprint says that Scoary2 can work with continuous traits. How should I format my traits file, and which flag tells Scoary2 that my data is continuous? P.S.: We are talking about something like the color of a petal, where there is incomplete dominance. The flower can be Pink (Dominant), White (Recessive) or Yellow (Hybrid). How do I pass these as traits?