Gio345L / STEC_dataset

Information on STEC isolates from multiple organisms, stored in NCBI to track the origin of the isolates.

Clean-up of dataset part1 #3

Open Gio345L opened 1 year ago

Gio345L commented 1 year ago

Clean up the following columns:

Strain needs to be removed

Isolate identifiers needs to be removed

Serovar: only the serotype information is needed. Remove any other information ("E. coli", ":", "Undetermined", double blank spaces) and replace pending/ongoing identification with "ongoing", for example: "Isolate to CDC:Isolate to CDC" to "ongoing"

Create date needs to be renamed to "date", and the time information needs to be removed, example: "2022-05-10T13:37:17Z" to "2022-05-10"

Location: state names need to be replaced with their abbreviations (example: "California" to "CA"), and the column needs to be split into 3 columns: Country, State and City

Isolation source needs to be renamed to "Isolation" and homogenized

A new column "Source" needs to be created with the classification Clinical, Food, Animal or Environmental, based on the Isolation column, example: "coyote feces" is classified as "Animal"
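
The clean-up steps listed above could be sketched roughly as follows, using only the Python standard library. This is a minimal sketch, not the project's actual script. Several things here are assumptions and not taken from the dataset: the lookup tables `STATE_ABBREVIATIONS`, `SOURCE_KEYWORDS` and `PENDING_MARKERS` are hypothetical and incomplete, the Location value is assumed to look like "USA:California, Davis", and the colon inside a serotype such as "O157:H7" is assumed to be worth keeping, so only stray leading or trailing colons are stripped. Lowercasing and collapsing whitespace stands in for the "homogenized" step, which will likely need more rules.

```python
import re

# Hypothetical lookup tables; extend them to cover the values in the real dataset.
STATE_ABBREVIATIONS = {"California": "CA", "Texas": "TX", "New York": "NY"}
PENDING_MARKERS = ("isolate to cdc", "pending", "ongoing")
SOURCE_KEYWORDS = {
    "Clinical": ("stool", "blood", "urine", "patient"),
    "Food": ("beef", "lettuce", "milk", "spinach"),
    "Animal": ("feces", "cattle", "coyote", "bovine"),
    "Environmental": ("water", "soil", "sediment"),
}


def clean_serovar(value):
    """Keep only the serotype; map pending identifications to 'ongoing'."""
    if any(marker in value.lower() for marker in PENDING_MARKERS):
        return "ongoing"
    cleaned = re.sub(r"E\. coli|Undetermined", " ", value)
    cleaned = re.sub(r"\s+", " ", cleaned)  # collapse double blank spaces
    return cleaned.strip(" :")  # drop stray colons and spaces at the ends


def clean_date(value):
    """Drop the time part: '2022-05-10T13:37:17Z' becomes '2022-05-10'."""
    return value.split("T", 1)[0]


def split_location(value):
    """Split an assumed 'Country:State, City' value into three columns."""
    country, _, rest = value.partition(":")
    state, _, city = rest.partition(",")
    state = state.strip()
    return {
        "Country": country.strip(),
        "State": STATE_ABBREVIATIONS.get(state, state),
        "City": city.strip(),
    }


def classify_source(isolation):
    """Return the first category whose keyword appears in the isolation text."""
    text = isolation.lower()
    for category, keywords in SOURCE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return ""  # left empty when no keyword matches


def clean_row(row):
    """Build a cleaned row; Strain and Isolate identifiers are not copied."""
    isolation = re.sub(r"\s+", " ", row["Isolation source"]).strip().lower()
    cleaned = {
        "Serovar": clean_serovar(row["Serovar"]),
        "date": clean_date(row["Create date"]),
        "Isolation": isolation,
        "Source": classify_source(isolation),
    }
    cleaned.update(split_location(row["Location"]))
    return cleaned


example = {
    "Strain": "ABC-1",
    "Isolate identifiers": "XYZ-1",
    "Serovar": "Isolate to CDC:Isolate to CDC",
    "Create date": "2022-05-10T13:37:17Z",
    "Location": "USA:California, Davis",
    "Isolation source": "Coyote  feces",
}
print(clean_row(example))
```

The functions work on one row at a time, so they can be applied to rows read with `csv.DictReader` or to a pandas DataFrame column by column. The keyword order in `SOURCE_KEYWORDS` matters: the first matching category wins, so ambiguous values are worth reviewing by hand.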