Open roedoejet opened 8 months ago
It seems that we don't have to-replace supported in the wizard. Will take a look at this.
I think that's fine. It's a bit advanced, and there isn't an obvious way (to me) to create the interaction in the wizard. I think it's alright if we just document it in the docs and tell people to adjust the configuration file if necessary.
It may confuse the user to set global_cleaner and dataset-specific_cleaner separately, while I totally agree that we should set these two cleaners. How about we set global_cleaner to collapse_white_space by default (in everyvoice.config.text_config.TextConfig), and ask the user to set the dataset-specific cleaners (in everyvoice.config.preprocessing_config.Dataset)?
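To make the proposal concrete, here is a minimal sketch of how a default global cleaner could combine with dataset-specific ones. It does not use the real EveryVoice classes: TextConfigSketch, DatasetSketch, clean_text and lower_case are hypothetical stand-ins, collapse_white_space is re-implemented here for illustration, and running the global cleaners before the dataset ones is an assumption rather than a decision made in this thread.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List

Cleaner = Callable[[str], str]


def collapse_white_space(text: str) -> str:
    """Stand-in cleaner: squeeze any run of whitespace into one space."""
    return re.sub(r"\s+", " ", text)


def lower_case(text: str) -> str:
    """Hypothetical dataset-specific cleaner used only for this example."""
    return text.lower()


@dataclass
class TextConfigSketch:
    # Hypothetical stand-in for TextConfig: cleaners applied to every dataset,
    # defaulting to collapse_white_space as suggested above.
    global_cleaners: List[Cleaner] = field(
        default_factory=lambda: [collapse_white_space]
    )


@dataclass
class DatasetSketch:
    # Hypothetical stand-in for Dataset: cleaners that only apply to this data.
    label: str
    cleaners: List[Cleaner] = field(default_factory=list)


def clean_text(text: str, text_config: TextConfigSketch, dataset: DatasetSketch) -> str:
    """Apply global cleaners first, then dataset cleaners (assumed order)."""
    for cleaner in [*text_config.global_cleaners, *dataset.cleaners]:
        text = cleaner(text)
    return text


if __name__ == "__main__":
    config = TextConfigSketch()
    dataset = DatasetSketch(label="example", cleaners=[lower_case])
    print(clean_text("Hello   World", config, dataset))
```

With this split, a user who never touches the text configuration still gets whitespace normalization, and only has to think about cleaners when a particular dataset needs them.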
Good idea!
I still think we should have cleaners defined on the everyvoice.config.text_config.TextConfig but we should rename them to global_cleaners and global_to_replace. There are some cleaners/to_replace rules that only apply to certain datasets, and those should be defined on everyvoice.config.preprocessing_config.Dataset.
In addition to adding the cleaners here, we also need to: