MiniPlayer closed this 3 years ago
The processor will be named URLCleaner, by analogy with URLDecoder. I suggest two solutions:
conf1 :
url.fields : [comma separated list] required
param.names : [comma separated list] required
new.url.fields : [comma separated list] defaults to the same value as "url.fields"
conflict.resolution.policy : [overwrite_existing|keep_only_old_field] required, defaults to overwrite_existing
remove.all.params : [boolean] required, defaults to false
or alternatively a configuration with dynamic properties, for finer per-URL control:
myurl : <url> the name of the URL
myurl.param.names : [comma separated list] required
myurl.new.field : [String] field in which to store the new value
myurl.conflict.resolution.policy : [overwrite_existing|keep_only_old_field]
myurl.remove.all.params : [boolean] required, defaults to false
In short, the same thing with dynamic properties, so that different parameters can be specified for different URLs.
I recommend solution 1, which is simpler and sufficient.
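To make the conflict.resolution.policy option more concrete, here is a minimal Python sketch of how the two values might behave when the cleaned URL is written to the target field. The semantics are an assumption on my side (the thread does not spell them out), and `apply_conflict_policy` and the record layout are hypothetical names used only for illustration:

```python
def apply_conflict_policy(record, new_field, cleaned_value,
                          policy="overwrite_existing"):
    """Write cleaned_value into record[new_field] according to the policy.

    Assumed semantics (not confirmed by the thread):
    - overwrite_existing: replace any value already stored in new_field
    - keep_only_old_field: leave an existing value untouched
    """
    if policy not in ("overwrite_existing", "keep_only_old_field"):
        raise ValueError(f"unknown policy: {policy}")
    if new_field in record and policy == "keep_only_old_field":
        return record  # the target field already exists, keep the old value
    record[new_field] = cleaned_value
    return record
```

Note that, under this reading, keep_only_old_field combined with a target field equal to the source field would never change the record, since the field always exists already.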
For the processor configuration, it would be worth being able to specify either a list of parameters to include or a list of parameters to exclude.
The two options cannot be used simultaneously. By default, all parameters are removed. In addition:
url.field.name: \<name>|\<name:newName> (this should support the conflict resolution policy, default value: OVERWRITE)
Okay, first some reading on URIs here:
The processor configuration is as follows:

| name | type | required | default |
|---|---|---|---|
| conflict.resolution.policy | [overwrite_existing, keep_only_old_field] | false | keep_only_old_field |
| url.fields | | true | |
| url.keep.params | comma separated list | false | |
| url.remove.params | comma separated list | false | |
| url.remove.all | boolean | false | true (unless url.keep.params or url.remove.params is set) |
| parameter.separator | character | false | & |
| key.value.separator | character | false | = |
So the minimal configuration is:
url.fields : field1
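To show how the defaults in the table might be applied to such a minimal configuration, here is a small Python sketch. It is not the processor code: `resolve_config` is a hypothetical helper, and treating url.fields as a comma separated list is an assumption carried over from the earlier proposal, since the table leaves its type blank.

```python
def split_list(value):
    """Turn 'a, b,c' into ['a', 'b', 'c'], ignoring empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def resolve_config(props):
    """Fill in the defaults listed in the table for a raw property dict."""
    if "url.fields" not in props:
        raise ValueError("url.fields is required")
    keep = split_list(props.get("url.keep.params", ""))
    remove = split_list(props.get("url.remove.params", ""))
    # url.remove.all defaults to true unless a keep or remove list is set
    default_remove_all = not keep and not remove
    return {
        # assumption: url.fields is a comma separated list
        "url.fields": split_list(props["url.fields"]),
        "url.keep.params": keep,
        "url.remove.params": remove,
        "url.remove.all": props.get("url.remove.all", default_remove_all),
        "conflict.resolution.policy": props.get(
            "conflict.resolution.policy", "keep_only_old_field"),
        "parameter.separator": props.get("parameter.separator", "&"),
        "key.value.separator": props.get("key.value.separator", "="),
    }
```

With only `{"url.fields": "field1"}` as input, this sketch resolves to removing all parameters, with `&` and `=` as separators.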
It is expected that the input is a URI which has not yet been decoded, so escaped characters are still escaped. This way we can accurately detect the parameter names and values, provided the syntax uses the expected separator characters (by default '&' and '='). The processor can also remove parameters that have no value or an empty value.
If the input is a decoded URI, we still try to remove the parameters, but the behaviour is not guaranteed in this case, as an unescaped value containing '&', '=' or '#', for example, could break the parsing.
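The approach described above could be sketched roughly as follows in Python. This is not the code from the PR: `clean_url` and its arguments are hypothetical, and the sketch only illustrates splitting the still-encoded query string on the configured separators without decoding anything.

```python
def clean_url(url, keep=None, remove=None, param_sep="&", kv_sep="="):
    """Remove query parameters from a still-encoded URL without decoding it.

    keep:   names to keep (all other parameters are dropped)
    remove: names to drop (all other parameters are kept)
    If neither is given, the whole query string is dropped.
    """
    if keep and remove:
        raise ValueError("keep and remove cannot be combined")
    # The fragment starts at the first '#', the query at the first '?' before it.
    base, hash_sep, fragment = url.partition("#")
    path, _, query = base.partition("?")
    kept = []
    if keep or remove:
        for param in query.split(param_sep):
            if not param:
                continue  # skip empty segments such as a trailing '&'
            # 'b' (no value) and 'c=' (empty value) both yield a usable name
            name = param.partition(kv_sep)[0]
            if keep:
                if name in keep:
                    kept.append(param)
            elif name not in remove:
                kept.append(param)
    cleaned = path
    if kept:
        cleaned += "?" + param_sep.join(kept)
    return cleaned + hash_sep + fragment
```

A usage example, with a made-up URL and parameter names:

```python
clean_url("https://example.com/p?utm_source=x&id=42#top", remove=["utm_source"])
# -> "https://example.com/p?id=42#top"
clean_url("https://example.com/p?q=a%26b&x=1", remove=["x"])
# -> "https://example.com/p?q=a%26b"  (the escaped '&' is not mistaken for a separator)
```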
Closing this, the PR is open and waiting for approval. https://github.com/Hurence/logisland/pull/568
Description
The field containing the URL value of the visited website MAY contain a set of query parameters. The following query parameters are undesired (btw the list might evolve):
The presence of these query parameters prevents client applications from processing the URL efficiently. For instance, they introduce noise when segmenting by visited page: the same visited page may be counted as two different pages.
Needs
Therefore, we need to be able to build a new field from a URL value, cleaned of a customizable set of undesired query parameters.