Hurence / logisland

Scalable stream processing platform for advanced real-time analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink is on the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of ready-to-use processors, data sources and sinks is available.
https://logisland.github.io

Strip URL query parameters processor #567

Closed MiniPlayer closed 3 years ago

MiniPlayer commented 3 years ago

Description

The field containing the URL value of the visited website MAY contain a set of query parameters. The following query parameters are undesired (btw the list might evolve):

The presence of these query parameters prevents client applications from processing the URL efficiently. For instance, they introduce noise when segmenting traffic per visited page: the same page may be counted as 2 different pages.

Needs

We therefore need to be able to build a new field from a URL value, cleaned of a customizable set of undesired query parameters.
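To make the segmentation problem concrete, here is a minimal Python sketch (not the processor's code). The parameter names `utm_source` and `gclid`, the example URLs and the helper `strip_params` are hypothetical, since the actual list of undesired parameters is not given above.

```python
from collections import Counter

# Hypothetical set of undesired parameter names (assumption for this sketch).
UNDESIRED = {"utm_source", "gclid"}


def strip_params(url, undesired):
    """Return url without the query parameters whose names are in `undesired`."""
    # Split off the fragment first, then the query string.
    base, hash_sep, fragment = url.partition("#")
    path, _, query = base.partition("?")
    # Keep every non-empty "name=value" chunk whose name is not undesired.
    kept = [p for p in query.split("&") if p and p.split("=", 1)[0] not in undesired]
    return path + ("?" + "&".join(kept) if kept else "") + hash_sep + fragment


visits = [
    "https://example.com/pricing?utm_source=mail",
    "https://example.com/pricing?gclid=abc123",
    "https://example.com/pricing",
]

raw_pages = Counter(visits)                                         # three distinct "pages"
clean_pages = Counter(strip_params(u, UNDESIRED) for u in visits)   # a single page
```

Without cleaning, the three visits are counted as three pages; after cleaning they collapse into one entry for `https://example.com/pricing`.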

MiniPlayer commented 3 years ago

The processor will be named URLCleaner, by analogy with URLDecoder. I suggest two solutions:

conf1:

url.fields : [comma separated list]            required
param.names : [comma separated list]           required
new.url.fields : [comma separated list]        defaults to the same value as "url.fields"
conflict.resolution.policy : [overwrite_existing|keep_only_old_field]    required    defaults to overwrite_existing
remove.all.params : [boolean]                  required    defaults to false

or else a configuration with dynamic properties, for finer per-URL control:

myurl : <url>                                  the name of the URL field
myurl.param.names : [comma separated list]     required
myurl.new.field : [String]                     field in which to store the new value
myurl.conflict.resolution.policy : [overwrite_existing|keep_only_old_field]
myurl.remove.all.params : [boolean]            required    defaults to false

In short, the same thing with dynamic properties, so that different parameters can be specified for different URLs.

I recommend solution 1, which is simpler and sufficient.
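As a rough illustration of how conflict.resolution.policy could behave, here is a Python sketch. The exact semantics are not spelled out in this thread, so this is an assumption: a conflict only arises when the target field differs from the source field and already exists in the record. The function `write_cleaned_url` and the dict-based record are hypothetical.

```python
def write_cleaned_url(record, source_field, cleaned_value,
                      target_field=None, policy="overwrite_existing"):
    """Store cleaned_value in the record, honouring the conflict policy.

    Assumption: without a target field, the source field is rewritten in place.
    """
    target = target_field or source_field
    conflict = target != source_field and target in record
    if conflict and policy == "keep_only_old_field":
        # The target already holds a value: leave the record untouched.
        return record
    record[target] = cleaned_value
    return record


record = {"url": "http://host/p?x=1", "clean_url": "previous value"}
kept = write_cleaned_url(dict(record), "url", "http://host/p",
                         "clean_url", "keep_only_old_field")
overwritten = write_cleaned_url(dict(record), "url", "http://host/p",
                                "clean_url", "overwrite_existing")
```

Here `kept["clean_url"]` still holds the previous value, while `overwritten["clean_url"]` holds the cleaned URL.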

jerome73 commented 3 years ago

For the processor configuration, it would be worthwhile to be able to specify either a list of parameters to include or a list of parameters to exclude.

The two options cannot be used simultaneously. The default removes all parameters. In addition:

MiniPlayer commented 3 years ago

Okay, first some reading on URIs here:

The processor configuration looks like this:

| name | type | required | default |
|---|---|---|---|
| conflict.resolution.policy | [overwrite_existing, keep_only_old_field] | false | keep_only_old_field |
| url.fields | comma separated list of fields | true | |
| url.keep.params | comma separated list | false | |
| url.remove.params | comma separated list | false | |
| url.remove.all | boolean | false | true (unless url.keep.params or url.remove.params is set) |
| parameter.separator | character | false | & |
| key.value.separator | character | false | = |

So the minimal configuration is:

url.fields : field1
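The defaults above can be sketched as a small resolution step in Python. This is not the actual processor code: `resolve_config`, `split_list` and the dict-based configuration are hypothetical, and rejecting a configuration that sets both url.keep.params and url.remove.params is an assumption based on the earlier remark that the two options cannot be used simultaneously.

```python
# Defaults taken from the configuration table; everything else is illustrative.
DEFAULTS = {
    "conflict.resolution.policy": "keep_only_old_field",
    "parameter.separator": "&",
    "key.value.separator": "=",
}


def split_list(value):
    """Turn a comma separated string into a list of trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def resolve_config(conf):
    """Apply defaults to a raw property dict and return the effective settings."""
    if "url.fields" not in conf:
        raise ValueError("url.fields is required")
    keep = split_list(conf.get("url.keep.params", ""))
    remove = split_list(conf.get("url.remove.params", ""))
    if keep and remove:
        # Assumption: keep and remove lists are mutually exclusive.
        raise ValueError("url.keep.params and url.remove.params are exclusive")
    resolved = dict(DEFAULTS)
    resolved.update({k: v for k, v in conf.items() if k in DEFAULTS})
    resolved["url.fields"] = split_list(conf["url.fields"])
    resolved["url.keep.params"] = keep
    resolved["url.remove.params"] = remove
    # url.remove.all defaults to true unless a keep or remove list is given.
    resolved["url.remove.all"] = conf.get("url.remove.all", not keep and not remove)
    return resolved


minimal = resolve_config({"url.fields": "field1"})
```

With only `url.fields` set, `minimal["url.remove.all"]` resolves to `True` and the separators fall back to `&` and `=`.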

The input is expected to be a URI that has not yet been decoded, so escaped characters are still escaped. This way we can accurately detect parameter names and values, provided the syntax uses the configured separator characters (by default '&' and '='). The processor can also remove parameters that have no value or an empty value.
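A minimal Python sketch of this behaviour on a still-encoded URI, under stated assumptions: `clean_query` is a hypothetical helper rather than the processor's implementation, parameter names are compared in their raw (escaped) form, and with neither list given every parameter is dropped.

```python
def clean_query(url, keep=None, remove=None, param_sep="&", kv_sep="="):
    """Filter the query string of a not-yet-decoded URI.

    keep:   names to keep (all others are dropped)
    remove: names to drop (all others are kept)
    With neither given, every parameter is dropped.
    """
    base, hash_sep, fragment = url.partition("#")
    path, query_sep, query = base.partition("?")
    if not query_sep:
        return url  # no query string at all, nothing to clean
    kept = []
    for param in query.split(param_sep):
        if not param:
            continue  # skip empty chunks such as in "a=1&&b=2"
        # "b" (no value) and "c=" (empty value) both yield a usable name.
        name = param.split(kv_sep, 1)[0]
        if keep is not None:
            wanted = name in keep
        elif remove is not None:
            wanted = name not in remove
        else:
            wanted = False
        if wanted:
            kept.append(param)
    query_out = param_sep.join(kept)
    return path + ("?" + query_out if query_out else "") + hash_sep + fragment


url = "http://host/p?a=1&b&c=&d=x%26y=z#top"
cleaned = clean_query(url, remove={"b", "c"})
```

Because the escaped `%26` is left untouched, the value of `d` stays in one piece and `cleaned` is `http://host/p?a=1&d=x%26y=z#top`.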

If the input is a decoded URI, we still try to remove the parameters, but the behaviour is not guaranteed in this case: an unescaped value containing '&', '=' or '#', for example, could break the parsing.
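A short Python sketch of why decoded input is ambiguous. The query string and the helper `param_names` are made up for illustration; `unquote` is the standard library percent-decoder.

```python
from urllib.parse import unquote


def param_names(query, param_sep="&", kv_sep="="):
    """List the parameter names found by splitting on the separators."""
    return [p.split(kv_sep, 1)[0] for p in query.split(param_sep) if p]


# The value of "q" is the literal text "a&b=c", escaped as it should be.
encoded_query = "q=a%26b%3Dc&page=2"
decoded_query = unquote(encoded_query)  # "q=a&b=c&page=2"

names_encoded = param_names(encoded_query)  # ["q", "page"]
names_decoded = param_names(decoded_query)  # ["q", "b", "page"]
```

Once decoded, the value of `q` is split in two and a spurious parameter `b` appears, so a rule such as "remove b" would corrupt the value of `q`.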

MiniPlayer commented 3 years ago

Closing this, the PR is open and waiting for approval. https://github.com/Hurence/logisland/pull/568