Replace - Githubissues

AcckiyGerman commented 6 years ago

As a pipeline user I want to be able to find some strings and replace them automatically when processing data, so I don't need to do it manually.

As a tabulator-py user I want to find and replace strings in my data, using one simple parameter, so that I don't need to write a post-processor script.

Dataset examples where we need the ability to replace

https://github.com/datasets/cash-surplus-deficit/blob/master/scripts/process.py [lines 27,28]
https://github.com/datasets/co2-ppm/blob/master/scripts/process.sh [line 42]
https://github.com/datasets/gdp-us/blob/master/scripts/process.py [lines 36-45]

Analyse

replace format

I will use a dictionary to pass arguments into the Stream constructor, so that we can later extend 'replace' with more keywords:
replace={'old': 'q1', 'new': '-03-31'}

For several replacements we could use a list of dictionaries:

replace=[
{'old': 'q1', 'new': '-03-31'},
{'old': 'q2', 'new': '-06-30'}
]
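
The two formats above (a single dictionary or a list of dictionaries) could be handled by one small helper. This is only a sketch of the idea, not code from tabulator-py: the names `normalize_rules` and `apply_replacements` are hypothetical, and it assumes the rules are applied in order as plain substring replacements.

```python
def normalize_rules(replace):
    """Accept a single rule dict or a list of rule dicts; always return a list."""
    if isinstance(replace, dict):
        return [replace]
    return list(replace)


def apply_replacements(value, replace):
    """Apply each {'old': ..., 'new': ...} rule, in order, to a string value."""
    if not isinstance(value, str):
        return value  # leave numbers, None, etc. untouched
    for rule in normalize_rules(replace):
        value = value.replace(rule['old'], rule['new'])
    return value


rules = [
    {'old': 'q1', 'new': '-03-31'},
    {'old': 'q2', 'new': '-06-30'},
]
print(apply_replacements('2017q1', rules))  # 2017-03-31
print(apply_replacements('2017q2', {'old': 'q2', 'new': '-06-30'}))  # 2017-06-30
```

Normalizing to a list first keeps the rest of the code identical for both input shapes.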

[ ] [+2h] To enable RegExp we could pass 'regex': True. Here's a real example from a script that needs automating: replace('\.|"|,|\'|:|-|\(|\)', '', regex=True). In that case the parameter would look like this:
```
replace={
'old': '\.|"|,|\'|:|-|\(|\)',
'new': '',
'regex': True
}
```
[ ] [+2h] We could specify the column to apply the replace function to (e.g. we need to replace 'q1' with '-03-31', but 'q1' could also appear in other columns that we do not want to break):
```
replace={
'old': 'q1',
'new': '-03-31',
'regex': True,
'column': N or 'name'  # column index or column name
}
```
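
The 'regex' and 'column' keywords proposed above might be interpreted as in the following sketch. It is an assumption about how such a rule could work, not an existing tabulator-py feature: `replace_in_row` is a hypothetical helper, 'column' is taken to be either a zero-based index or a header name, and a missing 'column' means every column.

```python
import re


def replace_in_row(row, headers, rule):
    """Return a copy of row with the rule applied to the targeted cells."""
    old, new = rule['old'], rule['new']
    use_regex = rule.get('regex', False)
    column = rule.get('column')  # None means "apply to every column"

    # A column given by name is translated to its position in the headers.
    if isinstance(column, str):
        column = headers.index(column)

    result = []
    for index, cell in enumerate(row):
        targeted = column is None or index == column
        if targeted and isinstance(cell, str):
            cell = re.sub(old, new, cell) if use_regex else cell.replace(old, new)
        result.append(cell)
    return result


headers = ['period', 'note']
row = ['2017q1', 'q1 estimate']
rule = {'old': 'q1', 'new': '-03-31', 'column': 'period'}
print(replace_in_row(row, headers, rule))  # ['2017-03-31', 'q1 estimate']

# The punctuation-stripping pattern from the real script quoted above.
punctuation = {'old': r'\.|"|,|\'|:|-|\(|\)', 'new': '', 'regex': True}
print(replace_in_row(['U.S. (total)', '1,000'], ['name', 'value'], punctuation))
# ['US total', '1000']
```

Restricting the rule to the 'period' column leaves the 'q1' in the 'note' column intact, which is the breakage the column keyword is meant to avoid.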

Tasks

[x] analyze [30m]
[x] learn tabulator-py docs, run examples [20m]
[x] find and analyse the 'Stream' commands parser [20m]
[x] write tests for a new processor, replace: "old", "new" [20m]
[ ] create the function [50m]
[ ] check all the old and new tests are working [10m]
[ ] docs: update 'Stream commands' and create a 'replace' section [25m]
[ ] do post-review fixes [20m]
[ ] PR to main repo [20m]
[ ] apply post-review fixes to the PR, if any [0..20m]

Overall [4 to 4.5h]

Original: https://github.com/AcckiyGerman/tabulator-py/issues/2

pwalsh commented 6 years ago

@AcckiyGerman this is very out of scope for the Stream constructor. The post_parse API is designed for these types of use cases - please use it.

@roll I recommend we close this, if you agree.
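
For readers following along, the post_parse approach mentioned above could look roughly like the sketch below. It assumes, as described in the tabulator README, that a post_parse processor is a generator function that receives extended rows as (row_number, headers, row) tuples and yields them back. The factory name `make_replace_processor` is hypothetical, and the sample data is made up for illustration.

```python
def make_replace_processor(old, new):
    """Build a processor that replaces `old` with `new` in every string cell."""
    def processor(extended_rows):
        for row_number, headers, row in extended_rows:
            row = [
                cell.replace(old, new) if isinstance(cell, str) else cell
                for cell in row
            ]
            yield (row_number, headers, row)
    return processor


# Stand-in for the extended rows a stream would produce.
sample = [(1, ['period', 'value'], ['2017q1', 10])]
replace_q1 = make_replace_processor('q1', '-03-31')
print(list(replace_q1(sample)))
# [(1, ['period', 'value'], ['2017-03-31', 10])]
```

With tabulator installed, such a processor would presumably be passed as `Stream(source, headers=1, post_parse=[replace_q1])`, which keeps the replacement logic out of the Stream constructor itself.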

AcckiyGerman commented 6 years ago

Yes, I agree - the Stream constructor is the wrong place for such a function :+1:

roll commented 6 years ago

@AcckiyGerman You could always keep a set of tabulator processors inside your own project. It could also be released on PyPI as a separate module providing a set of processors for tabulator. We could look into enabling a plugin system if you are interested. We have a working example for tableschema-py.

AcckiyGerman commented 6 years ago

@roll @pwalsh Thanks for the replies! I'm still learning the frictionless code infrastructure with the aim of using pipelines. I will try to use processor plugins in the pipeline conf file. If I run into any trouble I'll ask in your gitter channel :)

frictionlessdata / tabulator-py

Replace #227
