Closed anuveyatsu closed 5 years ago
@anuveyatsu think problem is in regex. Following works fine
>>> from dataflows import Flow, unpivot
>>> data = [{'2000': 'a1', '2001': 'b1', '2002': 'c1'}, {'2000': 'a2', '2001': 'b2', '2002': 'c2'}, {'2000': 'a3', '2001': 'b3', '2002': 'c3'}]
>>> unpivoting_fields = [{'name': '([0-9]{4})', 'keys': {'year': r'\1'}}]
>>> extra_keys = [{'name': 'year', 'type': 'year'}]
>>> extra_value = {'name': 'value', 'type': 'string'}
>>> Flow(data, unpivot(unpivoting_fields, extra_keys, extra_value)).results()[0]
[[{'year': 2000, 'value': 'a1'}, {'year': 2001, 'value': 'b1'}, {'year': 2002, 'value': 'c1'}, {'year': 2000, 'value': 'a2'}, {'year': 2001, 'value': 'b2'}, {'year': 2002, 'value': 'c2'}, {'year': 2000, 'value': 'a3'}, {'year': 2001, 'value': 'b3'}, {'year': 2002, 'value': 'c3'}]]
>>>
The regex that @zelima worked for me also if we want to replace the year with date format DD-MM-YYYY, we can do as:
unpivoting_fields = [{'name': '([0-9]{4})', 'keys': {'year': '01-01-' r'\1'}}]
@stefanbole @zelima great but I think the instructions are quite confusing without examples. Maybe we can add one here - https://github.com/datahq/dataflows/blob/master/PROCESSORS.md#unpivotpy
@anuveyatsu @stefanbole Dataflows is child of dpp that has more detailed documentation with examples. I usually cross-check here, when need some extra help https://github.com/frictionlessdata/datapackage-pipelines/#unpivot
I have a guess why your regex may have failed. When I used this in dataflows I used group () in the names and then was able to use \1 \2 etc in the keys to so it would have the resulting matches.
Instead of:
unpivoting_fields = [ {'name': r'\d{4}', 'keys': {'year': r'\d{4}'}}
Try something like this:
{'name': '(\d{4})', 'keys': {'year':r "\1"}}
- I may have messed up the ' vs " and r there but you get the idea.
Here is an example from when I did this in dataflows:
I used the regex to great effect transforming columns with station and depth in the column names.
protein_id,station4_20,station4_60,station4_125,station4_200,station4_300,...,station14_600
1,1,1,86,143,115,134,161,0,0,0,367,345,285,279,306,0,0,0,0,38,79,120,123,188,122,192,196,235,206,187,126,0,0,16,132,239,170,246,204,3,14,0,2,0,272,93,156,373,301,158,223,...221
to:
protein_id,station,depth,spectral_count
1,4,20,1
1,4,60,1
1,4,125,86
1,4,200,143
and in the flow this is my unpivot:
unpivot(
[
dict(name='station([0-9]*)_([0-9]*)',keys=dict(station=r"\1",depth=r"\2"))
],
[ #give the new column names and types
dict(
name='station',
type='integer'
),
dict(
name='depth',
type='integer'
),
],
dict( #new column for the orginal value
name='spectral_count',
type='integer'
),
resources='protein_spectral_counts'
),
The regex that @zelima worked for me also if we want to replace the year with date format DD-MM-YYYY, we can do as:
unpivoting_fields = [{'name': '([0-9]{4})', 'keys': {'year': '01-01-' r'\1'}}]
Sorry, I missed that @zelima had the fix and this was just open still for documentation. In any case, that is another example.
FIXED.
How would I unpivot the following table using regex:
I'd call
unpivot
like below:but this results to:
am I missing something?
Once we figure it out, I will update the docs as it would be great to have an example for this one :smile: