datahq / dataflows

DataFlows is a simple, intuitive lightweight framework for building data processing flows in python.
https://dataflows.org
MIT License
194 stars 39 forks source link

Unpivoting with regex #72

Closed anuveyatsu closed 5 years ago

anuveyatsu commented 5 years ago

How would I unpivot the following table using regex:

2000,2001,2002
a1,b1,c1,d1
a2,b2,c2,d2

I'd call unpivot like below:

unpivoting_fields = [
    {'name': r'\d{4}', 'keys': {'year': r'\d{4}'}}
]
extra_keys = [
    {'name': 'year', 'type': 'year'}
]
extra_value = {'name': 'value', 'type': 'string'}

unpivot(unpivoting_fields, extra_keys, extra_value)

but this results to:

year,value
\\d{4},a1
\\d{4},a2
\\d{4},b1
...

am I missing something?

Once we figure it out, I will update the docs as it would be great to have an example for this one :smile:

zelima commented 5 years ago

@anuveyatsu think problem is in regex. Following works fine

>>> from dataflows import Flow, unpivot
>>> data = [{'2000': 'a1', '2001': 'b1', '2002': 'c1'}, {'2000': 'a2', '2001': 'b2', '2002': 'c2'}, {'2000': 'a3', '2001': 'b3', '2002': 'c3'}]
>>> unpivoting_fields = [{'name': '([0-9]{4})', 'keys': {'year': r'\1'}}]
>>> extra_keys = [{'name': 'year', 'type': 'year'}]
>>> extra_value = {'name': 'value', 'type': 'string'}
>>> Flow(data, unpivot(unpivoting_fields, extra_keys, extra_value)).results()[0]
[[{'year': 2000, 'value': 'a1'}, {'year': 2001, 'value': 'b1'}, {'year': 2002, 'value': 'c1'}, {'year': 2000, 'value': 'a2'}, {'year': 2001, 'value': 'b2'}, {'year': 2002, 'value': 'c2'}, {'year': 2000, 'value': 'a3'}, {'year': 2001, 'value': 'b3'}, {'year': 2002, 'value': 'c3'}]]
>>> 
stefanbole commented 5 years ago

The regex that @zelima worked for me also if we want to replace the year with date format DD-MM-YYYY, we can do as:

unpivoting_fields = [{'name': '([0-9]{4})', 'keys': {'year': '01-01-' r'\1'}}]

anuveyatsu commented 5 years ago

@stefanbole @zelima great but I think the instructions are quite confusing without examples. Maybe we can add one here - https://github.com/datahq/dataflows/blob/master/PROCESSORS.md#unpivotpy

zelima commented 5 years ago

@anuveyatsu @stefanbole Dataflows is child of dpp that has more detailed documentation with examples. I usually cross-check here, when need some extra help https://github.com/frictionlessdata/datapackage-pipelines/#unpivot

adyork commented 5 years ago

I have a guess why your regex may have failed. When I used this in dataflows I used group () in the names and then was able to use \1 \2 etc in the keys to so it would have the resulting matches.

Instead of:

unpivoting_fields = [ {'name': r'\d{4}', 'keys': {'year': r'\d{4}'}}

Try something like this:

{'name': '(\d{4})', 'keys': {'year':r "\1"}}

  • I may have messed up the ' vs " and r there but you get the idea.

Here is an example from when I did this in dataflows:

I used the regex to great effect transforming columns with station and depth in the column names.

    protein_id,station4_20,station4_60,station4_125,station4_200,station4_300,...,station14_600
    1,1,1,86,143,115,134,161,0,0,0,367,345,285,279,306,0,0,0,0,38,79,120,123,188,122,192,196,235,206,187,126,0,0,16,132,239,170,246,204,3,14,0,2,0,272,93,156,373,301,158,223,...221              

to:

    protein_id,station,depth,spectral_count
    1,4,20,1
    1,4,60,1
    1,4,125,86
    1,4,200,143

and in the flow this is my unpivot:

    unpivot(
                [
                    dict(name='station([0-9]*)_([0-9]*)',keys=dict(station=r"\1",depth=r"\2"))
                ],
                [ #give the new column names and types
                    dict(
                        name='station',
                        type='integer'
                    ),
                    dict(
                        name='depth',
                        type='integer'
                    ),

                ],
                dict( #new column for the orginal value
                    name='spectral_count',
                    type='integer'
                ),
                resources='protein_spectral_counts'
            ),
adyork commented 5 years ago

The regex that @zelima worked for me also if we want to replace the year with date format DD-MM-YYYY, we can do as:

unpivoting_fields = [{'name': '([0-9]{4})', 'keys': {'year': '01-01-' r'\1'}}]

Sorry, I missed that @zelima had the fix and this was just open still for documentation. In any case, that is another example.

rufuspollock commented 5 years ago

FIXED.