frictionlessdata / tabulator-py

Python library for reading and writing tabular data via streams.

https://frictionlessdata.io

MIT License

A new parameter to capture skipped rows metadata as a column #331

Closed cschloer closed 4 years ago

cschloer commented 4 years ago

Overview

Hi,

I'd like to propose a new tabulator parameter called "skipped_rows_capture" or something similar. It takes a list of dicts, each containing a regular expression string with one capture group and a string giving a column name. Each regular expression is then matched against every skipped row in the data, and the captured text becomes the value of the named column.

For example:


skipped_rows_capture = [{ 'regex': r'\*\* Latitud (.*)$', 'name': 'latitude' }]
skip_rows = ['**']

Would match the comment/skipped line:

** Latitud 10 29.99

And create a new column whose captured value is repeated on every data row:

latitude,
10 29.99
10 29.99
...
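To make the proposed behaviour concrete, here is a minimal sketch in plain Python using only the standard library. The function name `apply_skipped_rows_capture` is hypothetical and is not part of tabulator. The sketch assumes that `skip_rows` entries are line prefixes, that the first kept row is the header, and that each captured value is appended to every data row.

```python
import csv
import re


def apply_skipped_rows_capture(lines, skip_rows, skipped_rows_capture):
    """Hypothetical helper illustrating the proposal; not a tabulator API."""
    patterns = [(re.compile(c['regex']), c['name']) for c in skipped_rows_capture]
    captured = {}
    kept = []
    for line in lines:
        # Assumption: a row is skipped when it starts with one of the prefixes.
        if any(line.startswith(prefix) for prefix in skip_rows):
            for pattern, name in patterns:
                match = pattern.search(line)
                if match:
                    captured[name] = match.group(1)
            continue
        kept.append(line)

    rows = list(csv.reader(kept))
    if not rows:
        return rows

    # Assumption: the first kept row holds the headers.
    header, body = rows[0], rows[1:]
    names = [name for _, name in patterns if name in captured]
    return [header + names] + [row + [captured[n] for n in names] for row in body]


lines = ['id,name', '** Latitud 10 29.99', '1,english', '2,german']
result = apply_skipped_rows_capture(
    lines,
    skip_rows=['**'],
    skipped_rows_capture=[{'regex': r'\*\* Latitud (.*)$', 'name': 'latitude'}],
)
print(result)
```

The captured value is kept as a string here, because a value such as "10 29.99" cannot be converted to a number without knowing the coordinate format.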

roll commented 4 years ago

@cschloer With tabulator you can extract information like this using post_parse - https://github.com/frictionlessdata/tabulator-py#post-parse

For example, given this file (tmp/issue331.csv):

id,name
1,english
** Lat 50
1,german

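import re
from tabulator import Stream

def capture(store, name, regex):
    # Build a post_parse processor that saves the captured value in `store`
    # under `name` and drops the matching row from the output.
    pattern = re.compile(regex)
    def processor(extended_rows):
        for row_number, headers, row in extended_rows:
            # Only the first cell is checked; an empty row is treated as ''.
            match = pattern.match(row[0] if row else '')
            if match:
                # int() fits the "** Lat 50" sample; keep the string for other formats.
                store[name] = int(match.group(1))
                continue
            yield (row_number, headers, row)
    return processor

store = {}
with Stream('tmp/issue331.csv', post_parse=[capture(store, 'lat', r'^\*\* Lat (.*)')]) as stream:
    print(stream.read())  # [['id', 'name'], ['1', 'english'], ['1', 'german']]
    print(store)  # {'lat': 50}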

roll commented 4 years ago

Reshaping, such as adding a column, is out of scope for tabulator. If you're interested in having such an extractor available in DPP, we can think about a dataflows processor or a load parameter to achieve the goal. Of course, if you only need it in Python, you can just use the snippet above.
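If the extra column is still wanted in plain Python, one option is to append the captured values once the stream has been read completely. The following is only a sketch: `add_captured_columns` is a hypothetical helper, and it assumes the first row holds the headers and that `rows` and `store` look like the output of the snippet above.

```python
def add_captured_columns(rows, store):
    """Hypothetical helper: append each captured value as a constant column."""
    if not rows:
        return []
    names = list(store)
    # Assumption: the first row holds the headers.
    header, body = rows[0], rows[1:]
    return [header + names] + [row + [store[n] for n in names] for row in body]


# Values copied from the printed output of the snippet above.
rows = [['id', 'name'], ['1', 'english'], ['1', 'german']]
store = {'lat': 50}
extended = add_captured_columns(rows, store)
print(extended)
```

This has to run after reading, because the captured value is only known once the comment row has been seen, and that row may come after some of the data rows.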

Please create a DPP issue if this is still needed, or re-open this one if you still think it would be a good addition to tabulator.