chanelcolgate / hydroelectric-project


Data Storage and Ingestion #19

Open chanelcolgate opened 2 years ago

chanelcolgate commented 2 years ago

### Streaming File Content with a Generator
- Download the dataset and store it on your local machine:
```python
import logging
import shutil
import urllib3

logging.basicConfig(level=logging.INFO)

def download_dataset(url, LOCAL_FILE_NAME):
  urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
  c = urllib3.PoolManager()
  with c.request("GET", url, preload_content=False) as res, open(
      LOCAL_FILE_NAME, "wb"
  ) as out_file:
    shutil.copyfileobj(res, out_file)
  logging.info("Download completed.")

logging.info("Started download script")
URL = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
LOCAL_FILE_NAME = "data/pima-indians-diabetes.data.csv"
download_dataset(URL, LOCAL_FILE_NAME)
```

- The columns in this file are:
```python
['Pregnancies','Glucose','BloodPressure',
 'SkinThickness','Insulin','BMI',
 'DiabetesPedigree','Age','Outcome']
```
- Read the file into a pandas DataFrame, passing these column names:
```python
import pandas as pd

file_path = 'data/'
file_name = 'pima-indians-diabetes.data.csv'

col_name = ['Pregnancies','Glucose','BloodPressure',
            'SkinThickness','Insulin','BMI',
            'DiabetesPedigree','Age','Outcome']
pd.read_csv(file_path + file_name, names=col_name)
```

- Since we want to stream this dataset, it is more convenient to read it as a CSV file and use a generator to output the rows. The way to do this is through the following code:
```python
import csv
file_path = 'data/'
file_name = 'pima-indians-diabetes.data.csv'

# Use `open` to create a file handle object, `csvfile`,
# that knows where the file is stored.
with open(file_path + file_name, newline='\n') as csvfile:
  # Pass it to the `reader` function in the csv library.
  f = csv.reader(csvfile, delimiter=',')
  # `f` is a lazy reader object; rows are read from disk as
  # the loop requests them, not loaded into memory all at once.
  # To inspect the file, iterate over the rows:
  for row in f:
    print(','.join(row))
```
- To stream the file, wrap the row iteration in a generator:
```python
def stream_file(file_handle):
  # Iterate through the file in `file_handle` row by row,
  # yielding one row at a time.
  for row in file_handle:
    yield row

with open(file_path + file_name, newline='\n') as handle:
  # Pass `handle` to the generator function `stream_file`.
  for part in stream_file(handle):
    # Print what the generator yields.
    print(part)
```
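The pattern above can be exercised without the downloaded file. A minimal self-contained sketch using an in-memory buffer in place of the file handle (the row data here is made up to match the column layout):

```python
import csv
import io

def stream_file(file_handle):
  # Generator: yield the file's rows one at a time,
  # without loading the whole file into memory.
  for row in file_handle:
    yield row

# A made-up two-row CSV standing in for the downloaded file.
buffer = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

# csv.reader accepts any iterable of lines, so the generator
# can be chained directly into it.
rows = [part for part in csv.reader(stream_file(buffer))]
print(rows[0][0])  # first field of the first row
```

Because `stream_file` yields lazily, only one line at a time is held in memory, which is what makes this pattern suitable for files larger than RAM.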
### Setting Up a Pattern for Filenames
- Download data and store it on your local machine
```python
import logging, urllib3, shutil
logging.basicConfig(level=logging.INFO)

def download_dataset(url, LOCAL_FILE_NAME):
  urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
  c = urllib3.PoolManager()
  with c.request("GET", url, preload_content=False) as res, open(
      LOCAL_FILE_NAME, "wb"
  ) as out_file:
    shutil.copyfileobj(res, out_file)
  logging.info("Download completed.")

logging.info("Started download script")
URL = 'https://covid.ourworldindata.org/data/owid-covid-data.csv'
LOCAL_FILE_NAME = "dataset/owid-covid-data.csv"
download_dataset(URL, LOCAL_FILE_NAME)
```

- Assuming the downloaded CSV has been split into smaller parts named `owid-covid-data-part*`, collect the part filenames with a glob pattern:
```python
import tensorflow as tf

base_pattern = 'dataset'
file_pattern = 'owid-covid-data-part*'
files = tf.io.gfile.glob(base_pattern + '/' + file_pattern)
```

- `files` is a list that contains all the CSV filenames that are part of the original CSV, in no particular order
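For local paths, `tf.io.gfile.glob` follows the same wildcard rules as Python's standard `glob` module, so the matching can be sketched without TensorFlow. A small self-contained illustration using a temporary directory (the filenames are created here just for the demonstration):

```python
import glob
import os
import tempfile

# Create a few part files plus one unrelated file.
tmp = tempfile.mkdtemp()
for i in range(3):
  open(os.path.join(tmp, f'owid-covid-data-part{i}'), 'w').close()
open(os.path.join(tmp, 'other.csv'), 'w').close()

# The wildcard matches only the part files; the result order
# is not guaranteed, so sort it when order matters.
files = sorted(glob.glob(tmp + '/owid-covid-data-part*'))
print(len(files))  # 3
```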
#### Creating a Streaming Dataset Object
- For the purposes of this example, `new_deaths` is selected as the target column:
```python
csv_dataset = tf.data.experimental.make_csv_dataset(
    files,
    header=True,
    batch_size=5,
    label_name='new_deaths',
    num_epochs=1,
    ignore_errors=True
)
```
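For intuition, what `make_csv_dataset` does here can be sketched in plain Python: read each part, split off the label column, and yield fixed-size batches of (features, labels). This is a simplified illustration with made-up file contents, not TensorFlow's implementation:

```python
import csv
import io

def batched_labeled_rows(csv_texts, label_name, batch_size):
  # Yield (features, labels) batches drawn from several CSV
  # parts, mimicking the shape of make_csv_dataset's output.
  batch_feats, batch_labels = [], []
  for text in csv_texts:
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
      label = row.pop(label_name)
      batch_feats.append(row)
      batch_labels.append(label)
      if len(batch_feats) == batch_size:
        yield batch_feats, batch_labels
        batch_feats, batch_labels = [], []

# Two made-up CSV parts, each with its own header row.
parts = [
    "location,new_cases,new_deaths\nA,10,1\nB,20,2\n",
    "location,new_cases,new_deaths\nC,30,3\nD,40,4\n",
]
batches = list(batched_labeled_rows(parts, 'new_deaths', batch_size=2))
```

As in the real dataset, each batch pairs a features structure with the values of the `label_name` column stripped out of the rows.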

- Download the flower photos dataset and untar it:
```python
import tensorflow as tf

data_dir = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True
)
```

#### There are three steps to streaming the images. Let's look more closely:
- Create an `ImageDataGenerator` object and specify normalization parameters. Use the `rescale` parameter to indicate the normalization scale and the `validation_split` parameter to specify that 20% of data will be set aside for cross validation:
```python
datagen_kwargs = dict(rescale=1./255, validation_split=0.20)
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    **datagen_kwargs
)
```
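The `rescale=1./255` parameter multiplies every pixel value by 1/255, mapping 8-bit intensities into the [0, 1] range before the images reach the model. A quick sketch of that normalization with NumPy (the pixel array here is made up):

```python
import numpy as np

# A made-up 2x2 "image" with 8-bit pixel intensities.
img = np.array([[0, 51], [204, 255]], dtype=np.float32)

# What rescale=1./255 applies to each incoming image.
normalized = img * (1. / 255)

print(normalized.max())  # 1.0
```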

- `train_generator` (created from `train_datagen`, e.g. with its `flow_from_directory` method pointed at `data_dir`) yields batches; retrieve one:
```python
image_batch, label_batch = next(iter(train_generator))
```
- Plot the batch in an 8x4 grid of subplots, labeling each image with its class name:
```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed mapping from class index back to class name,
# e.g. built from train_generator.class_indices.
idx_labels = {v: k for k, v in train_generator.class_indices.items()}

fig, axes = plt.subplots(8, 4, figsize=(10, 20))
axes = axes.flatten()
for img, lbl, ax in zip(image_batch, label_batch, axes):
  ax.imshow(img)
  label = np.argmax(lbl)
  label = idx_labels[label]
  ax.set_title(label)
  ax.axis('off')
plt.show()
```