- The columns in this file are:
```python
['Pregnancies', 'Glucose', 'BloodPressure',
 'SkinThickness', 'Insulin', 'BMI',
 'DiabetesPedigree', 'Age', 'Outcome']
```
- Let's look at this file with a few lines of code.
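One quick way to inspect the file is to read it with pandas, supplying the column names (the file itself has no header row). This sketch assumes the file has already been downloaded to `data/`; the two stand-in rows it writes are for illustration only, so the code runs even before the real download:

```python
import os
import pandas as pd

file_path = 'data/'
file_name = 'pima-indians-diabetes.data.csv'

# Illustration only: create a tiny stand-in if the real file
# hasn't been downloaded yet (the real file has no header row).
os.makedirs(file_path, exist_ok=True)
if not os.path.exists(file_path + file_name):
    with open(file_path + file_name, 'w') as f:
        f.write("6,148,72,35,0,33.6,0.627,50,1\n")
        f.write("1,85,66,29,0,26.6,0.351,31,0\n")

col_name = ['Pregnancies', 'Glucose', 'BloodPressure',
            'SkinThickness', 'Insulin', 'BMI',
            'DiabetesPedigree', 'Age', 'Outcome']
df = pd.read_csv(file_path + file_name, names=col_name)
print(df.head())
```

Passing `names=col_name` tells pandas there is no header row and labels the nine columns explicitly.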
- Since we want to stream this dataset, it is more convenient to read it as a CSV file and use a generator to output the rows. The way to do this is shown in the following code:
```python
import csv

file_path = 'data/'
file_name = 'pima-indians-diabetes.data.csv'

# Use the `open` command to create a file handle object, `csvfile`,
# that knows where the file is stored.
with open(file_path + file_name, newline='\n') as csvfile:
    # Pass it to the `reader` function in the csv library.
    f = csv.reader(csvfile, delimiter=',')
    # `f` is an iterator over the rows; the file is not loaded into
    # memory all at once. To inspect it, run a short for loop:
    for row in f:
        print(','.join(row))
```
- Make a generator to stream the content of the file:
```python
def stream_file(file_handle):
    holder = []
    for row in file_handle:
        # Remove the newline character \n, then fill up the holder.
        holder.append(row.rstrip("\n"))
        yield holder
        holder = []
```
```python
with open(file_path + file_name, newline='\n') as handle:
    # Pass handle to the generator function stream_file, which
    # contains a for loop that iterates through the file in
    # handle row by row.
    for part in stream_file(handle):
        # Print each yield from the generator.
        print(part)
```
### Setting Up a Pattern for Filenames
- Download data and store it on your local machine
```python
import logging, shutil, urllib3

logging.basicConfig(level=logging.INFO)

def download_dataset(url, local_file_name):
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    c = urllib3.PoolManager()
    with c.request("GET", url, preload_content=False) as res, open(
        local_file_name, "wb"
    ) as out_file:
        shutil.copyfileobj(res, out_file)
    logging.info("Download completed.")

logging.info("Started download script")
URL = 'https://covid.ourworldindata.org/data/owid-covid-data.csv'
LOCAL_FILE_NAME = "dataset/owid-covid-data.csv"
download_dataset(URL, LOCAL_FILE_NAME)
```
- Inspect the file and determine the number of rows:
```shell
$ wc -l dataset/owid-covid-data.csv
```
- Inspect the first three lines of the CSV file to see if there is a header:
```shell
$ head -3 dataset/owid-covid-data.csv
```
### Splitting a Single CSV File into Multiple CSV Files
- On Linux, you may need to first install the `parallel` command:
```shell
$ apt install parallel
```
- Now let's split this file into multiple CSV files, each with 330 rows. You should end up with 100 CSV files, each of which has the header. If you use Linux or macOS, use the following command:
- `files` is a list that contains all the CSV filenames that are part of the original CSV, in no particular order
#### Creating a Streaming Dataset Object
- For the purposes of this example, `new_deaths` is selected as the target column:
```python
csv_dataset = tf.data.experimental.make_csv_dataset(
    files,
    header=True,
    batch_size=5,
    label_name='new_deaths',
    num_epochs=1,
    ignore_errors=True
)
```
- To look at actual data, you'll need to use the `csv_dataset` object to iterate through the data:
```python
for features, target in csv_dataset.take(1):
    print("Target: {}".format(target))
    print("Features:")
    for k, v in features.items():
        print(" {!r:20s}: {}".format(k, v))
```
#### Streaming Images
- There are three steps to streaming the images. Let's look more closely:
- Create an `ImageDataGenerator` object and specify normalization parameters. Use the `rescale` parameter to indicate the normalization scale and the `validation_split` parameter to specify that 20% of data will be set aside for cross validation:
```python
datagen_kwargs = dict(rescale=1./255, validation_split=0.20)
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    **datagen_kwargs
)
```
- Connect the `ImageDataGenerator` object to the data source and specify parameters to resize the images to a fixed dimension:
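A minimal, runnable sketch of this step using `flow_from_directory`. The class names, the stand-in images, and `IMAGE_SIZE` are assumptions for illustration; with the real data, `data_dir` would point at the extracted flower photos directory instead:

```python
import os
import tempfile

import numpy as np
import tensorflow as tf
from PIL import Image

# Tiny stand-in for the image directory (one folder per class), so the
# sketch runs anywhere; replace data_dir with the real dataset path.
data_dir = tempfile.mkdtemp()
for cls in ("roses", "tulips"):
    os.makedirs(os.path.join(data_dir, cls))
    for i in range(5):
        arr = np.uint8(np.random.rand(64, 64, 3) * 255)
        Image.fromarray(arr).save(os.path.join(data_dir, cls, f"{i}.jpg"))

# Normalization and the 80/20 split from the previous step.
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255, validation_split=0.20
)

# Connect the generator to the data source; target_size resizes every
# image to a fixed dimension (224x224 is an assumed choice).
IMAGE_SIZE = (224, 224)
train_generator = train_datagen.flow_from_directory(
    data_dir,
    subset="training",
    shuffle=True,
    target_size=IMAGE_SIZE
)
```

`subset="training"` selects the 80% of images not held out by `validation_split`; each yielded batch contains resized, rescaled image arrays and one-hot labels.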
- Prepare a map for indexing the labels. In this step, you retrieve the index that the generator has assigned to each label and create a dictionary that maps each index back to the actual label name. The TensorFlow generator internally keeps track of labels based on the directory names below `data_dir`. They can be retrieved through `train_generator.class_indices`, which returns key-value pairs of labels and indices. You can take advantage of this when deploying the model for scoring: the model will output an index, so to implement the reverse lookup, simply invert the dictionary returned by `train_generator.class_indices`:
```python
label_idx = train_generator.class_indices
idx_labels = dict((v, k) for k, v in label_idx.items())

# Inspect the shape of one batch of images and labels.
for image_batch, label_batch in train_generator:
    print(image_batch.shape)
    print(label_batch.shape)
    break
```
### Streaming File Content with a Generator
- Download the Pima Indians Diabetes dataset and read it with pandas:
```python
import logging, shutil, urllib3
import pandas as pd

logging.basicConfig(level=logging.INFO)

def download_dataset(url, local_file_name):
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    c = urllib3.PoolManager()
    with c.request("GET", url, preload_content=False) as res, open(
        local_file_name, "wb"
    ) as out_file:
        shutil.copyfileobj(res, out_file)
    logging.info("Download completed.")

logging.info("Started download script")
URL = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
LOCAL_FILE_NAME = "data/pima-indians-diabetes.data.csv"
download_dataset(URL, LOCAL_FILE_NAME)

file_path = 'data/'
file_name = 'pima-indians-diabetes.data.csv'
col_name = ['Pregnancies', 'Glucose', 'BloodPressure',
            'SkinThickness', 'Insulin', 'BMI',
            'DiabetesPedigree', 'Age', 'Outcome']
pd.read_csv(file_path + file_name, names=col_name)
```
### Creating a File Pattern Object Using tf.io
- The `tf.io` API leverages the glob library to generate a list of filenames that match the pattern object:
```python
import tensorflow as tf

base_pattern = 'dataset'
file_pattern = 'owid-covid-data-part*'
files = tf.io.gfile.glob(base_pattern + '/' + file_pattern)
```
### Using TensorFlow Image Generator
- Download the flower photos dataset:
```python
import tensorflow as tf

data_dir = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True
)
```
### Streaming Cross-Validation Images
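The 20% of images held out by `validation_split` can be streamed the same way. A minimal sketch, with the same assumptions as the training step (stand-in images, assumed class names, assumed `IMAGE_SIZE`); with the real data, `data_dir` would point at the flower photos directory:

```python
import os
import tempfile

import numpy as np
import tensorflow as tf
from PIL import Image

# Tiny stand-in dataset so the sketch runs anywhere; replace data_dir
# with the real dataset path in practice.
data_dir = tempfile.mkdtemp()
for cls in ("daisy", "roses"):
    os.makedirs(os.path.join(data_dir, cls))
    for i in range(5):
        arr = np.uint8(np.random.rand(64, 64, 3) * 255)
        Image.fromarray(arr).save(os.path.join(data_dir, cls, f"{i}.jpg"))

train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1./255, validation_split=0.20
)

# subset="validation" selects the 20% held out by validation_split.
IMAGE_SIZE = (224, 224)  # assumed target size
valid_generator = train_datagen.flow_from_directory(
    data_dir,
    subset="validation",
    shuffle=False,
    target_size=IMAGE_SIZE
)
```

`shuffle=False` keeps validation batches in a stable order, which makes evaluation results reproducible.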
### Inspecting Resized Images
```python
import numpy as np
import matplotlib.pyplot as plt

image_batch, label_batch = next(iter(train_generator))

fig, axes = plt.subplots(8, 4, figsize=(10, 20))
axes = axes.flatten()
for img, lbl, ax in zip(image_batch, label_batch, axes):
    ax.imshow(img)
    label = np.argmax(lbl)
    label = idx_labels[label]
    ax.set_title(label)
    ax.axis('off')
plt.show()
```