chanelcolgate / hydroelectric-project


Data Storage and Ingestion #19

Open chanelcolgate opened 2 years ago

chanelcolgate commented 2 years ago

### Streaming File Content with a Generator
- Download the dataset and store it on your local machine:
```python
import logging
import shutil
import urllib3

logging.basicConfig(level=logging.INFO)

def download_dataset(url, LOCAL_FILE_NAME):
  urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
  c = urllib3.PoolManager()
  with c.request("GET", url, preload_content=False) as res, open(
      LOCAL_FILE_NAME, "wb"
  ) as out_file:
    shutil.copyfileobj(res, out_file)
  logging.info("Download completed.")

logging.info("Started download script")
URL = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
LOCAL_FILE_NAME = "data/pima-indians-diabetes.data.csv"
download_dataset(URL, LOCAL_FILE_NAME)
```

- The columns in this file are:
```python
['Pregnancies','Glucose','BloodPressure',
 'SkinThickness','Insulin','BMI',
 'DiabetesPedigree','Age','Outcome']
```
- Read the file into a pandas DataFrame, passing these column names:
```python
import pandas as pd

file_path = 'data/'
file_name = 'pima-indians-diabetes.data.csv'

col_name = ['Pregnancies','Glucose','BloodPressure',
            'SkinThickness','Insulin','BMI',
            'DiabetesPedigree','Age','Outcome']
pd.read_csv(file_path + file_name, names=col_name)
```

- Since we want to stream this dataset, it is more convenient to read it as a CSV file and use a generator to output the rows. The way to do this is through the following code:
```python
import csv
file_path = 'data/'
file_name = 'pima-indians-diabetes.data.csv'

# Use `open` to create a file handle object, `csvfile`,
# that knows where the file is stored.
with open(file_path + file_name, newline='\n') as csvfile:
  # Pass it to the `reader` function in the csv library.
  f = csv.reader(csvfile, delimiter=',')
  # `f` is a lazy reader object; rows are read from disk as
  # the loop requests them, not loaded into memory all at once.
  # To inspect the file, iterate over the rows:
  for row in f:
    print(','.join(row))
```
- To stream the file, wrap the row iteration in a generator:
```python
def stream_file(file_handle):
  # Iterate through the file in `file_handle` row by row,
  # yielding one row at a time.
  for row in file_handle:
    yield row

with open(file_path + file_name, newline='\n') as handle:
  # Pass `handle` to the generator function `stream_file`.
  for part in stream_file(handle):
    # Print what the generator yields.
    print(part)
```
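The pattern above can be exercised without the downloaded file. A minimal self-contained sketch using an in-memory buffer in place of the file handle (the row data here is made up to match the column layout):

```python
import csv
import io

def stream_file(file_handle):
  # Generator: yield the file's rows one at a time,
  # without loading the whole file into memory.
  for row in file_handle:
    yield row

# A made-up two-row CSV standing in for the downloaded file.
buffer = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

# csv.reader accepts any iterable of lines, so the generator
# can be chained directly into it.
rows = [part for part in csv.reader(stream_file(buffer))]
print(rows[0][0])  # first field of the first row
```

Because `stream_file` yields lazily, only one line at a time is held in memory, which is what makes this pattern suitable for files larger than RAM.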
### Setting Up a Pattern for Filenames
- Download data and store it on your local machine
```python
import logging, urllib3, shutil
logging.basicConfig(level=logging.INFO)

def download_dataset(url, LOCAL_FILE_NAME):
  urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
  c = urllib3.PoolManager()
  with c.request("GET", url, preload_content=False) as res, open(
      LOCAL_FILE_NAME, "wb"
  ) as out_file:
    shutil.copyfileobj(res, out_file)
  logging.info("Download completed.")

logging.info("Started download script")
URL = 'https://covid.ourworldindata.org/data/owid-covid-data.csv'
LOCAL_FILE_NAME = "dataset/owid-covid-data.csv"
download_dataset(URL, LOCAL_FILE_NAME)
```

- Assuming the downloaded CSV has been split into smaller parts named `owid-covid-data-part*`, collect the part filenames with a glob pattern:
```python
import tensorflow as tf

base_pattern = 'dataset'
file_pattern = 'owid-covid-data-part*'
files = tf.io.gfile.glob(base_pattern + '/' + file_pattern)
```

- `files` is a list that contains all the CSV filenames that are part of the original CSV, in no particular order
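For local paths, `tf.io.gfile.glob` follows the same wildcard rules as Python's standard `glob` module, so the matching can be sketched without TensorFlow. A small self-contained illustration using a temporary directory (the filenames are created here just for the demonstration):

```python
import glob
import os
import tempfile

# Create a few part files plus one unrelated file.
tmp = tempfile.mkdtemp()
for i in range(3):
  open(os.path.join(tmp, f'owid-covid-data-part{i}'), 'w').close()
open(os.path.join(tmp, 'other.csv'), 'w').close()

# The wildcard matches only the part files; the result order
# is not guaranteed, so sort it when order matters.
files = sorted(glob.glob(tmp + '/owid-covid-data-part*'))
print(len(files))  # 3
```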
#### Creating a Streaming Dataset Object
- For the purposes of this example, `new_deaths` is selected as the target column:
```python
csv_dataset = tf.data.experimental.make_csv_dataset(
    files,
    header=True,
    batch_size=5,
    label_name='new_deaths',
    num_epochs=1,
    ignore_errors=True
)
```
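For intuition, what `make_csv_dataset` does here can be sketched in plain Python: read each part, split off the label column, and yield fixed-size batches of (features, labels). This is a simplified illustration with made-up file contents, not TensorFlow's implementation:

```python
import csv
import io

def batched_labeled_rows(csv_texts, label_name, batch_size):
  # Yield (features, labels) batches drawn from several CSV
  # parts, mimicking the shape of make_csv_dataset's output.
  batch_feats, batch_labels = [], []
  for text in csv_texts:
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
      label = row.pop(label_name)
      batch_feats.append(row)
      batch_labels.append(label)
      if len(batch_feats) == batch_size:
        yield batch_feats, batch_labels
        batch_feats, batch_labels = [], []

# Two made-up CSV parts, each with its own header row.
parts = [
    "location,new_cases,new_deaths\nA,10,1\nB,20,2\n",
    "location,new_cases,new_deaths\nC,30,3\nD,40,4\n",
]
batches = list(batched_labeled_rows(parts, 'new_deaths', batch_size=2))
```

As in the real dataset, each batch pairs a features structure with the values of the `label_name` column stripped out of the rows.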

- Download the flower photos dataset and untar it:
```python
import tensorflow as tf

data_dir = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True
)
```

#### There are three steps to streaming the images. Let's look more closely:
- Create an `ImageDataGenerator` object and specify normalization parameters. Use the `rescale` parameter to indicate the normalization scale and the `validation_split` parameter to specify that 20% of data will be set aside for cross validation:
```python
datagen_kwargs = dict(rescale=1./255, validation_split=0.20)
train_datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    **datagen_kwargs
)
```
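The `rescale=1./255` parameter multiplies every pixel value by 1/255, mapping 8-bit intensities into the [0, 1] range before the images reach the model. A quick sketch of that normalization with NumPy (the pixel array here is made up):

```python
import numpy as np

# A made-up 2x2 "image" with 8-bit pixel intensities.
img = np.array([[0, 51], [204, 255]], dtype=np.float32)

# What rescale=1./255 applies to each incoming image.
normalized = img * (1. / 255)

print(normalized.max())  # 1.0
```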

- `train_generator` (created from `train_datagen`, e.g. with its `flow_from_directory` method pointed at `data_dir`) yields batches; retrieve one:
```python
image_batch, label_batch = next(iter(train_generator))
```
- Plot the batch in an 8x4 grid of subplots, labeling each image with its class name:
```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed mapping from class index back to class name,
# e.g. built from train_generator.class_indices.
idx_labels = {v: k for k, v in train_generator.class_indices.items()}

fig, axes = plt.subplots(8, 4, figsize=(10, 20))
axes = axes.flatten()
for img, lbl, ax in zip(image_batch, label_batch, axes):
  ax.imshow(img)
  label = np.argmax(lbl)
  label = idx_labels[label]
  ax.set_title(label)
  ax.axis('off')
plt.show()
```