Generators should be used to lazily instantiate and iterate over records when writing them to file. Additionally, logic for persisting dependencies to the local database should be changed to persist every time a record is generated, rather than in one go at the end. This is for compatibility with the generator pattern, but also avoids creating huge lists in memory for large record counts.
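As a minimal sketch of the difference, the example below contrasts building a full list with yielding records one at a time. The function names and the single-field record are hypothetical and are not taken from the codebase.

```python
from typing import Dict, Iterator, List


def create_records_eagerly(record_count: int) -> List[Dict[str, int]]:
    """Builds every record up front, so memory use grows with record_count."""
    return [{"instrument_id": i} for i in range(record_count)]


def generate_records(record_count: int) -> Iterator[Dict[str, int]]:
    """Yields one record at a time; earlier records can be garbage collected."""
    for i in range(record_count):
        yield {"instrument_id": i}


# Calling the generator function does no work yet; a record is only
# created when the consumer asks for the next one.
records = generate_records(1_000_000)
first = next(records)
```

Only the record currently being written needs to exist in memory, which is the property the file builders would rely on.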
## Design
This is a proposed implementation of how to use generators to yield records. This example looks at the `instrument` domain object only, but the same logic applies to all others.
creatable.py
Remove the method `persist_records` and incorporate its logic into `persist_record` instead:

    def persist_record(self, record, table_name):
        if self.__database is None:
            self.establish_db_connection()
        self.__database.persist(table_name, record)
        self.__database.commit_changes()

Rename the abstract method `create` to `get_generator`.
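The two changes to the base class can be sketched together as follows. This is an illustration under stated assumptions, not the project's code: `InMemoryDatabase` and `CounterFactory` are hypothetical stand-ins, and the database attribute uses a single underscore for simplicity.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, List, Tuple


class InMemoryDatabase:
    """Hypothetical stand-in for the project's SQLite wrapper."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, List[str]]] = []
        self.commits = 0

    def persist(self, table_name: str, record: List[str]) -> None:
        self.rows.append((table_name, record))

    def commit_changes(self) -> None:
        self.commits += 1


class Creatable(ABC):
    def __init__(self, database: InMemoryDatabase) -> None:
        self._database = database

    def persist_record(self, record: List[str], table_name: str) -> None:
        # One write and one commit per generated record.
        self._database.persist(table_name, record)
        self._database.commit_changes()

    @abstractmethod
    def get_generator(self, record_count: int, start_id: int) -> Iterator[Dict[str, Any]]:
        """Yield record_count records with ids starting at start_id."""


class CounterFactory(Creatable):
    """Hypothetical concrete factory used only for this illustration."""

    def get_generator(self, record_count, start_id):
        for record_id in range(start_id, start_id + record_count):
            record = {"id": record_id}
            self.persist_record([str(record_id)], "counters")
            yield record


database = InMemoryDatabase()
generator = CounterFactory(database).get_generator(record_count=3, start_id=10)
first = next(generator)  # persists and yields only the first record
```

Committing after every record keeps the database in step with what has been yielded, at the cost of more commits than a single batch write would need.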
sqlite_database.py
Remove `persist_batch` and replace it with a new method, `persist`.
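One possible shape for the new `persist` method is sketched below using the standard library `sqlite3` module. The class layout, the constructor argument, the table schema, and the sample values are assumptions made for illustration; only the method names `persist` and `commit_changes` come from the proposal.

```python
import sqlite3
from typing import Sequence


class SqliteDatabase:
    """Hypothetical wrapper; the real class may be structured differently."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._connection = connection

    def persist(self, table_name: str, record: Sequence[str]) -> None:
        # Values are bound as parameters. A table name cannot be bound,
        # so it is assumed to come from trusted application code.
        placeholders = ", ".join("?" for _ in record)
        self._connection.execute(
            f"INSERT INTO {table_name} VALUES ({placeholders})", tuple(record)
        )

    def commit_changes(self) -> None:
        self._connection.commit()


# Usage with an in-memory database and made-up placeholder values.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE instruments (instrument_id, ric, cusip, isin, market)"
)
database = SqliteDatabase(connection)
database.persist("instruments", ["1", "RIC1", "CUSIP1", "ISIN1", "MKT"])
database.commit_changes()
rows = connection.execute("SELECT * FROM instruments").fetchall()
```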
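instrument_factory.py
`create` should be renamed and refactored:

    def get_generator(self, record_count, start_id):
        self.tickers = self.retrieve_column('tickers', 'symbol')
        for record_id in range(start_id, start_id + record_count):
            record = self.__create_record(record_id)
            # Persist each record as it is generated rather than in one batch at the end.
            self.persist_record(
                [str(record['instrument_id']),
                 record['ric'],
                 str(record['cusip']),
                 str(record['isin']),
                 str(record['market'])],
                'instruments'  # assumed table name; persist_record requires one
            )
            yield record
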
XYZ_builder.py
For each file builder class, change the `build` implementation to consume the generator object that is now passed through the `data` argument in place of a list of records. For example, `CSVBuilder` would be changed to:

    import csv
    import os


    class CSVBuilder(FileBuilder):

        def build(self, file_number, data):
            output_dir = self.get_output_directory()
            file_name = self.get_file_name().format(f'{file_number:03}')

            # exist_ok avoids failing if another process creates the directory first.
            os.makedirs(output_dir, exist_ok=True)

            with open(os.path.join(output_dir, file_name), 'w+', newline='') as output_file:
                # Pull the first record to discover the field names; there may be a
                # more convenient way to get them, but this should work otherwise.
                first_record = next(data, None)
                if first_record is not None:
                    dict_writer = csv.DictWriter(
                        output_file, restval="-",
                        fieldnames=list(first_record.keys()), delimiter=','
                    )
                    dict_writer.writeheader()
                    dict_writer.writerow(first_record)
                    for record in data:
                        dict_writer.writerow(record)

            # Upload only once the file has been closed and fully written.
            if self.google_drive_connector_exists():
                self.upload_to_google_drive(output_dir, file_name)

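If pulling the first record by hand feels awkward, one alternative is to peek at it and then chain it back in front of the rest of the generator with `itertools.chain`. The sketch below is a standalone illustration: `write_records` is a hypothetical helper, not an existing method, and it writes to any text stream so it can be exercised without touching the file system.

```python
import csv
import io
import itertools
from typing import Dict, Iterable, TextIO


def write_records(output_file: TextIO, data: Iterable[Dict[str, object]]) -> int:
    """Stream records to CSV and return the number of rows written."""
    iterator = iter(data)
    first_record = next(iterator, None)
    if first_record is None:
        return 0  # empty generator: nothing to write, not even a header

    writer = csv.DictWriter(
        output_file, fieldnames=list(first_record.keys()), restval="-", delimiter=","
    )
    writer.writeheader()

    rows_written = 0
    # chain() puts the peeked record back in front without building a list.
    for record in itertools.chain([first_record], iterator):
        writer.writerow(record)
        rows_written += 1
    return rows_written


buffer = io.StringIO()
count = write_records(
    buffer, ({"instrument_id": i, "ric": f"RIC{i}"} for i in range(2))
)
```

This keeps a single write loop and handles the empty-generator case explicitly, which a bare `next(data)` would turn into a `StopIteration`.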
## Multi-Processing
The multi-processing workflow will need to change so that, instead of generating records and passing them to a file builder, the generator object is instantiated and passed to the file builder. This may result in an overhaul of the multiprocessing logic, since there will be no need for an intermediate queue of created records, possibly making the `generator_process` redundant.
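One way this could look is sketched below. Generator objects cannot be pickled, so they cannot be sent between processes; the sketch therefore sends each worker a small job description and lets the worker create its own generator. Everything here is an assumption for illustration: `plan_jobs`, `build_file`, `run`, the job tuple layout, and the zero-based file numbering are hypothetical, and `build_file` simply counts records where the real code would call a builder's `build` method.

```python
from multiprocessing import Pool
from typing import Dict, Iterator, List, Tuple

Job = Tuple[int, int, int]  # (file_number, start_id, record_count)


def plan_jobs(total_records: int, records_per_file: int, first_id: int = 1) -> List[Job]:
    """Split the requested records into one job per output file."""
    jobs = []
    for file_number, offset in enumerate(range(0, total_records, records_per_file)):
        count = min(records_per_file, total_records - offset)
        jobs.append((file_number, first_id + offset, count))
    return jobs


def get_generator(record_count: int, start_id: int) -> Iterator[Dict[str, int]]:
    """Stand-in for a factory's get_generator method."""
    for record_id in range(start_id, start_id + record_count):
        yield {"instrument_id": record_id}


def build_file(job: Job) -> int:
    """Runs inside a worker: create the generator locally and consume it."""
    file_number, start_id, record_count = job
    records = get_generator(record_count, start_id)
    # Stand-in for builder.build(file_number, records).
    return sum(1 for _ in records)


def run(total_records: int, records_per_file: int, processes: int = 2) -> List[int]:
    """Distribute the jobs over a pool; only plain tuples cross processes."""
    with Pool(processes=processes) as pool:
        return pool.map(build_file, plan_jobs(total_records, records_per_file))
```

`run` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. With this layout no intermediate queue of records is needed, although each worker would need its own database connection for the per-record persistence.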
At some point in the current workflow a redundant outer list is introduced. This may cause an issue once a generator is used instead, so it is worth identifying where the list comes from to ensure it does not interfere with the new workflow.
## Documentation Changes
All docstrings should be updated to reflect the new argument types and functionality.
## Test Evidence
All tests should pass and all files should build as expected.
## Validation in Develop
Running `python src/app.py` should give the expected output.
## Issue Description
Documentation on the generator pattern: https://wiki.python.org/moin/Generators

A more readable article: https://realpython.com/introduction-to-python-generators/