galatea-associates / fuse-test-data-gen

Repository for the Galatea internal data generator tool, used for generating domain data for POCs

Use a generator to create domain objects #211

Open WilfGala opened 4 years ago

WilfGala commented 4 years ago

Issue Description

Documentation on generator pattern: https://wiki.python.org/moin/Generators

More readable article: https://realpython.com/introduction-to-python-generators/

Generators should be used to lazily instantiate and iterate over records when writing them to file. Additionally, logic for persisting dependencies to the local database should be changed to persist every time a record is generated, rather than in one go at the end. This is for compatibility with the generator pattern, but also avoids creating huge lists in memory for large record counts.
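
As a minimal sketch of the pattern described above (not the tool's actual code), the following uses an in-memory SQLite database to show records being persisted one at a time as they are yielded, so the full list is never held in memory. The function `generate_instruments`, the `instruments` table and its two columns are assumptions made for this example only.

```python
import sqlite3


def generate_instruments(connection, record_count, start_id=1):
    """Lazily yield records, persisting each one as it is created.

    Hypothetical stand-in for a factory's generator; the table name and
    columns are assumptions made for this example only.
    """
    for instrument_id in range(start_id, start_id + record_count):
        record = {"instrument_id": instrument_id, "ric": f"RIC{instrument_id}"}
        # Persist immediately instead of collecting records for a batch insert.
        connection.execute(
            "INSERT INTO instruments VALUES (?, ?)",
            (record["instrument_id"], record["ric"]),
        )
        connection.commit()
        yield record


def count_rows(connection):
    return connection.execute("SELECT COUNT(*) FROM instruments").fetchone()[0]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE instruments (instrument_id INTEGER, ric TEXT)")

records = generate_instruments(connection, record_count=3)
# Nothing has been created or persisted yet: a generator runs only on demand.
rows_before = count_rows(connection)

first = next(records)  # creates and persists exactly one record
rows_after_one = count_rows(connection)

remaining = list(records)  # drains the rest of the generator
rows_total = count_rows(connection)
```

Committing after every record mirrors the proposal; if throughput becomes a concern, committing every N records might be worth evaluating.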

Design

This is a proposed implementation of how to use generators to yield records. This example will look at the instrument domain object only, but the same logic is applicable to all others.

creatable.py

Remove the persist_records method and fold its logic into persist_record, so that each record is persisted and committed as soon as it is generated:

    def persist_record(self, record, table_name):
        """Persist a single record to table_name and commit immediately."""
        # Connect lazily on first use.
        if self.__database is None:
            self.establish_db_connection()
        self.__database.persist(table_name, record)
        self.__database.commit_changes()

Rename the abstract method create to get_generator.
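
A sketch of what the renamed abstract method could look like, assuming the base class in creatable.py is called `Creatable` and is built on the standard `abc` module (neither is confirmed by this issue). `CountingFactory` is a hypothetical subclass used only for illustration.

```python
from abc import ABC, abstractmethod


class Creatable(ABC):
    """Assumed shape of the base class; only the renamed method is shown."""

    @abstractmethod
    def get_generator(self, record_count, start_id):
        """Return a generator yielding record_count records from start_id."""


class CountingFactory(Creatable):
    """Hypothetical subclass used only to exercise the interface."""

    def get_generator(self, record_count, start_id):
        for record_id in range(start_id, start_id + record_count):
            yield {"id": record_id}


ids = [record["id"] for record in CountingFactory().get_generator(3, 10)]
```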

sqlite_database.py

Remove persist_batch and replace it with a new method, persist:

    def persist(self, table_name, record):
        record = self.format_list_for_insertion(record)
        prepared_row = ",".join(record)
        query = " ".join(("INSERT INTO", table_name, "VALUES", prepared_row))
        self.__connection.execute(query)
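
The query above is assembled by string concatenation, so it relies on format_list_for_insertion quoting every value correctly. As a possible alternative (a sketch, not the current implementation), `sqlite3` placeholders let the driver handle quoting. The standalone `persist` function and the `instruments` table are assumptions for this example. Table names cannot be bound as parameters, so the name is still interpolated and should only ever come from trusted code.

```python
import sqlite3


def persist(connection, table_name, record):
    """Insert one record (a sequence of values) using bound parameters."""
    placeholders = ",".join("?" for _ in record)
    # table_name is interpolated, so it must never come from user input.
    query = f"INSERT INTO {table_name} VALUES ({placeholders})"
    connection.execute(query, tuple(record))


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE instruments (instrument_id TEXT, ric TEXT)")
# A value containing a quote is stored verbatim, with no manual escaping.
persist(connection, "instruments", ["1", "O'Neil.L"])
connection.commit()
stored = connection.execute("SELECT * FROM instruments").fetchall()
```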

instrument_factory.py

create should be renamed to get_generator and refactored to yield records one at a time:

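    def get_generator(self, record_count, start_id):
        """Yield instrument records, persisting each as it is created."""
        self.tickers = self.retrieve_column('tickers', "symbol")

        for instrument_id in range(start_id, start_id + record_count):
            record = self.__create_record(instrument_id)
            # persist_record takes (record, table_name). The table name
            # 'instruments' is an assumption; use this factory's table.
            self.persist_record(
                [str(record['instrument_id']),
                 record['ric'],
                 str(record['cusip']),
                 str(record['isin']),
                 str(record['market'])],
                'instruments'
            )
            yield record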

XYZ_builder.py

For each file builder class, change the build method implementation to use the generator object it is now being passed in place of a list of records through the data argument. For example, CSV Builder would be changed to:

import csv
import os


class CSVBuilder(FileBuilder):

    def build(self, file_number, data):
        output_dir = self.get_output_directory()
        file_name = self.get_file_name().format(f'{file_number:03}')

        # Create the output directory (and any parents) if it is missing.
        os.makedirs(output_dir, exist_ok=True)

        with open(os.path.join(output_dir, file_name), 'w+', newline='') as output_file:
            # Take the first record to discover the field names; a default
            # of None avoids StopIteration when the generator is empty.
            first_record = next(data, None)
            if first_record is not None:
                dict_writer = csv.DictWriter(
                    output_file, restval="-",
                    fieldnames=list(first_record.keys()), delimiter=','
                )
                dict_writer.writeheader()
                dict_writer.writerow(first_record)
                for record in data:
                    dict_writer.writerow(record)

        if self.google_drive_connector_exists():
            self.upload_to_google_drive(output_dir, file_name)
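
The header-then-rows logic in build can be isolated and exercised without touching the filesystem. A minimal sketch, assuming a hypothetical helper `write_csv` that is not part of the tool: it peeks at the first record for the field names, uses `itertools.chain` to put that record back in front of the rest, and writes nothing for an empty generator instead of raising.

```python
import csv
import io
from itertools import chain


def write_csv(output_file, records):
    """Write an iterable of dict records as CSV; return the rows written."""
    records = iter(records)
    first_record = next(records, None)
    if first_record is None:
        return 0  # empty input: no header, no rows
    writer = csv.DictWriter(
        output_file, restval="-", fieldnames=list(first_record.keys())
    )
    writer.writeheader()
    count = 0
    # chain() puts the peeked record back in front of the remaining ones.
    for record in chain([first_record], records):
        writer.writerow(record)
        count += 1
    return count


buffer = io.StringIO()
rows = write_csv(buffer, ({"id": i, "ric": f"RIC{i}"} for i in range(1, 3)))
lines = buffer.getvalue().splitlines()
```

Because the helper accepts any iterable, the same function works for a generator, a list, or a test fixture.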

Multi-Processing

The multi-processing workflow will need to change: instead of generating records and passing them to a file builder, the generator object is instantiated and passed to the file builder. This may require an overhaul of the multiprocessing logic, since an intermediate queue of created records will no longer be needed, which may make the generator_process redundant.
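
One constraint worth noting for this redesign: Python generator objects cannot be pickled, so a generator cannot be handed from one process to another. The sketch below shows one possible shape, assuming each worker receives only plain arguments and creates its generator locally. `plan_jobs`, `build_file`, `run_jobs` and the stand-in `get_generator` are hypothetical names, not part of the tool.

```python
from multiprocessing import Pool


def get_generator(record_count, start_id):
    """Stand-in for a factory's get_generator (assumed signature)."""
    for record_id in range(start_id, start_id + record_count):
        yield {"id": record_id}


def plan_jobs(total_records, records_per_file, start_id=1):
    """Split the total into (file_number, start_id, record_count) tuples."""
    jobs = []
    file_number = 0
    while total_records > 0:
        count = min(records_per_file, total_records)
        jobs.append((file_number, start_id, count))
        start_id += count
        total_records -= count
        file_number += 1
    return jobs


def build_file(job):
    """Runs in a worker: create the generator here and consume it directly."""
    file_number, start_id, record_count = job
    records = get_generator(record_count, start_id)
    # A real worker would pass `records` to a file builder's build() method.
    written = sum(1 for _ in records)
    return file_number, written


def run_jobs(jobs, processes=2):
    """Distribute jobs over a pool; only picklable tuples cross processes."""
    with Pool(processes=processes) as pool:
        return pool.map(build_file, jobs)
```

`run_jobs(plan_jobs(...))` should be called under an `if __name__ == "__main__":` guard so that it also works on platforms that spawn worker processes.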


At some point in the current workflow, a redundant outer list is wrapped around the records. This may cause problems once a generator is used instead, so it is worth identifying where the list is introduced to ensure it does not interfere with the new workflow.

## Documentation Changes
All docstrings should be updated to reflect the new argument types and functionality.

## Test Evidence
All tests should pass and all files should build as expected.

## Validation in Develop
Running `python src/app.py` should give the expected output.