Closed Tims777 closed 9 months ago
The labeled dataset contains ~100,000 entries. The Google Places API alone would cost around $5,000 (300k requests × $0.017).
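A quick back-of-the-envelope check of that estimate (the ~3 requests per entry is an assumption inferred from the 300k figure; the per-request price is taken from the issue text):

```python
# Cost estimate for the Google Places API step.
ENTRIES = 100_000
REQUESTS_PER_ENTRY = 3        # assumption: ~3 API calls per lead (300k / 100k)
PRICE_PER_REQUEST = 0.017     # USD per request, as quoted above

total_requests = ENTRIES * REQUESTS_PER_ENTRY
total_cost = total_requests * PRICE_PER_REQUEST
print(f"{total_requests:,} requests -> ${total_cost:,.0f}")  # 300,000 requests -> $5,100
```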
Another challenge is execution time. We ran our pipeline on a subset of 10k entries, and the regionalatlas step alone took about 20 hours. The dev's local machine might not be the best execution environment if we ever want to run on the full dataset.
The runtime problem could be fixed by #131. Additionally, we should implement local backups, at least after each step or after processing x leads, to avoid data loss in case of unexpected errors. This will inflate the issue's size, though.
Introduced regular data snapshots in #167
To train the AI model, we need:
a) revenue data (that should be predicted)
b) correlated features (that we can use as the basis for our prediction)
To create a respective dataset: take the historical data (which contains revenue data) and run our pipeline on it to enrich it with (hopefully) correlated features.
Precondition
We get a CSV file with all the lead data as we know it plus some form of revenue data.
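Given such a CSV, splitting it into the prediction target and the base features to enrich could look like this. The column names (`lead_id`, `name`, `revenue`) are placeholders, since the actual export schema isn't specified yet:

```python
from io import StringIO
import pandas as pd

# Stand-in for the delivered CSV file; real data would come from disk.
csv_text = "lead_id,name,revenue\n1,Acme,120000\n2,Beta GmbH,45000\n"
leads = pd.read_csv(StringIO(csv_text))

y = leads["revenue"]                 # target the model should predict
X = leads.drop(columns=["revenue"])  # lead data, to be enriched by the pipeline
```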
Acceptance Criteria