Closed Tims777 closed 9 months ago
The labeled dataset contains ~100,000 entries. The Google Places API alone would cost around $5,000 (300k requests × $0.017).
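A quick back-of-the-envelope check of that estimate (the ~3 requests per entry is an assumption inferred from the 300k figure; the per-request price is taken from the issue text):

```python
# Cost estimate for the Google Places API step.
ENTRIES = 100_000
REQUESTS_PER_ENTRY = 3        # assumption: ~3 API calls per lead (300k / 100k)
PRICE_PER_REQUEST = 0.017     # USD per request, as quoted above

total_requests = ENTRIES * REQUESTS_PER_ENTRY
total_cost = total_requests * PRICE_PER_REQUEST
print(f"{total_requests:,} requests -> ${total_cost:,.0f}")  # 300,000 requests -> $5,100
```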
Another challenge is execution time. We ran our pipeline on a subset of 10k entries, and the regionalatlas step alone took about 20 hours. The dev's local machine might not be the best execution environment if we ever want to run on the full dataset.
The runtime problem could be fixed by #131. Additionally, we should implement local backups, at least after each step or after processing x leads, to avoid data loss in case of unexpected errors. This will inflate the issue's size, though.
Introduced regular data snapshots in #167
To train the AI model, we need:
a) revenue data (that should be predicted)
b) correlated features (that we can use as the basis for our prediction)
To create a respective dataset: take the historical data (which contains revenue data) and run our pipeline on it to enrich it with (hopefully) correlated features.
Precondition
We get a CSV file with all the lead data as we know it plus some form of revenue data.
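Given such a CSV, splitting it into the prediction target and the base features to enrich could look like this. The column names (`lead_id`, `name`, `revenue`) are placeholders, since the actual export schema isn't specified yet:

```python
from io import StringIO
import pandas as pd

# Stand-in for the delivered CSV file; real data would come from disk.
csv_text = "lead_id,name,revenue\n1,Acme,120000\n2,Beta GmbH,45000\n"
leads = pd.read_csv(StringIO(csv_text))

y = leads["revenue"]                 # target the model should predict
X = leads.drop(columns=["revenue"])  # lead data, to be enriched by the pipeline
```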
Acceptance Criteria