amosproj / amos2023ws06-sales-lead-qualifier

MIT License

Repair BDC pipeline runs with forceRefresh=False #235

Closed Tims777 closed 9 months ago

Tims777 commented 10 months ago

When running the pipeline run_all_steps.json with forceRefresh set to false everywhere, several errors occur in different steps. These need to be fixed, or the affected pipeline steps should be removed from the pipeline.

List of errors

Ordered by severity

- At the end of the regional atlas step: | ERROR | pipeline.py:57 | Step Regional_Atlas failed! Columns must be same length as key
- All the time in the GPT and insights enhancer steps: | ERROR | s3_repository.py:212 | Error loading review from S3 with id ChIJkdTnnsMzs1IRlCF2m6bKYsU. Error: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist. (might indicate a problem with the Google step)
- Address scraping hangs: Getting addresses from custom domains...: 47%|████████████████████████▏ | 241/518 [17:00<19:32, 4.23s/it]

Note

The current pipeline run_all_steps.json should be changed to have forceRefresh: false set everywhere. The current configuration can optionally be copied to a new pipeline config force_refresh_all_steps.json.
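The change described in the note could be scripted roughly as follows. This is a minimal sketch, not the project's tooling: it assumes only that the config is valid JSON and that the flag is spelled forceRefresh wherever it appears; the function names disable_force_refresh and migrate_config are hypothetical.

```python
import json
import shutil
from pathlib import Path


def disable_force_refresh(node):
    """Recursively set every "forceRefresh" key to False; return the number changed."""
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "forceRefresh":
                if value is not False:
                    node[key] = False
                    changed += 1
            else:
                changed += disable_force_refresh(value)
    elif isinstance(node, list):
        for item in node:
            changed += disable_force_refresh(item)
    return changed


def migrate_config(config_path, backup_path):
    """Copy the current config to backup_path, then rewrite config_path in place."""
    config_path, backup_path = Path(config_path), Path(backup_path)
    # Keep the old behaviour available under a new name before touching the original.
    shutil.copyfile(config_path, backup_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    changed = disable_force_refresh(config)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return changed
```

For example, migrate_config("run_all_steps.json", "force_refresh_all_steps.json") would keep the old settings under the new name and return how many flags were switched off. The paths are relative here and would need adjusting to wherever the pipeline configs actually live.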

Acceptance Criteria

It is possible to run the run_all_steps.json with forceRefresh set to false everywhere
- All steps complete successfully (i.e. their output will appear in the enriched.csv file)
- If steps cannot be repaired easily, they should be excluded from the pipeline and the problem should be documented
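One way to check the first criterion is to look for each step's output columns in enriched.csv. The sketch below is only an illustration: the column names in the demo are made up, and which columns each step really writes would have to be taken from the pipeline itself.

```python
import csv
import io


def missing_step_columns(lines, expected_columns):
    """Return expected columns that are absent from the header or empty in every row.

    `lines` is any iterable of CSV lines, e.g. an open file or io.StringIO.
    """
    reader = csv.DictReader(lines)
    header = reader.fieldnames or []
    # Track, per expected column that exists, whether any row has a value for it.
    has_value = {name: False for name in expected_columns if name in header}
    for row in reader:
        for name in has_value:
            if (row.get(name) or "").strip():
                has_value[name] = True
    return [name for name in expected_columns if not has_value.get(name, False)]


# Demo with hypothetical column names (not the pipeline's real ones).
sample = io.StringIO("name,google_rating,sentiment\nACME,4.5,\nFoo,,\n")
print(missing_step_columns(sample, ["google_rating", "sentiment", "regional_atlas_pop"]))
# prints ['sentiment', 'regional_atlas_pop']
```

Against the real file this would be called with open("enriched.csv", newline="", encoding="utf-8") as the first argument. An empty result means every expected column is present and filled in at least one row.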

luccalb commented 10 months ago

The issue with regionalatlas seems to be related to the hashtables, as the error only occurs on the first run (without any hashtables).

luccalb commented 10 months ago

The address scraping step was a very early experiment and is not "production ready". It was never meant to end up in the final pipeline, as we get the address from Google. I'll create a special demo pipeline config for the BDC.

luccalb commented 10 months ago

The GPT Errors look worse than they are. It just means that the data was not present in the cache files. I adjusted the error logging.

Tims777 commented 10 months ago

Good to know. However, the error message is still appearing:

Running sentiment analysis on reviews:  78%|████████████████████████████████████████████████████████████                 | 928/1190 [04:11<02:55,  1.49it/s]
2024-02-05 22:20:13,197 |    ERROR | s3_repository.py:212 | Error loading review from S3 with id ChIJFzDNMdC_xkcRmSfqst9o1g8. Error: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.

luccalb commented 10 months ago

This seems to be S3-specific; I'll check again.

luccalb commented 10 months ago

It's just ungraceful error handling: when a Google place has no reviews, we don't save any to S3. The sentiment analyzer simply assumes where the review file should be but can't find it. The sentiment score will be None in that case.
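A more graceful version of that lookup might look like the sketch below. It is not the project's code: ReviewNotFound, InMemoryReviewStore and sentiment_for_place are hypothetical stand-ins so the example runs without AWS access. With boto3, the equivalent case would presumably be a botocore ClientError whose error code is NoSuchKey.

```python
import logging

logger = logging.getLogger(__name__)


class ReviewNotFound(Exception):
    """Hypothetical stand-in for the storage layer's "no such key" error."""


class InMemoryReviewStore:
    """Hypothetical stand-in for the S3 repository, backed by a dict."""

    def __init__(self, reviews_by_place_id):
        self._reviews = reviews_by_place_id

    def load_reviews(self, place_id):
        try:
            return self._reviews[place_id]
        except KeyError:
            raise ReviewNotFound(place_id) from None


def sentiment_for_place(store, place_id, score_fn):
    """Return the mean sentiment score, or None when the place has no stored reviews."""
    try:
        reviews = store.load_reviews(place_id)
    except ReviewNotFound:
        # Expected case: places without reviews never get a review file,
        # so this is logged at debug level instead of as an error.
        logger.debug("No reviews stored for place %s; sentiment left as None", place_id)
        return None
    if not reviews:
        return None
    return sum(score_fn(text) for text in reviews) / len(reviews)
```

The point of the design is to treat a missing review file as a normal outcome, so it no longer appears as an ERROR line in the pipeline log, while any other storage failure still propagates.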