Optimized the current dag_tasks file by doing the following:

- changed `rxclass_df.append()` from appending a new dataframe on each call to collecting all the created rows in a list and then creating the final dataframe from that list
- used the pandas deduplication method to remove the duplicates that @jrlegrand mentioned in issue #331
- sped up API requests by making them asynchronous while keeping the rate at 20 calls per second
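The first two changes can be sketched roughly as follows. The row dicts and column names here are illustrative stand-ins for the rxclass API results, not the actual schema; the point is building the DataFrame once from a list (repeated `DataFrame.append` copies the whole frame on every call) and then deduplicating with pandas.

```python
import pandas as pd

# Hypothetical rows mimicking collected rxclass API results;
# column names are illustrative, not the real sagerx_lake.rxclass schema
rows = [
    {"rxcui": "1", "class_id": "A", "rela_source": "ATC"},
    {"rxcui": "2", "class_id": "B", "rela_source": "MEDRT"},
    {"rxcui": "1", "class_id": "A", "rela_source": "ATC"},  # duplicate row
]

# Build the DataFrame once from the collected rows instead of
# appending a new dataframe inside the loop
rxclass_df = pd.DataFrame(rows)

# pandas deduplication: keep the first occurrence of each identical row
rxclass_df = rxclass_df.drop_duplicates(ignore_index=True)
```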
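The asynchronous rate-limited requests could look something like the sketch below. The `fetch` function is a hypothetical stand-in for the real HTTP call (e.g. an aiohttp session request) so the example runs without network access; the throttling pattern (staggering task starts so at most 20 begin per second) is one simple way to hold the stated rate, not necessarily the exact mechanism used in the PR.

```python
import asyncio

RATE = 20  # target ceiling from the PR: 20 calls per second

async def fetch(url: str) -> str:
    # Stand-in for the real HTTP request; hypothetical so the
    # sketch is runnable without a network
    await asyncio.sleep(0)
    return f"fetched {url}"

async def throttled_fetch(i: int, url: str) -> str:
    # Stagger task starts so at most RATE requests begin per second
    await asyncio.sleep(i / RATE)
    return await fetch(url)

async def gather_all(urls: list[str]) -> list[str]:
    tasks = [asyncio.create_task(throttled_fetch(i, u))
             for i, u in enumerate(urls)]
    return await asyncio.gather(*tasks)

# Illustrative URLs only
urls = [f"https://example.org/rxclass/{n}" for n in range(5)]
results = asyncio.run(gather_all(urls))
```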
## Rationale
@jrlegrand was able to fix the code so that it runs, but mentioned that it ran slowly. I was able to confirm this, and the adjustments made reduced the time to create sagerx_lake.rxclass by more than half.
## Tests
What testing did you do?
Did a quick QA: counted the number of rows in the table before and after the change, and also counted the rows grouped by rela_source, verifying that the counts matched @jrlegrand's counts.
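The QA check described above can be expressed as a small pandas comparison. The frames below are tiny hypothetical snapshots standing in for the table before and after the change; only the `rela_source` column matters for the grouped count.

```python
import pandas as pd

# Hypothetical before/after snapshots of the table; values are illustrative
before = pd.DataFrame({"rela_source": ["ATC", "ATC", "MEDRT"]})
after = pd.DataFrame({"rela_source": ["ATC", "MEDRT", "ATC"]})

# Total row counts must match
assert len(before) == len(after)

# Row counts grouped by rela_source must match as well
before_counts = before.groupby("rela_source").size().sort_index()
after_counts = after.groupby("rela_source").size().sort_index()
assert before_counts.equals(after_counts)
```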
"Resolves" #331