coderxio / sagerx

Open drug data pipelines curated by pharmacists.
https://coderx.io/sagerx

Make rate limiting work with RxClass #331

Open jrlegrand opened 1 week ago

jrlegrand commented 1 week ago

Problem Statement

See related branch jrlegrand/rxclass-rework.

RxClass API has a rate limit of 20 calls / second.

There are 123,246 API calls to make.

[2024-11-22, 01:14:28 CST] {logging_mixin.py:137} INFO - URL List created of length: 123246

I'm no mathematician, but 20 calls / second x 60 seconds / minute = 1200 calls / minute. 123,246 calls / 1200 calls / minute ≈ 103 minutes, or roughly 1 hour and 43 minutes.
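For anyone who wants to double-check the estimate, here is the same arithmetic as a small Python sketch. The constant names are made up for illustration; the two numbers come from the rate limit and the log line above.

```python
import math

# Figures from this issue: 20 calls/second limit, 123,246 URLs to fetch.
RATE_LIMIT_PER_SECOND = 20
TOTAL_CALLS = 123_246

calls_per_minute = RATE_LIMIT_PER_SECOND * 60            # 1200
minimum_seconds = TOTAL_CALLS / RATE_LIMIT_PER_SECOND    # best case, no overhead
minimum_minutes = minimum_seconds / 60

# Round up to whole minutes, then split into hours and minutes.
hours, minutes = divmod(math.ceil(minimum_minutes), 60)
print(f"{minimum_minutes:.1f} minutes, about {hours} h {minutes} min")
```

This is a lower bound: it assumes every call is issued exactly at the limit with no network latency or processing time on top.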

When I run my branch locally, it runs for 1 hour and 43 minutes and errors out with the error below.

[2024-11-22, 02:58:01 CST] {taskinstance.py:1768} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/decorators/base.py", line 217, in execute
    return_value = super().execute(context)
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 175, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/operators/python.py", line 192, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/opt/airflow/dags/rxclass/dag_tasks.py", line 41, in extract_rxclass
    rxclasses = response['rxclassDrugInfoList']['rxclassDrugInfo']
KeyError: 'rxclassDrugInfoList'

As I'm writing this, I think the issue is more about what happens after the API calls have completed: the run time matches my #math above, and the error is a KeyError rather than anything that looks rate-limit related.

So the problem may not be with my rate limiting code at all, but either way this is not working and it would be great to have other eyes on it.
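For reference while reviewing, a minimal sketch of one way to stay under 20 calls / second is to space requests evenly. This is not the code on the branch; `RateLimiter` and its parameters are hypothetical names, and the injectable `clock` and `sleep` exist only to make it easy to test.

```python
import time


class RateLimiter:
    """Space calls evenly so that at most `max_calls` happen per `period` seconds."""

    def __init__(self, max_calls=20, period=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = period / max_calls  # 0.05 s at 20 calls/second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

Calling `limiter.wait()` immediately before each request would keep a single worker at or below the limit. With several threads or processes, the limiter would need to be shared and locked, which this sketch does not do.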

Criteria for Success

RxClass DAG runs in about 1 hour 45 minutes and does not error out.

Additional Information

https://lhncbc.nlm.nih.gov/RxNav/TermsofService.html

saywurdson commented 6 days ago

@jrlegrand it looks like the code is failing because it does not handle cases where the API response lacks the 'rxclassDrugInfoList' key (meaning there is no class data associated with the concept), which leads to a KeyError. The problem is happening in the process_concept function. We just need to figure out how to handle this situation more elegantly.

Potential solutions off the top of my head:

  1. We can add logging so that the concepts without class data are reported in the terminal.
  2. We can just skip these concepts.
  3. We can add them to the final table with className and the other class fields left blank.
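The three options above can be combined in one helper, sketched here. This is not the actual process_concept code; `rows_for_concept`, `keep_empty`, and the output column names are hypothetical, and the nested field names (`rxclassDrugInfo`, `rxclassMinConceptItem`, `className`) are assumed from the RxClass JSON response shape.

```python
import logging

logger = logging.getLogger(__name__)


def rows_for_concept(rxcui, response, keep_empty=False):
    """Turn one RxClass response into table rows without raising KeyError.

    `response` is the parsed JSON dict for a single concept.
    """
    # .get() with defaults covers a missing 'rxclassDrugInfoList' key.
    entries = response.get("rxclassDrugInfoList", {}).get("rxclassDrugInfo", [])
    if entries:
        return [
            {
                "rxcui": rxcui,
                "class_name": entry.get("rxclassMinConceptItem", {}).get("className"),
            }
            for entry in entries
        ]

    # Option 1: log the concept so it shows up in the task log.
    logger.info("No class data returned for rxcui %s", rxcui)
    # Option 3 (blank placeholder row) if keep_empty, otherwise option 2 (skip).
    return [{"rxcui": rxcui, "class_name": None}] if keep_empty else []
```

With `keep_empty=False` the concept is logged and skipped; with `keep_empty=True` it is logged and kept as a row with a blank class name.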

Let me know which solution you think is best moving forward and I'll see how I can fix the code so that it works.

jrlegrand commented 6 days ago

I pushed up some code to the branch. It works - see my most recent commit message. It runs in 2.5 hours, which could be optimized I'm sure. I noticed that when the key doesn't exist, it returns an empty object {}.

Also, I spot checked against RxClass for may_treat "Multiple Myeloma" and SageRx had 3 fewer IN drugs than the RxClass UI online. These ones were missing in SageRx:

IN 3639 doxorubicin
IN 612937 interferon alfa-n3
IN 72257 interferon beta-1b

https://mor.nlm.nih.gov/RxClass/search?query=Multiple%20Myeloma%7CDISEASE&searchBy=class&sourceIds=&drugSources=atc1-4%7Catcprod%2Cepc%7Cdailymed%2Cdisease%7Cmedrt%2Cchem%7Cdailymed%2Cmoa%7Cdailymed%2Cpe%7Cdailymed%2Cpk%7Cmedrt%2Ctc%7Cfmtsme%2Cva%7Cva%2Cdispos%7Csnomedct%2Cstruct%7Csnomedct%2Ctherap%7Csnomedct%2Cschedule%7Crxnorm

jrlegrand commented 5 days ago

Hmm... I don't see an IN listed for the may_treat Multiple Myeloma relationship in the API (I'm only seeing the PIN) so maybe it's not an issue with our code. Maybe it's some weird thing with RxClass UI?

API https://rxnav.nlm.nih.gov/REST/rxclass/class/byRxcui.json?rxcui=612937

NOTE: the only may-treat relation is a PIN with RXCUI 72258.

I suspect what the RxClass UI is doing is mapping PIN to IN if an IN doesn't already exist in the list. In other words, I see a lot of PINs that kind of have "sister" INs... except for these 3. They only show up as PINs. But you can map from PIN to IN to get the IN if that's preferred.
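If we wanted to mimic that suspected UI behavior, a rough sketch could look like the following. This is a guess at the logic, not anything confirmed about RxClass; `add_missing_ingredients` and `pin_to_in` are hypothetical names, and the PIN-to-IN lookup would have to come from RxNorm relationships.

```python
def add_missing_ingredients(members, pin_to_in):
    """Add an IN row for each PIN whose mapped IN is not already in the list.

    members:   list of dicts with 'rxcui' and 'tty' keys (class members).
    pin_to_in: dict mapping a PIN rxcui to its IN rxcui (assumed to be
               built elsewhere, e.g. from RxNorm relationships).
    """
    present = {m["rxcui"] for m in members if m["tty"] == "IN"}
    result = list(members)
    for member in members:
        if member["tty"] != "PIN":
            continue
        in_rxcui = pin_to_in.get(member["rxcui"])
        if in_rxcui and in_rxcui not in present:
            result.append({"rxcui": in_rxcui, "tty": "IN"})
            present.add(in_rxcui)
    return result


# Made-up identifiers, purely for illustration.
members = [
    {"rxcui": "1001", "tty": "PIN"},  # PIN with no "sister" IN in the list
    {"rxcui": "1002", "tty": "PIN"},  # PIN whose IN is already present
    {"rxcui": "2002", "tty": "IN"},
]
pin_to_in = {"1001": "2001", "1002": "2002"}
expanded = add_missing_ingredients(members, pin_to_in)
```

Here only the IN for the first PIN gets appended, because the second PIN already has its IN in the list.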

RxClass https://mor.nlm.nih.gov/RxClass/search?query=elotuzumab&searchBy=drug&sourceIds=&drugSources=atc1-4%7Catcprod%2Cepc%7Cdailymed%2Cdisease%7Cmedrt%2Cchem%7Cdailymed%2Cmoa%7Cdailymed%2Cpe%7Cdailymed%2Cpk%7Cmedrt%2Ctc%7Cfmtsme%2Cva%7Cva%2Cdispos%7Csnomedct%2Cstruct%7Csnomedct%2Ctherap%7Csnomedct%2Cschedule%7Crxnorm

RxNav https://mor.nlm.nih.gov/RxNav/search?searchBy=RXCUI&searchTerm=72258

jrlegrand commented 5 days ago

[Image: number of rows by rela_source]

saywurdson commented 4 days ago

https://github.com/coderxio/sagerx/pull/333 - potential optimization