bcgsc / pori_graphkb_loader

The Loaders for GraphKB. Imports content from external sources via the GraphKB REST API
https://bcgsc.github.io/pori
GNU General Public License v3.0

Bugfix/kbdev 1158 investigating gkb clinicaltrial org loader errors #135

Closed sshugsc closed 9 months ago

sshugsc commented 9 months ago

Update the clinical trial loader to use the new clinicaltrials.gov API.

mathieulemieux commented 9 months ago

@sshugsc , please post on the ticket the new log you're getting and add a link in the description. Also, check in the log if everything looks ok. I'll duplicate here a question I posted on KBDEV-1158 so we don't lose track: is it loading all the trials, only the newer ones, or can the user choose at run time?

sshugsc commented 9 months ago

~log file: clinicaltrialsgov_test1.logs.txt loaded data could be checked in db test_shirley1 It only loads the newer ones for now.~

creisle commented 9 months ago

> There are ~500k CTs at clinicaltrials.gov and ~70k in GKB prod, so we should be loading many more than 1000, unless we are only uploading the last 2 weeks; but then, why 1000 records? @sshugsc, can you please find how many records we should usually get?

Most of the trials are not cancer-relevant; we only load cancer-relevant trials. Additionally, we only load interventional trials (ones where they try to change a condition). A lot of trials are observational only, which isn't really useful to us.

> In the log file, we can see {"error":0,"success":1000} and "ClinicalTrial":{"created":213}, with 622 warn messages; are we sure these 1000 records are actually all successes? What happens to the ClinicalTrial record when a Disease term is not found?

Again, it's been a while, but afaik the clinical trial record will be created irrespective of the disease. However, if the disease term or drug term is not found, then the record isn't linked to it in the DB. There are a couple of reasons for not just creating everything we see as a drug/disease term, but the biggest one is that the naming is super inconsistent in clinicaltrials.gov. I remember we have numbers for this in the PORI paper.

> Documentation about these fields is needed. @sshugsc , maybe by looking on the website, by asking via email, or by pinging @creisle ?

https://classic.clinicaltrials.gov/ct2/resources/rss
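The linking behaviour described above (the trial record is always created, links are added only for terms that are found, and unmatched terms produce a warning) could be sketched roughly as follows. This is only an illustration, not the loader's actual code: `link_trial`, the `known_terms` lookup, the record fields, and the lowercase name matching are all hypothetical.

```python
import logging

logger = logging.getLogger("clinicaltrials_sketch")


def link_trial(trial, known_terms):
    """Build a trial record and link it only to the terms that can be matched.

    `trial` is a dict with an id plus free-text condition/intervention names
    (hypothetical shape). `known_terms` maps a lowercased term name to a
    record id (a hypothetical stand-in for a disease/drug lookup).
    """
    # The trial record itself is created whether or not any term matches.
    record = {"sourceId": trial["id"], "links": []}
    warnings = 0

    names = trial.get("conditions", []) + trial.get("interventions", [])
    for name in names:
        term = known_terms.get(name.strip().lower())
        if term is None:
            # Unmatched names are not turned into new terms; they are only logged.
            logger.warning("no match for term %r in trial %s", name, trial["id"])
            warnings += 1
            continue
        record["links"].append(term)

    return record, warnings
```

Under these assumptions, a run can report every trial as a success while still emitting many warnings, since an unmatched term never fails the trial itself.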

mathieulemieux commented 9 months ago

Ty @creisle for the link and explanations; I knew we were filtering the records but it's nice to have more context. The http request has a "count=10000" but something is limiting the results to 1000. @sshugsc , maybe there is pagination now?

github-actions[bot] commented 9 months ago

Unit Test Results

0 files  ±0  0 suites  ±0   0s :stopwatch: ±0s 0 tests ±0  0 :heavy_check_mark: ±0  0 :zzz: ±0  0 :x: ±0 

Results for commit f1a9c26d. ± Comparison against base commit b68e6a05.

sshugsc commented 9 months ago

> Ty @creisle for the link and explanations; I knew we were filtering the records but it's nice to have more context. The http request has a "count=10000" but something is limiting the results to 1000. @sshugsc , maybe there is pagination now?

Thank you @creisle for the explanations! @mathieulemieux yes, there is a pageSize limit of 1000, mentioned on their website: https://clinicaltrials.gov/data-api/api .

> @sshugsc , please post on the ticket the new log you're getting and add a link in the description. Also, check in the log if everything looks ok. I'll duplicate here a question I posted on KBDEV-1158 so we don't lose track: is it loading all the trials, only the newer ones, or can the user choose at run time?

Discussed with @elewis2 ; we plan to keep this PR limited to loading all the trials (after filters) with the clinicaltrials.gov API. @mathieulemieux A new commit is pushed and ready for review.
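Given the 1000-record page limit noted above, loading all trials means requesting pages repeatedly until the API stops returning a continuation token. The following is a minimal sketch of that loop, not the code in this PR. It assumes the response carries a `studies` list and an optional `nextPageToken`, and that the next page is requested with a `pageToken` parameter. Those field names and the endpoint URL are assumptions based on a reading of the public API docs and should be checked there. `iter_studies` and `fetch_page_http` are hypothetical names.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional

PAGE_SIZE = 1000  # page-size cap mentioned in this thread


def fetch_page_http(
    token: Optional[str] = None,
    base_url: str = "https://clinicaltrials.gov/api/v2/studies",  # assumed endpoint
) -> dict:
    """Fetch one page of results; parameter names are assumptions."""
    params = {"pageSize": PAGE_SIZE}
    if token:
        params["pageToken"] = token
    url = f"{base_url}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def iter_studies(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every study, following continuation tokens until none is returned."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("studies", [])
        token = page.get("nextPageToken")
        if not token:
            break
```

Passing the page fetcher in as a callable keeps the loop testable without network access; `iter_studies(fetch_page_http)` would be the live variant under the assumptions above.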

sshugsc commented 9 months ago

Log file: clinicaltrialsgov.logs.txt. Loaded data can be checked in db test_shirley5.

mathieulemieux commented 9 months ago

Next steps:

github-actions[bot] commented 9 months ago

Unit Test Results

0 files  ±0  0 suites  ±0   0s :stopwatch: ±0s 0 tests ±0  0 :heavy_check_mark: ±0  0 :zzz: ±0  0 :x: ±0 

Results for commit 351c45b1. ± Comparison against base commit b68e6a05.
