Open Jiros opened 4 years ago
With https://clinicaltrials.gov/ct2/download_studies?down_chunk=1
https://clinicaltrials.gov/ct2/download_studies?down_chunk=2
and so on, one can download all the raw data in XML format.
The only thing is that I can't find any information about how many chunks there are :)
Source: https://clinicaltrials.gov/ct2/resources/download#DownloadMultipleRecords
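Since the number of chunks is not documented, one option is to keep requesting chunks until one comes back empty. A minimal Python sketch of that idea follows. The stop condition (an empty response body marks the end) is an assumption, not something the ClinicalTrials.gov page confirms, and `iter_chunks`, `fetch` and `max_chunks` are hypothetical names introduced here.

```python
from typing import Callable, Iterator, Tuple
from urllib.request import urlopen

# URL pattern quoted at the top of this thread.
BASE_URL = "https://clinicaltrials.gov/ct2/download_studies?down_chunk={n}"


def chunk_url(n: int) -> str:
    """Build the download URL for chunk number n (1-based)."""
    return BASE_URL.format(n=n)


def http_fetch(url: str) -> bytes:
    """Download one URL and return the raw response body."""
    with urlopen(url) as response:
        return response.read()


def iter_chunks(
    fetch: Callable[[str], bytes] = http_fetch,
    max_chunks: int = 10_000,
) -> Iterator[Tuple[int, bytes]]:
    """Yield (chunk number, body) pairs until a chunk comes back empty.

    ASSUMPTION: an empty body signals that there are no more chunks.
    max_chunks is a safety cap so the loop always terminates.
    """
    for n in range(1, max_chunks + 1):
        body = fetch(chunk_url(n))
        if not body:
            break
        yield n, body
```

Passing `fetch` in as a parameter keeps the loop testable without touching the network; if the server signals the end differently (for example with an HTTP error), the stop condition would need to change.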
I would suggest to start with loading the basic trial description data (only for trials relevant to COVID) from clinicaltrials.gov api endpoint (see cypher query attached)
load_covid_trials_clintrials_gov.txt
Unfortunately max_rnk for this query is limited to 1000, so when there are more than 1000 trials in total, the query should be split into several more specific queries (e.g. per trial phase), with 'expr=covid' updated accordingly.
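An alternative to splitting by phase would be to page through the result ranks in windows. The sketch below is only that alternative, under two assumptions that are not confirmed in this thread: that the endpoint also accepts a `min_rnk` parameter, and that the limit of 1000 applies to the window size rather than to the absolute rank. `rank_windows`, `window_urls` and `base_url` are hypothetical names; the endpoint of the attached query is not shown here, so it is passed in.

```python
from typing import List, Tuple
from urllib.parse import urlencode


def rank_windows(total: int, size: int = 1000) -> List[Tuple[int, int]]:
    """Split ranks 1..total into consecutive (min_rnk, max_rnk) windows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [
        (start, min(start + size - 1, total))
        for start in range(1, total + 1, size)
    ]


def window_urls(base_url: str, expr: str, total: int, size: int = 1000) -> List[str]:
    """Build one request URL per rank window.

    ASSUMPTION: the endpoint understands min_rnk/max_rnk and fmt=json.
    """
    urls = []
    for lo, hi in rank_windows(total, size):
        params = {"expr": expr, "min_rnk": lo, "max_rnk": hi, "fmt": "json"}
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls
```

If the limit turns out to apply to the absolute rank instead, the per-phase split suggested above remains the way to go.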
After the basic info is loaded as nodes, some parts of it (e.g. PrimaryOutcomeMeasure) could already be parsed and linked to other data in the graph. Additional data from clinicaltrials.gov could then be loaded per trial (NCTId) with the following query:
MATCH (ct:ClinicalTrial)
CALL apoc.load.json('https://clinicaltrials.gov/api/query/full_studies?expr=' + ct.NCTId[0] + '&fmt=json') YIELD value
WITH value.FullStudiesResponse.FullStudies AS studies
UNWIND studies AS study
// add code to store in Neo
RETURN study
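For anyone who prefers to fetch the JSON outside of Neo4j first, the same per-trial lookup can be sketched in Python. The URL and the `FullStudiesResponse.FullStudies` path are taken from the query above; the function names are hypothetical, and the endpoint is the one quoted in this thread, which may not stay available.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint quoted in the Cypher query above.
FULL_STUDIES_URL = "https://clinicaltrials.gov/api/query/full_studies"


def full_study_url(nct_id: str) -> str:
    """Build the request URL for a single trial id."""
    return f"{FULL_STUDIES_URL}?{urlencode({'expr': nct_id, 'fmt': 'json'})}"


def extract_studies(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the list under FullStudiesResponse.FullStudies, or [] if absent."""
    return payload.get("FullStudiesResponse", {}).get("FullStudies") or []


def fetch_studies(nct_id: str) -> List[Dict[str, Any]]:
    """Download and unpack the studies for one trial id (needs network access)."""
    with urlopen(full_study_url(nct_id)) as response:
        return extract_studies(json.load(response))
```

`extract_studies` also works on a JSON file that was downloaded once and loaded with `json.load`, which avoids calling the API repeatedly.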
Description of the API: https://clinicaltrials.gov/api/gui/ref/api_urls
I have started to write the missing code (// add code to store Neo) unwinding the relevant info from JSON being returned. Do you suggest to add this additional info as nodes or as properties to the ClinicalTrial type nodes? I guess from your text it should be new nodes linked to ClinicalTrial nodes?
Yes, I suggest separate nodes linked to ClinicalTrial, especially for the data that could be linked to other data in the graph. What comes to my mind: endpoints and inclusion/exclusion criteria. If you feel that some of the data is not relevant for linking, we could leave it as properties for now and refactor the graph in the future if this data needs to be linked.
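To make the "separate nodes linked to ClinicalTrial" idea concrete, here is a rough Python sketch that prepares rows for a parameterised Cypher statement. The `PrimaryOutcome` label, the `HAS_PRIMARY_OUTCOME` relationship type and the helper name are hypothetical, not an agreed schema; the `NCTId[0]` lookup mirrors the query earlier in this thread, and the input is assumed to carry list-wrapped values.

```python
from typing import Any, Dict, List

# Hypothetical schema: one PrimaryOutcome node per distinct measure text,
# linked to the trial that reports it.
LINK_OUTCOMES_QUERY = """
UNWIND $rows AS row
MATCH (ct:ClinicalTrial) WHERE ct.NCTId[0] = row.nct_id
MERGE (o:PrimaryOutcome {measure: row.measure})
MERGE (ct)-[:HAS_PRIMARY_OUTCOME]->(o)
"""


def outcome_rows(trial: Dict[str, Any]) -> List[Dict[str, str]]:
    """Turn one trial record into rows for LINK_OUTCOMES_QUERY.

    ASSUMPTION: values are list-wrapped, e.g. {"NCTId": ["NCT..."], ...}.
    Blank measures are skipped and surrounding whitespace is trimmed.
    """
    nct_id = trial["NCTId"][0]
    measures = trial.get("PrimaryOutcomeMeasure", [])
    return [
        {"nct_id": nct_id, "measure": m.strip()}
        for m in measures
        if m.strip()
    ]
```

Using MERGE on the measure text means identical measures from different trials end up on one shared node, which is what makes later linking possible; data that stays as properties would simply be set on the ClinicalTrial node instead.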
I have started to write the missing code (// add code to store Neo) unwinding the relevant info from JSON being returned.
Awesome! Hint: To later integrate the data into the main graph, a docker image would be great. See https://github.com/covidgraph/data_template and https://github.com/covidgraph/motherlode for more information. If you have any questions, ping me (@tim.bleimehl:meet.dzd-ev.de).
Ok, thanks. BTW there seem to be 1095 studies containing COVID. I have downloaded the JSON and will use that instead of the URL, which has the limit of 1000.
Hi, sorry, but I have been busy with daily work and needed to get my head around the JSON input data. I have made a first attempt. For COVID studies I could not find any results yet. PrimaryOutcomeMeasure values are created as nodes, but the data is a bit messy. I have written my script in a Jupyter notebook (attached), using my own local graph for testing (that can be changed). Comments/feedback are more than welcome. I am happy to do more scripting, extending/changing what I have made so far. EligibilityCriteria could be added as a property; the in/exclusion criteria tend to be non-standard too. I would also appreciate feedback on the scripting :-) (it is not part of my daily work). @tim.bleimehl:meet.dzd-ev.de I think I need a bit of help if you need the stuff differently. clinicaltrials.ipynb.zip
@KirstenLangendorf Great work! The JSON is a mess (why is every single attribute value wrapped in a list :D ?) but it looks like you tamed it 😎 I could set up a repository with a bit of boilerplate code (python/docker setup), where you can then paste your queries in. Would that help you? I would try to do it this afternoon or tomorrow morning.
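On the "every value wrapped in a list" point: one way to tame it before loading is to unwrap single-element lists. This is only a sketch; `unwrap_singletons` is a hypothetical helper, not something from the notebook.

```python
from typing import Any, Dict


def unwrap_singletons(record: Dict[str, Any]) -> Dict[str, Any]:
    """Replace every one-element list value with its single element.

    Lists with zero or several elements, and non-list values, are kept as-is.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, list) and len(value) == 1:
            flat[key] = value[0]
        else:
            flat[key] = value
    return flat
```

The trade-off is that a field which is genuinely multi-valued but happens to hold one item also becomes a scalar, so for fields like conditions it may be safer to keep the list.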
Let me try it out. No rush - tomorrow is a Danish bank holiday. I think I will add Eligibility as nodes too. Saw the presentation by Martin Preusse and it seems that you are using Machine Learning type tools to combine messy data. Which tools are you using?
There is more data in the ClinicalTrials.gov - and hopefully also some study results at some point. What is the best way for me to get information about important data needed for the rest of the graph? will that be reading the use cases?
Saw the presentation by Martin Preusse and it seems that you are using Machine Learning type tools to combine messy data. Which tools are you using?
The ML/NLP team is still in an experimentation/PoC phase (as far as I can keep track of that atm). If you are interested, I can invite you to the chat group.
There is more data in the ClinicalTrials.gov - and hopefully also some study results at some point. What is the best way for me to get information about important data needed for the rest of the graph? will that be reading the use cases?
afaik atm there is no standardized process to determine that. A discussion in the CovidGraph chat group would be the most purposeful way atm.
Yes please, thank you :-)
Ok, will look out there.
Just saw you are already in the group :) (CovidGraph Data Analysis)
Hi Tim, I have time this weekend to work on covidgraph in case I should try out the python/docker setup.
Hi Kirsten,
you can start with https://github.com/covidgraph/data_template by clicking "Use this template" in the GitHub web interface. Basically you have to copy your queries into https://github.com/covidgraph/data_template/blob/master/dataloader/main.py
If you need any further help with git, docker or python, just ping me in the chat.
@KirstenLangendorf I can also help with the data loading template!
Thanks:-) I will start looking at the loading tomorrow. I am at work today.
Ok, couldn't help it. Had to look :-)
Documentation is made in https://github.com/covidgraph/data_template. I have made one: https://github.com/KirstenLangendorf/load_clinical_trials_gov and will fill in during tomorrow.
Do I just paste the queries I have in after line 22 (delete the rest)? in https://github.com/KirstenLangendorf/load_clinical_trials_gov/blob/master/dataloader/main.py
..ok, I will read through the instructions and get back to you once I have everything in my GitHub template.
@tim and @mpreusse I have now put the scripts in the dataloader folder: load_data, and data_profile for the stats queries. I have written a bit in the ReadMe.
https://github.com/KirstenLangendorf/load_clinical_trials_gov
I need help with the rest, since I am not quite sure how to make it execute and publish in the right way.
@KirstenLangendorf cool! I will have a deeper look at it tomorrow, fork it and try to bring it into an executable state.
@KirstenLangendorf that looks great! @motey tell me if I can help. Looks similar to e.g. the text fragger: there are no downloads, only Cypher queries. Pretty long ones though 😄
I know the queries are long, but that was to avoid calling the ClinicalTrials.gov JSON several times.
@KirstenLangendorf Hi Kirsten. I have done the following things today:
- Renamed data_profile and load_data to data_profile.cypher and load_data.cypher
- Created a function in main.py to read in your queries from the file data_profile.cypher
- Created a main function in main.py to run your queries
- Created a pipeline to build a docker image when there is a new release of the repository (aka git tag) and push the container to Docker Hub at covidgraph/data-clinical_trials_gov (see .github/workflows/build_container_prd.yml)
- Updated the readme.md
- Forked your whole repo to covidgraph/data_clinical-trials-gov and made you an admin (full rights). This was needed to allow me to add Docker Hub credentials and to have the repo in the same scheme as the others. If that is an issue for you, just let me know and we can find another solution.
Could you test the repo against a neo4j db? I am too lazy to set up a local neo4j instance with apoc :)
If the tests are successful I can integrate your script in the covidgraph dataloader pipeline 🚀
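For readers who cannot open the repo, the "read the queries from the .cypher file and run them" step could look roughly like the sketch below. This is not the actual main.py, which is not shown in this thread. It assumes the statements in the file are separated by semicolons, and `run` is a hypothetical stand-in for whatever executes a statement against Neo4j (for example a driver session's run method).

```python
from pathlib import Path
from typing import Callable, Iterable, List


def read_queries(path: str) -> List[str]:
    """Read a .cypher file and split it into individual statements.

    ASSUMPTION: statements are separated by ';'. This naive split would
    break on a semicolon inside a string literal.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [part.strip() for part in text.split(";") if part.strip()]


def run_queries(queries: Iterable[str], run: Callable[[str], object]) -> int:
    """Execute each statement in order via run() and return how many ran."""
    count = 0
    for query in queries:
        run(query)
        count += 1
    return count
```

Keeping the executor as a parameter means the file handling can be tested without a database, e.g. `run_queries(read_queries("load_data.cypher"), print)` just prints each statement.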
I installed the Docker app.
In my terminal I ran docker pull covidgraph/data-clinical_trials_gov
and then
docker build -t data-clinical_trials_gov .
which returns this error:
error checking context: 'can't stat '/Users/Kirsten/.Trash''.
Tried to google it but couldn't find a fix. @motey Do you know what to do?
stupid question, but did you try it with sudo :=) ?
nope - I can try. It reported same error :-(
Couldn't see your message on Riot - encrypted
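A possible explanation for the `can't stat` error, not confirmed in this thread: the `.` in `docker build -t data-clinical_trials_gov .` makes the current directory the build context, and the path in the message suggests the command was run from the home folder, so Docker tried to read everything under it, including `.Trash`. Running the build from inside the cloned repository should avoid that. The sketch below guards against it by checking for a Dockerfile first; it assumes the Dockerfile sits at the root of the checkout, and `build_command` and `build_image` are hypothetical helpers.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(context_dir: str, tag: str) -> List[str]:
    """Return the docker build command for an explicit context directory.

    Raises FileNotFoundError when the directory has no Dockerfile, which is
    a hint that it is not the repository checkout.
    """
    context = Path(context_dir).resolve()
    if not (context / "Dockerfile").is_file():
        raise FileNotFoundError(
            f"No Dockerfile in {context}; is this the repository checkout?"
        )
    return ["docker", "build", "-t", tag, str(context)]


def build_image(context_dir: str, tag: str) -> None:
    """Run the build (needs a working Docker installation)."""
    subprocess.run(build_command(context_dir, tag), check=True)
```

Note that after `docker pull` the prebuilt image is already available locally, so a local build is only needed for testing changes to the repository.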
For more information and comprehensive guidance see the excellent article from Kirsten Langendorf - https://www.s-cubed-global.com/news/covidgraph-nerds-response-to-the-pandemic
Repo
https://github.com/covidgraph/data_clinical-trials-gov
Description
Suggested by - lynnehansen
Add data about clinical trials. There are a few databases where the results of clinical trials are published. The most relevant general purpose databases are clinicaltrials.gov and clinicaltrialsregister.eu
There might be two more datasources:
Data Sources
https://clinicaltrials.gov/ https://www.clinicaltrialsregister.eu/
Note
All clinical studies registered on https://clinicaltrials.gov/ related to covid19.
Dependencies
None