covidgraph / motherlode

Pipeline for running all dataloader scripts for covidgraph in a controlled manner.
https://covidgraph.org
MIT License

clinicaltrials.gov #40

Open Jiros opened 4 years ago

Jiros commented 4 years ago

For more information and comprehensive guidance see the excellent article from Kirsten Langendorf - https://www.s-cubed-global.com/news/covidgraph-nerds-response-to-the-pandemic

Repo

https://github.com/covidgraph/data_clinical-trials-gov

Description

Suggested by - lynnehansen

Add data about clinical trials. There are a few databases where the results of clinical trials are published. The most relevant general purpose databases are clinicaltrials.gov and clinicaltrialsregister.eu

There might be two more datasources:

Data Sources

https://clinicaltrials.gov/
https://www.clinicaltrialsregister.eu/

Note

All clinical studies registered on https://clinicaltrials.gov/ related to covid19.

Dependencies

None

motey commented 4 years ago

With https://clinicaltrials.gov/ct2/download_studies?down_chunk=1, https://clinicaltrials.gov/ct2/download_studies?down_chunk=2 and so on, one can download all the raw data in XML format. The only thing is that I can't find any information about how many chunks there are :)

Source: https://clinicaltrials.gov/ct2/resources/download#DownloadMultipleRecords
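Since the number of chunks is unknown, one option is to keep requesting chunks until the server stops returning data. The following is only a minimal sketch of that loop. The `fetch` helper is hypothetical, and how the server signals "no more chunks" (empty body, error status, something else) is an assumption that would need to be checked against the real endpoint.

```python
from typing import Callable, Iterator, Optional

# URL pattern quoted in the comment above; the chunk number is filled in.
CHUNK_URL = "https://clinicaltrials.gov/ct2/download_studies?down_chunk={n}"


def iter_chunks(
    fetch: Callable[[str], Optional[bytes]],
    max_chunks: int = 10_000,
) -> Iterator[bytes]:
    """Yield chunk payloads until `fetch` reports that no chunk is left.

    `fetch` is a hypothetical helper: it takes a URL and returns the raw
    bytes, or None / empty bytes when the chunk does not exist. That stop
    signal is an assumption, not documented server behaviour.
    `max_chunks` is a safety limit so the loop cannot run forever.
    """
    for n in range(1, max_chunks + 1):
        payload = fetch(CHUNK_URL.format(n=n))
        if not payload:
            return
        yield payload
```

Passing the download function in as an argument keeps the loop independent of any particular HTTP library and makes it easy to try out with a stub before hitting the real server.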

paltusplintus commented 4 years ago

I would suggest starting by loading the basic trial description data (only for trials relevant to COVID) from the clinicaltrials.gov API endpoint (see the attached Cypher query):

load_covid_trials_clintrials_gov.txt

Unfortunately, max_rnk for this query is limited to 1000, so when there are more than 1000 trials in total, the query should be split into several more specific queries (e.g. one per trial phase), with 'expr=covid' updated accordingly.
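An alternative to splitting by topic would be to page through the result ranks in windows of at most 1000. The sketch below only builds the query strings. It assumes the endpoint also accepts a `min_rnk` parameter next to `max_rnk`, which this thread does not confirm, and the total number of matching trials has to be known beforehand.

```python
from urllib.parse import urlencode


def rank_windows(total: int, size: int = 1000):
    """Return (min_rnk, max_rnk) pairs that together cover ranks 1..total."""
    return [
        (start, min(start + size - 1, total))
        for start in range(1, total + 1, size)
    ]


def window_queries(expr: str, total: int, size: int = 1000):
    """Build one query string per rank window.

    `min_rnk` is an assumed parameter name; only `max_rnk`, `expr` and
    `fmt` appear in the thread.
    """
    return [
        urlencode({"expr": expr, "min_rnk": lo, "max_rnk": hi, "fmt": "json"})
        for lo, hi in rank_windows(total, size)
    ]
```

If `min_rnk` turns out not to be supported, the per-phase split suggested above remains the fallback.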

After the basic info is loaded as nodes, some parts of it (e.g. PrimaryOutcomeMeasure) could already be parsed and linked to other data in the graph. Additional data from clinicaltrials.gov could then be loaded per trial (NCTId) with the following query:

MATCH (ct:ClinicalTrial)
CALL apoc.load.json('https://clinicaltrials.gov/api/query/full_studies?expr=' + ct.NCTId[0] + '&fmt=json') YIELD value
WITH value.FullStudiesResponse.FullStudies AS studies
UNWIND studies AS study
// add code to store in Neo
RETURN study

Description of the API: https://clinicaltrials.gov/api/gui/ref/api_urls
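For testing outside Neo4j, the same per-trial lookup can be sketched in Python. The URL pattern and the `FullStudiesResponse.FullStudies` path are taken from the Cypher query above; the function names are made up for illustration, and the sketch does not perform the HTTP request itself.

```python
from urllib.parse import urlencode

# Endpoint as used in the Cypher query above.
FULL_STUDIES_URL = "https://clinicaltrials.gov/api/query/full_studies"


def full_study_url(nct_id: str) -> str:
    """Build the per-trial URL that the Cypher query concatenates by hand."""
    return FULL_STUDIES_URL + "?" + urlencode({"expr": nct_id, "fmt": "json"})


def extract_studies(response: dict) -> list:
    """Return the list under FullStudiesResponse.FullStudies, or [] if absent."""
    return response.get("FullStudiesResponse", {}).get("FullStudies", [])
```

`extract_studies` expects the already decoded JSON document, so it can be checked against a saved response file before any graph code is written.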

KirstenLangendorf commented 4 years ago

I have started to write the missing code (// add code to store in Neo), unwinding the relevant info from the JSON being returned. Do you suggest adding this additional info as nodes or as properties of the ClinicalTrial nodes? I guess from your text that it should be new nodes linked to the ClinicalTrial nodes?

paltusplintus commented 4 years ago

Do you suggest adding this additional info as nodes or as properties of the ClinicalTrial nodes? I guess from your text that it should be new nodes linked to the ClinicalTrial nodes?

Yes, I suggest separate nodes linked to ClinicalTrial, especially for data that could be linked to other data in the graph; what comes to mind is endpoints and inclusion/exclusion criteria. If you feel that some of the data is not relevant for linking, we could leave it as properties for now and refactor the graph later if linking becomes necessary.
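The split between "stays a property" and "becomes its own node" can be prototyped on a plain dictionary before any Cypher is written. This is only a sketch: the function name and the choice of `PrimaryOutcomeMeasure` as the default linked field are illustrative, and it assumes each linked field holds a list of values.

```python
def split_trial(record: dict, linked_fields=("PrimaryOutcomeMeasure",)):
    """Separate a trial record into node properties and linked-node candidates.

    Returns (props, nodes): `props` keeps every field that stays on the
    ClinicalTrial node, `nodes` is a list of (field_name, value) pairs,
    one per value that should become its own node.
    """
    props = {k: v for k, v in record.items() if k not in linked_fields}
    nodes = [
        (field, value)
        for field in linked_fields
        for value in record.get(field, [])
    ]
    return props, nodes
```

Moving a field from property to node later then only means adding its name to `linked_fields`, which matches the "refactor later" approach described above.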

motey commented 4 years ago

I have started to write the missing code (// add code to store in Neo), unwinding the relevant info from the JSON being returned.

Awesome! Hint: to integrate the data into the main graph later, a Docker image would be great. See https://github.com/covidgraph/data_template and https://github.com/covidgraph/motherlode for more information. If you have any questions, ping me (@tim.bleimehl:meet.dzd-ev.de).

KirstenLangendorf commented 4 years ago

OK, thanks. BTW, there seem to be 1095 studies containing COVID. I have downloaded the JSON and will use that instead of the URL, which has the limit of 1000.

KirstenLangendorf commented 4 years ago

Hi, sorry, but I have been busy with daily work and needed to get my head around the JSON input data. I have made a first attempt. For the COVID studies I could not find any results yet. PrimaryOutcomeMeasure values are created as nodes, but the data is a bit messy. I wrote my script in a Jupyter notebook (attached), using my own local graph for testing (that can be changed). Comments and feedback are more than welcome. I am happy to do more scripting to extend or change what I have made so far. EligibilityCriteria could be added as a property; the inclusion/exclusion criteria tend to be non-standard too. I would also appreciate feedback on the scripting :-) (it is not part of my daily work). @tim.bleimehl:meet.dzd-ev.de I think I need a bit of help if you need the stuff in a different form.

clinicaltrials.ipynb.zip

motey commented 4 years ago

@KirstenLangendorf Great work! The JSON is a mess (why is every single attribute value wrapped in a list? :D) but it looks like you tamed it 😎 I could set up a repository with a bit of boilerplate code (Python/Docker setup) where you can then paste your queries in, if that would help you. I would try to do it this afternoon or tomorrow morning.
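The "every value wrapped in a list" pattern can be flattened before the data is written to the graph. This is a minimal sketch, not the code from the attached notebook, and the function name is made up.

```python
def unwrap(record: dict) -> dict:
    """Replace single-element lists by their only element; keep the rest as-is."""
    return {
        key: value[0] if isinstance(value, list) and len(value) == 1 else value
        for key, value in record.items()
    }
```

One caveat of this approach: a genuinely multi-valued field that happens to hold a single value is also unwrapped, so the same field can end up as a scalar for one trial and a list for another. Where that matters, it may be safer to unwrap only an explicit set of fields known to be single-valued.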

KirstenLangendorf commented 4 years ago

@KirstenLangendorf Great work! The JSON is a mess (why is every single attribute value wrapped in a list? :D) but it looks like you tamed it 😎 I could set up a repository with a bit of boilerplate code (Python/Docker setup) where you can then paste your queries in, if that would help you. I would try to do it this afternoon or tomorrow morning.

Let me try it out. No rush - tomorrow is a Danish bank holiday. I think I will add Eligibility as nodes too. Saw the presentation by Martin Preusse and it seems that you are using Machine Learning type tools to combine messy data. Which tools are you using?

There is more data in the ClinicalTrials.gov - and hopefully also some study results at some point. What is the best way for me to get information about important data needed for the rest of the graph? will that be reading the use cases?

motey commented 4 years ago

Saw the presentation by Martin Preusse and it seems that you are using Machine Learning type tools to combine messy data. Which tools are you using?

The ML/NLP team is still in an experimentation/PoC phase (as far as I can keep track of that atm). If you are interested, I can invite you to the chat group.

There is more data in the ClinicalTrials.gov - and hopefully also some study results at some point. What is the best way for me to get information about important data needed for the rest of the graph? will that be reading the use cases?

afaik atm there is no standardized process to determine that. A discussion in the CovidGraph chat group would be the most purposeful way atm.

KirstenLangendorf commented 4 years ago

Saw the presentation by Martin Preusse and it seems that you are using Machine Learning type tools to combine messy data. Which tools are you using?

The ML/NLP team is still in an experimentation/PoC phase (as far as I can keep track of that atm). If you are interested, I can invite you to the chat group.

Yes please, thank you :-)

There is more data in the ClinicalTrials.gov - and hopefully also some study results at some point. What is the best way for me to get information about important data needed for the rest of the graph? will that be reading the use cases?

afaik atm there is no standardized process to determine that. A discussion in the CovidGraph chat group would be the most purposeful way atm.

OK, I will look there.

motey commented 4 years ago

Saw the presentation by Martin Preusse and it seems that you are using Machine Learning type tools to combine messy data. Which tools are you using?

The ML/NLP team is still in an experimentation/PoC phase (as far as I can keep track of that atm). If you are interested, I can invite you to the chat group.

Yes please, thank you :-)

Just saw you are already in the group :) (CovidGraph Data Analysis)

KirstenLangendorf commented 4 years ago

@KirstenLangendorf Great work! The JSON is a mess (why is every single attribute value wrapped in a list? :D) but it looks like you tamed it 😎 I could set up a repository with a bit of boilerplate code (Python/Docker setup) where you can then paste your queries in, if that would help you. I would try to do it this afternoon or tomorrow morning.

Hi Tim, I have time this weekend to work on covidgraph in case I should try out the python/docker setup.

motey commented 4 years ago

Hi Kirsten,

You can start with https://github.com/covidgraph/data_template by clicking "Use this template" in the GitHub web interface. Basically, you have to copy your queries into https://github.com/covidgraph/data_template/blob/master/dataloader/main.py

If you need any further help with git, Docker or Python, just ping me in the chat.

mpreusse commented 4 years ago

@KirstenLangendorf I can also help with the data loading template!

KirstenLangendorf commented 4 years ago

@KirstenLangendorf I can also help with the data loading template!

Thanks:-) I will start looking at the loading tomorrow. I am at work today.

Ok, couldn't help it. Had to look :-)
The documentation is in https://github.com/covidgraph/data_template. I have created my own repo from the template: https://github.com/KirstenLangendorf/load_clinical_trials_gov and will fill it in tomorrow.

In https://github.com/KirstenLangendorf/load_clinical_trials_gov/blob/master/dataloader/main.py, do I just paste my queries in after line 22 (and delete the rest)?

...OK, I will read through the instructions and get back to you once I have everything in my GitHub template.

KirstenLangendorf commented 4 years ago

@tim and @mpreusse I have now put the scripts in the dataloader folder: load_data, and data_profile for the stats queries. I have written a bit in the ReadMe.

https://github.com/KirstenLangendorf/load_clinical_trials_gov

I need help with the rest, since I am not quite sure how to make it execute and publish in the right way.

motey commented 4 years ago

@KirstenLangendorf Cool! I will have a deeper look at it tomorrow, fork it and try to bring it into an executable state.

mpreusse commented 4 years ago

@KirstenLangendorf that looks great! @motey tell me if I can help. Looks similar to e.g. the text fragger: there are no downloads, only Cypher queries. Pretty long ones though 😄

KirstenLangendorf commented 4 years ago

@KirstenLangendorf that looks great! @motey tell me if I can help. Looks similar to e.g. the text fragger: there are no downloads, only Cypher queries. Pretty long ones though 😄

I know the queries are long, but it was to avoid calling the ClinicalTrials.gov JSON several times.

motey commented 4 years ago

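@KirstenLangendorf Hi Kirsten, I have done the following things today:

  • Renamed data_profile and load_data to data_profile.cypher and load_data.cypher
  • Created a function in main.py to read in your queries from the file data_profile.cypher
  • Created a main function in main.py to run your queries
  • Created a pipeline to build a Docker image whenever there is a new release of the repository (aka git tag) and push the container to Docker Hub at covidgraph/data-clinical_trials_gov (see .github/workflows/build_container_prd.yml)
  • Updated the readme.md
  • Forked your whole repo to covidgraph/data_clinical-trials-gov and made you an admin (full rights). This was needed to allow me to add Docker Hub credentials and to have the repo in the same scheme as the others. If that is an issue for you, just let me know and we can find another solution

Could you test the repo against a Neo4j DB? I am too lazy to set up a local Neo4j instance with APOC :)

If the tests are successful, I can integrate your script into the covidgraph dataloader pipeline 🚀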

KirstenLangendorf commented 4 years ago

@KirstenLangendorf Hi Kirsten, I have done the following things today:

  • Renamed data_profile and load_data to data_profile.cypher and load_data.cypher
  • Created a function in main.py to read in your queries from the file data_profile.cypher
  • Created a main function in main.py to run your queries
  • Created a pipeline to build a Docker image whenever there is a new release of the repository (aka git tag) and push the container to Docker Hub at covidgraph/data-clinical_trials_gov (see .github/workflows/build_container_prd.yml)
  • Updated the readme.md
  • Forked your whole repo to covidgraph/data_clinical-trials-gov and made you an admin (full rights). This was needed to allow me to add Docker Hub credentials and to have the repo in the same scheme as the others. If that is an issue for you, just let me know and we can find another solution

Could you test the repo against a Neo4j DB? I am too lazy to set up a local Neo4j instance with APOC :)

If the tests are successful i can integrate your script in the covidgraph dataloader pipeline 🚀
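The approach listed above, reading the queries from a .cypher file and running them from main.py, could look roughly like the sketch below. This is not the actual main.py from the repo. The function names are made up, and the naive split on ";" assumes that no statement contains a semicolon inside a string literal or comment.

```python
from pathlib import Path
from typing import Callable, List


def read_statements(path: str) -> List[str]:
    """Read a .cypher file and return its non-empty statements.

    Assumption: statements are separated by ';' and no ';' occurs inside
    a string literal or comment.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [stmt.strip() for stmt in text.split(";") if stmt.strip()]


def run_file(path: str, run: Callable[[str], object]) -> int:
    """Run every statement in the file through `run`; return how many ran."""
    statements = read_statements(path)
    for statement in statements:
        run(statement)
    return len(statements)
```

With the official Neo4j Python driver, the `run` method of an open session could be passed as `run`. Keeping it as a parameter means the file handling can be tried out without a database.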

I installed the Docker app. In my terminal I ran

docker pull covidgraph/data-clinical_trials_gov

and then

docker build -t data-clinical_trials_gov .

which returns this error:

error checking context: 'can't stat '/Users/Kirsten/.Trash''.

I tried to google it but couldn't find a fix. @motey Do you know what to do?

motey commented 4 years ago

stupid question, but did you try it with sudo :=) ?

KirstenLangendorf commented 4 years ago

stupid question, but did you try it with sudo :=) ?

Nope - I can try. It reported the same error :-(

I couldn't see your message on Riot; it is encrypted.