Open karafecho opened 6 years ago
Swagger UI can be found at: https://ddcr.renci.org/apidocs/
Looks good to me. Thanks.
Based on today's meeting with @karafecho
@xu-hao : Please see the requests below.
The text/tabular view for the 'get feature association' function needs to be reformatted similar to the 'get associations to all features' function.
Please include the following links at the top of the API, under 'ICEES API documentation':
patient-level tables, column headers and binning strategy,
visit-level tables, column headers and binning strategy
CORRECTION: Instead of links to individual files, please add a link to the ICEES folder.
Reorder functionalities as follows: (1) 'get features', (2) 'get feature association', (3) 'get associations to all features'
For consistency, please make the following word changes to the description of each functionality:
a. Replace 'Create a new cohort by a set of feature variables and set the cohort id. Even if a cohort has already been created for this set before, a new cohort is created.' with 'Cohort discovery. Users define a cohort using any number of defined feature variables as input parameters, and the service returns a sample size. A new cohort is created even if a cohort was previously created using the same input parameters.'
b. Replace 'get features' with 'Feature-rich cohort discovery: users select a predefined cohort as the input parameter, and the service returns a profile of that cohort in terms of all feature variables.
c. Replace 'get feature association' with 'Hypothesis-driven 2 x 2 feature associations: users select a predefined cohort and two feature variables, and the service returns a 2 x 2 feature table with a corresponding Chi Square statistic and P value.'
d. Replace 'get associations to all features' with 'Exploratory 1 X N feature associations: users select a predefined cohort and a feature variable of interest, and the service returns a 1 x N feature table with corrected Chi Square statistics and associated P values.'
Is it possible to move the 'response content type' to the middle column? It's easy to overlook in the current position.
Can we include documentation or a dictionary for versioning of tables?
If you'd prefer that I complete the above tasks, just let me know.
@xu-hao : Note that I made a correction to item 2.
@xu-hao : In terms of prioritization, I'd consider items 2, 3, and 4 plus the acknowledgement text below to be the most important.
@xu-hao : One more thing: please add the text below beneath the text for the terms and conditions of the service. Thanks!
"We kindly request that Translator team members provide proper attribution for any products (e.g., manuscripts, podium presentations, software) derived from work related to Green Team's clinical datasets. Attribution should include acknowledgement of the funder (National Center for Advancing Translational Sciences [NCATS], Biomedical Data Translator Program awards, OT3TR002020 and OT2TR002514), the North Carolina Translational and Clinical Sciences (NC TraCS) Institute (NCATS, Center for Translational Science Award, UL1TR002489), UNC Hospitals and Health Care System, and all Green Team members who contributed to the work."
@xu-hao : I added an additional grant number to the text above. @xu-hao @ColinKCurtis : As discussed today, the addition of the above text is high priority and should be completed prior to the hackathon. Thanks!
@xu-hao : I believe you used panda.qcut for binning. I'm working with the data in SAS JMP and finding that the quintiles do not appear to align with yours. (See screenshots below for example distributions.) I think you binned by 'values' whereas I binned by 'population distribution'. I'm not sure which approach is best, but I do want to better understand the binning strategy that you used.
@karafecho It seems that the data I get from qcut matches the data on your screenshot:
For example, for EstResidentialDensity
:
Categories (5, interval[float64]): [(-0.001, 1250.0] < (1250.0, 1688.0] < (1688.0, 2204.0] <
(2204.0, 2662.0] < (2662.0, 9653.0]]
matches the Quantiles
tab in your screenshot
count 23093.000000
mean 2124.561815
std 1146.372076
min 0.000000
25% 1387.000000
50% 1987.000000
75% 2545.000000
max 9653.000000
matches the Summary Statistics
tab in your screenshot
Numbers of samples in each category is not shown directly on the screenshot, so I cannot compare with the result of pandas. If there is a mismatch, one possible cause of mismatch could be that SAS JMP and pandas handles values on the boundary differently. If you look at the output of pandas, all the categories are left exclusive and right inclusive. If SAS JMP has a different algorithm, then it will give different numbers.
Thanks, Hao. I'll look into this issue.
Note that I changed the links in item 2 above and added a second grant number and revised text to the proposed citation.
For 'get definition of a cohort': I think we should return the sample size with the features that define a cohort.
@ColinKCurtis : Could you please take at look at the following feature request?
@karafecho
It is already formatted similar to 'get associations to all features'.
+-----------------+-----------------------+------------------------+---------+
| feature | AgeStudyStart = 0-2 | AgeStudyStart <> 0-2 | |
+=================+=======================+========================+=========+
| ObesityBMI = 0 | 2013 8.72% | 21076 91.28% | 23089 |
| | 100.00% 8.72% | 99.98% 91.27% | 99.98% |
+-----------------+-----------------------+------------------------+---------+
| ObesityBMI <> 0 | 0 0.00% | 4 100.00% | 4 |
| | 0.00% 0.00% | 0.02% 0.02% | 0.02% |
+-----------------+-----------------------+------------------------+---------+
| | 2013 | 21080 | 23093 |
| | 8.72% | 91.28% | 100.00% |
+-----------------+-----------------------+------------------------+---------+
+-----------+---------------+
| p_value | chi_squared |
+===========+===============+
| 0.943927 | 0.38204 |
+-----------+---------------+
Is there anything specific you want me to change?
@ColinKCurtis I got 3 working. Had to patch Flasgger. So help is no longer needed for this feature request.
Thanks, Hao. Sorry for the confusion.
@xu-hao @ColinKCurtis : Can we add back a link to Hao's GitHub site above or below the link to my general-user documentation?
FYI:
This query (see curl in second screenshot):
Yielded this result:
Could be an error on my part. Not sure, however.
@xu-hao @ColinKCurtis : Can we add back a link to Hao's GitHub site above or below the link to my general-user documentation?
I am looking at this now!
@karafecho i fixed the bug. thanks for finding it.
@xu-hao @ColinKCurtis : I am wondering what the best approach is for pointing to plain-text documentation on ICEES. The information currently is available through a hyperlink from the UI to a Google folder, but I'm not sure this is the best approach. Thoughts?
@karafecho Simplest is just check them into github. Or we can check it into github and use github pages. Or we can use readthedocs. Given the current amount of documentation probably checking them into github is sufficient.
@xu-hao @ColinKCurtis : I don't think we need to include any statistics (Chi Square, P values) for functionality two: feature-rich cohort discovery.
ok, i'll remove those stats.
it's done.
@xu-hao @ColinKCurtis : Functionality 2, feature-rich cohort discovery, is not working for the visit-level tables, apparently because those tables do not contain the feature variable "TotalEDInpatientVisits".
Actually, I think something generally is wrong with the visit-level tables.
I'll take a look at this today.
@karafecho i fixed a bug that is causing some of the errors. other errors can be resolved by changing the table from patient to visit.
Oops! Well, that was a stupid mistake.
This works fine now. Thanks, Hao.
I'm having some issues with the ICEES API, specifically, functionality 4. The service isn't returning all feature associations. For instance, AvgDailyPM2.5Exposure x TotalEDInpatientVisits isn't returned for year 2011 patient-level queries. I checked the underlying tables, however, and the data are definitely there. Plus, I can generate the output using functionality 3 for 2 x 2 tables.
I'm also having trouble generating results for functionality 4, year 2010, patient-level queries (see example below). This started today; I did not have trouble on Friday, except for the missing feature associations using functionality 4.
The reason "AvgDailyPM2.5Exposure x TotalEDInpatientVisits" is not showing is because the maximum_p_value is too low. If you increase it to 1, it will show up.
Ahh! That makes perfect sense. I was confused because SAS JMP does not filter results, and I've been comparing output from the API with output from SAS JMP. I really need to keep these caveats in mind!
@xu-hao : Okay, I'm confused. COHORT:22 was originally defined as {} (all patients), 2010, patient table, v 1.0.0. I know this because I used that cohort to generate the data in the ICEES MS. Plus, Colin used that cohort in the Jupyter notebook. But now it is defined as follows:
{ "return value": [ { "size": 3958, "cohort_id": "COHORT:22", "features": { "TotalEDInpatientVisits": { "operator": "<", "value": 2 }, "MaxDailyPM2.5Exposure": { "operator": ">", "value": 2 } } },
I created COHORT:56 to create a new cohort defined the same as COHORT:22 originally was, but I'm concerned because COHORT:22 is referenced in many places now. Is it possible to delete it and then redefine it? If so, we can delete COHORT:56.
Yes, you can use the PUT operator on the Flasgger UI to override cohort definition. I've already restore COHORT:22 you can try it again.
On Tue, Oct 30, 2018 at 1:11 PM karafecho notifications@github.com wrote:
@xu-hao https://github.com/xu-hao : Okay, I'm confused. COHORT:22 was originally defined as {} (all patients), 2010, patient table, v 1.0.0. I know this because I used that cohort to generate the data in the ICEES MS. Plus, Colin used that cohort in the Jupyter notebook https://github.com/ncats/translator-workflows/blob/master/greengamma/general/ICEES_API_Notebook.ipynb. But now it is defined as follows:
{ "return value": [ { "size": 3958, "cohort_id": "COHORT:22", "features": { "TotalEDInpatientVisits": { "operator": "<", "value": 2 }, "MaxDailyPM2.5Exposure": { "operator": ">", "value": 2 } } },
I created COHORT:56 to create a new cohort defined the same as COHORT:22 originally was, but I'm concerned because COHORT:22 is referenced in many places now. Is it possible to delete it and then redefine it? If so, we can delete COHORT:56.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ResearchSoftwareInstitute/greendatatranslator/issues/134#issuecomment-434387330, or mute the thread https://github.com/notifications/unsubscribe-auth/ABtRbRGNUchlD46970ognVr509GDad4rks5uqIgqgaJpZM4UXbZG .
Excellent! Thanks, Hao. And yes, I totally used the wrong operator. Sorry!
Thanks for removing the 2011 tables from the ICEES API, Hao. As I mentioned, I think we need to refine the binning strategy for the CMAQ exposures in those tables before we openly release them.
once i get the FHIR data from James, i can try one of the neural network based approach. what do you think?
On Mon, Nov 5, 2018 at 5:27 PM karafecho notifications@github.com wrote:
Thanks for removing the 2011 tables, Hao. As I mentioned, I think we need to refine the binning strategy for the CMAQ exposures in those tables.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ResearchSoftwareInstitute/greendatatranslator/issues/134#issuecomment-436059063, or mute the thread https://github.com/notifications/unsubscribe-auth/ABtRbbDXTy7iiqtmzgvGsY6078I7b4Ipks5usLs3gaJpZM4UXbZG .
Sure, but let's chat first. I'm finding race-based differences in the PM2.5 distribution that are interesting and may have been masked had we optimized binning.
@xu-hao @ColinKCurtis : WRT binning, I'm convinced that we will never identify a one-strategy-fits-all approach. Instead, I'm wondering if we should:
Both are doable.
On Mon, Nov 12, 2018 at 1:38 PM karafecho notifications@github.com wrote:
@xu-hao https://github.com/xu-hao @ColinKCurtis https://github.com/ColinKCurtis : WRT binning, I'm convinced that we will never identify a one-strategy-fits-all approach. Instead, I'm wondering if we should:
- Optimize binning by precalculating tables using multiple binning strategies and incorporating a feature in the API/UI to allow users to select their preferred binning strategy.
- Optimize binning by incorporating a feature in the API/UI to allow users to collapse contiguous bins. For example, this feature would allow users to select the preferred number of bins for TotalEDInpatientVisits, MajorRoadwayHighwayExposure, etc.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ResearchSoftwareInstitute/greendatatranslator/issues/134#issuecomment-437971570, or mute the thread https://github.com/notifications/unsubscribe-auth/ABtRbRjNgxVORvEXTq0xKRtRUzmGNpwFks5uubUCgaJpZM4UXbZG .
Great! Let's plan on this, as we discussed earlier today.
Hao modified the API to include the binning optimization strategy (2) on 11/19/18. To be deployed.
Steve Appold (Kenan-Flagler) sent a new data pull for the nationwide ACS data. The estimates are similar, but not identical, to the estimates that he sent previously (he used a different source file). He did not include the variable "EstProbabilityNonHispWhite" (not sure why), although he did include standard error estimates. These will be used by FHIR PIT to generate the new integrated feature tables.
11/20/18: Hao deployed the new API. New operation is temporarily titled "feature_association2".
Example operation:
{ "feature_a":{ "AgeStudyStart":[ { "operator":"=", "value":"0-2" }, { "operator":"between", "value_a":"3-17", "value_b":"18-34" }, { "operator":"in", "values":["35-50","51-69"] },{ "operator":"=", "value":"70+" } ] }, "feature_b":{ "ObesityBMI":[ { "operator":"=", "value":0 }, { "operator":"<>", "value":0 } ] } }
@xu-hao : I cannot get the new functionality to run. Is there something wrong with my syntax?
{{ "feature_a":{ "AvgDailyPM2.5Exposure":[ { "operator":"in", "values":[1,2] }, { "operator":"=", "value":3, }, { "operator":"in", "values":[4,5], }} ] }, {"feature_b":{ "TotalEDInpatientVisits":[ { "operator":"<", "value":2 }, ] } }}
@xu-hao : When you get a chance, please replace paragraph three of the current DUA-like boilerplate with the following text:
"We kindly request that users of this service provide proper attribution for any products (e.g., manuscripts, podium presentations, software) derived from work related to ICEES. Attribution should include acknowledgement of the funder (National Center for Advancing Translational Sciences, National Institutes of Health, Biomedical Data Translator Program award [OT3TR002020 and OT2TR002514] and Center for Translational Science award [UL1TR002489], UNC Hospitals and Health Care System, and all team members who contributed to the work."
It's update.
From: karafecho notifications@github.com Sent: Friday, February 8, 2019 4:30 PM To: ResearchSoftwareInstitute/greendatatranslator Cc: Xu, Hao; Mention Subject: Re: [ResearchSoftwareInstitute/greendatatranslator] Develop Swagger UI for ICEES Smart API (#134)
@xu-haohttps://github.com/xu-hao : When you get a chance, please replace paragraph three of the current DUA-like boilerplate with the following text:
"We kindly request that users of this service provide proper attribution for any products (e.g., manuscripts, podium presentations, software) derived from work related to ICEES. Attribution should include acknowledgement of the funder (National Center for Advancing Translational Sciences, National Institutes of Health, Biomedical Data Translator Program award [OT3TR002020 and OT2TR002514] and Center for Translational Science award [UL1TR002489], UNC Hospitals and Health Care System, and all team members who contributed to the work."
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/ResearchSoftwareInstitute/greendatatranslator/issues/134#issuecomment-461953356, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABtRbbykj2iY_CJNCI3hfo_q8adoYhN-ks5vLexigaJpZM4UXbZG.
Thanks, but I modified the wording slightly:
"We kindly request that users of this service provide proper attribution for any products (e.g., manuscripts, podium presentations, software) derived from work related to ICEES. Attribution should include acknowledgement of the funder (National Institutes of Health, National Center for Advancing Translational Sciences, Biomedical Data Translator Program [awards OT3TR002020 and OT2TR002514] and Center for Translational Science Program [award UL1TR002489]), UNC Hospitals and Health Care System, and all team members who contributed to the work."
Sorry!
It's updated.
From: karafecho notifications@github.com Sent: Friday, February 8, 2019 5:51 PM To: ResearchSoftwareInstitute/greendatatranslator Cc: Xu, Hao; Mention Subject: Re: [ResearchSoftwareInstitute/greendatatranslator] Develop Swagger UI for ICEES Smart API (#134)
Thanks, but I modified the wording slightly:
"We kindly request that users of this service provide proper attribution for any products (e.g., manuscripts, podium presentations, software) derived from work related to ICEES. Attribution should include acknowledgement of the funder (National Institutes of Health, National Center for Advancing Translational Sciences, Biomedical Data Translator Program [awards OT3TR002020 and OT2TR002514] and Center for Translational Science Program [award UL1TR002489]), UNC Hospitals and Health Care System, and all team members who contributed to the work."
Sorry!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/ResearchSoftwareInstitute/greendatatranslator/issues/134#issuecomment-461974296, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ABtRbbyOO35Nks2YyQmqHsQ6sa95bqmmks5vLf9rgaJpZM4UXbZG.
Excellent! Thanks, Hao. Kara
The purpose of this issue is to develop a user-friendly UI to interface with the Smart API for the DDCR Service. This will allow SMEs and other users to access the DDCR Service through drop-down menus and other features, without having to execute curl commands.