Hi @maia-sh! Thank you so much for all the feedback. I will review the observations by Monday morning so that we can (hopefully) confirm the meeting with Peter, and if I have any doubts I will ping you on Teams. I really appreciate the code comments/tips to improve my style and clarity!
@maia-sh, a quick note on local-only data: agreed on pushing the data, with exceptions (size, personal data). In this case, I suggested keeping it local-only because we had information in the comments columns and we were still iterating. Now that we're further along, and provided @elisabascunan is also ok with it, I would suggest that we simply rename any comments columns that contain names, do a quick check of the info in the comments, and push both files to the repo.
@elisabascunan, if you make these changes, could you make them directly in the Google Doc and then re-download the files? Or did you make additional changes to the files locally that are not in the Google Doc? Thanks!
Hi @maia-sh, I have one question about applying the alternative approach for formatting tables that you suggested:
```r
#| label: alternative-registry-count

# Install packages if not already installed
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes") # to install packages from GitHub
remotes::install_github("maia-sh/ctregistries") # to analyse clinical trial registries
library(ctregistries)
```
I cannot install the ctregistries package from GitHub, and I think it is happening because I need access to the repo. Could you check whether I have access to it? Other than that, I have no other questions right now.
@delwen I'm ok with checking the comment column of the file and pushing it to the repo. I will come back to you if I find any problems or need to clarify how to do it correctly. Thank you for the suggestion!
Summary:
1. Descriptive analysis: What are the tools finding?
2. 50/50 checks (positive controls): If all tools pull out the same ID, is it indeed a valid ID (any registration id)?
3. 50/50 checks (negative controls): If none of the tools pull out an ID, is there indeed no ID (any registration id)?
4. Performance analysis: How “good” is each tool at detecting TRNs (since this is what they have in common)?
** I seem to remember other tool groups doing this analysis only where the tools disagreed, but I'm not sure.
Here are the 3 false negatives from `ctregistries`:

| Identifier | PMCID | Thoughts on why missed |
|---|---|---|
| NCT0002523 | PMC8798459 | too short |
| NCT2927821 | PMC8745481 | too short |
| NCT01768585 | PMC8024658 | unclear, found when rerun |
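As a quick illustration of why a pattern-based tool might miss the two "too short" IDs: a standard ctgov ID has 8 digits after "NCT". The regex below is illustrative, not necessarily the exact pattern `ctregistries` uses.

```r
ids <- c("NCT0002523", "NCT2927821", "NCT01768585")
# "NCT" followed by exactly 8 digits; the first two ids have only 7
grepl("NCT\\d{8}", ids)
#> [1] FALSE FALSE  TRUE
```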
NCT01768585: @delwen seems to have figured out that the issue was a space and a difference between tool outputs. Peter will resolve this in the input dataset, so this false negative should disappear.
@elisabascunan: could you please try to track NCT0002523 and NCT2927821 back from the papers they're mentioned in to their registrations (perhaps via another paper)? If you can add links for any steps to make it easier to follow, that would be helpful. Just post a reply here, please. For example: on a quick look, I do see NCT0002523 in PMC8798459, but in reference to another paper, and I don't find a registration for NCT0002523. So these may actually be invalid IDs.
Hi @maia-sh
I tracked back the two shorter IDs and found that they were misquoted in the papers we screened: when I found the original papers and compared the IDs, it seems a "0" was dropped.
Thanks for looking into this @elisabascunan! I realize we forgot to touch base on this in the meeting earlier, but I do think it should be mentioned in your report (and the extended results we eventually add) - would you mind adding it in your round of edits? The challenge is that the manual checks were for whether the ID found by one or more tools was truly in the paper, which isn't necessarily the same as whether that ID is a valid TRN. This is yet another level of complexity, and maybe something to note is that TRNscreener checks whether purported ctgov IDs resolve.
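On that last point, a resolution check like TRNscreener's could look roughly like this; a minimal sketch assuming the ClinicalTrials.gov v2 `studies` endpoint (404 for unknown ids), not TRNscreener's actual implementation:

```r
library(httr)

# Illustrative helper: does a purported ctgov id resolve on
# ClinicalTrials.gov? (v2 studies endpoint is an assumption)
resolves_on_ctgov <- function(nct_id) {
  url <- paste0("https://clinicaltrials.gov/api/v2/studies/", nct_id)
  status_code(GET(url)) == 200
}

resolves_on_ctgov("NCT01768585") # registered trial, should be TRUE
resolves_on_ctgov("NCT0002523")  # misquoted 7-digit id, should be FALSE
```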
Kudos @elisabascunan on this fantastic work 👏 Your use of quarto as well as your solid coding, in addition to your research insights in the main analysis guided by @delwen and your "Additional observations", are stellar. During my review, I had some comments regarding both the analysis itself (e.g., questions we're asking, interesting observations) and the code/data (e.g., style). The code comments are more for your development, if you like, and not at all important for this internal report. I address them all below and indicate the type of each comment.
Data
https://github.com/delwen/screenit-tool-comparison/blob/2ce19a4e64815f73fb728d7c7999353c47f9f674/descriptive-analysis.qmd#L41-L42
`regset.csv` and `nc_man_check.csv` are local-only, i.e., not saved in the repo, and would need to be added in order to be used. We try to keep our data alongside code in repos by default, with exceptions (e.g., large data, sensitive data, etc.). Why is the data local-only for this project? No problem if there's a rationale, but I'd suggest either adding a note explaining this, and what a reader needs, before you read in the first csv, or adding the data to the repo (perhaps touch base with @delwen, since this may have been a previous discussion I'm missing). Also, I see you used `.gitignore` for the data folder, which is great practice when working with local-only files!

I had to dig into `regset.csv` in order to understand which column was which and how the columns play into the analysis. As a reader, I find it super helpful to have a section at the start with some description of the data to come (side note: as a writer, I find making this description annoying or superfluous, since while I'm deep in the data it feels unnecessary, but even Future Me is a reader who appreciates the documentation!). This also motivated me to add a section on documentation to our onboarding, so feel free to check it out. Some aspects to consider documenting:
- Some columns of `regset.csv` were manually coded by you, and it would be great to make that clear, as I had to look a few times to figure this out.

Code
Coding style: Your code is clear and easy to follow - great job! Most of our code is written in tidyverse style, which is built on, and differs from, base R (which I see is predominant in your document) in its grammar. Personally, I found it tricky at first (and used only base R for a while), and now find it much easier to both write and read. This is not a concrete feedback item, but rather a suggestion: if you happen to find yourself between tasks and interested, you could review the resources mentioned in the onboarding on coding style. I mention this here as my feedback below will be mostly in tidyverse style.
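As a tiny illustrative example of the grammar difference (`regset` and `registry` are placeholder names; use the actual object and column from the analysis):

```r
library(dplyr)

# Base R: counts as a named table object
table(regset$registry)

# Tidyverse equivalent: counts as a data frame that can be piped onward
regset |>
  count(registry, sort = TRUE)
```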
Formatting tables

As you likely noticed, "raw" tables from R don't look so good in quarto (at least when they are larger). A simple way to create a formatted table is to add `knitr::kable()` after the table you want to display. Currently, you use `table()`, add a screenshot, and manually list the registry names: https://github.com/delwen/screenit-tool-comparison/blob/2ce19a4e64815f73fb728d7c7999353c47f9f674/descriptive-analysis.qmd#L52-L91

Here is an alternate approach, using existing data in the `ctregistries` package (you should first install `ctregistries`), tidyverse, and kable (see the sketch below). If you want to try this out, you could see if you can replace the screenshot you use here with code: https://github.com/delwen/screenit-tool-comparison/blob/2ce19a4e64815f73fb728d7c7999353c47f9f674/descriptive-analysis.qmd#L102-L104
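A rough sketch of that alternate approach (assumptions: `ctregistries` ships a `registries` lookup table with a `registry` column, and `regset` is the data frame read from `regset.csv`; check the actual names before running):

```r
library(tidyverse)
library(ctregistries)

# Count hits per registry, then join the registry list shipped with
# ctregistries so registries without hits still appear in the table
regset |>
  count(registry, name = "n_hits") |>
  right_join(registries, by = "registry") |>
  replace_na(list(n_hits = 0L)) |>
  select(registry, n_hits) |>
  arrange(desc(n_hits)) |>
  knitr::kable()
```

This replaces both the screenshot and the manually typed registry names, since the names come straight from the package data.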
Analysis
Your analysis is very insightful and there's a lot (too much) we can write up from it. A few remaining questions jumped out at me.
`regset.csv` includes only hits and is not a random sample of the entire population, so we can't evaluate true precision and recall. You have the building blocks already. Essentially, we need each `*_hit` column (for prediction) and `id_type_group` (for actual). `id_type_group` needs to be a boolean (i.e., TRUE/FALSE), so make "false_positive" and "protocol" FALSE (@delwen: I believe this is correct per past conversations with Peter, but correct me if I'm wrong). Precision: What proportion of positive identifications was actually correct? Recall: What proportion of actual positives was identified correctly? I went ahead and did this and now think that we can't do this pseudo-approach, but let's check in with Peter about this.

```r
data |>
  filter(id_type_group == "trn") |>
  count(ctregistries_hit, sciscore_hit, trialidentifier_hit, nct_hit)
```
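To make the precision/recall definitions concrete, here is a minimal sketch for one tool, assuming the `*_hit` columns are logical (the `actual` recoding is the one discussed above and should be confirmed with Peter):

```r
library(dplyr)

# Treat "trn" as a true TRN; "false_positive" and "protocol" become FALSE
metrics <- data |>
  mutate(actual = id_type_group == "trn") |>
  summarise(
    # Precision: of the ids the tool flagged, how many were actual TRNs?
    precision = sum(ctregistries_hit & actual) / sum(ctregistries_hit),
    # Recall: of the actual TRNs, how many did the tool flag?
    recall = sum(ctregistries_hit & actual) / sum(actual)
  )
```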