delwen / screenit-tool-comparison


Review of descriptive analysis #2

Open maia-sh opened 4 months ago

maia-sh commented 4 months ago

Kudos @elisabascunan on this fantastic work 👏 Your use of quarto and solid coding, together with your research insights in the main analysis guided by @delwen and in your "Additional observations", are stellar. During my review, I had some comments regarding both the analysis itself (e.g., questions we're asking, interesting observations) and the code/data (e.g., style). The code comments are more for your development, if you like, and not at all important for this internal report. I address them all below and indicate the type of each comment.

Data

https://github.com/delwen/screenit-tool-comparison/blob/2ce19a4e64815f73fb728d7c7999353c47f9f674/descriptive-analysis.qmd#L41-L42

  1. Local-only data: Per your email, regset.csv and nc_man_check.csv are local-only, i.e., not saved in the repo, and need to be added locally before the analysis can be run. We try to keep our data alongside code in repos by default, with exceptions (e.g., large data, sensitive data, etc.). Why is the data local-only for this project? No problem if there's a rationale, but I'd suggest either adding a note explaining this and what the reader needs to do before you read in the first csv, or adding the data to the repo (perhaps touch base with @delwen, since this may have been a previous discussion I'm missing). Also, I see you used .gitignore for the data folder, which is great practice when working with local-only files!
  2. Data description and sample: I found myself flipping back and forth between the report and regset.csv to understand which column was which and how they play into the analysis. As a reader, I find it super helpful to have a section at the start with some description of the data to come (side note: as a writer, I find making this description annoying or superfluous, since while I'm deep in the data it feels unnecessary, but even Future Me is a reader who appreciates the documentation!). This also motivated me to add a section to our onboarding on documentation, so feel free to check it out. Some aspects to consider documenting:
    • What data files are used in this analysis? Name and describe them. Also, if the data are local and need to be added, communicate this.
    • What variables are included and how were they produced? Some variables of regset.csv were manually coded by you, and it would be great to make that clear, as I had to look a few times to figure this out. (A rough sketch of such a documentation chunk follows this list.)
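
For item 2, a minimal, hypothetical sketch of such a documentation chunk could look like this (the file paths, and my guess at what nc_man_check.csv contains, are assumptions to adjust):

```{r}
#| label: data-documentation

# Data used in this analysis (local-only: data/ is gitignored, so add the files
# locally before rendering)
# - regset.csv: one row per id flagged by at least one tool, with the *_hit
#   columns per tool plus manually coded columns such as id_type and id_type_group
# - nc_man_check.csv: manual checks (presumably of the negative controls)

library(tidyverse)

regset <- read_csv("data/regset.csv")
nc_man_check <- read_csv("data/nc_man_check.csv")

glimpse(regset)
glimpse(nc_man_check)
```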

Code

  1. Coding style: Your code is clear and easy to follow - great job! Most of our code is written in tidyverse style, which is built on and differs from base R (which is what I see predominantly in your document) in its grammar. Personally, I found it tricky at first (and used only base R for a while) and now find it much easier to both write and read. This is not a concrete feedback item but rather a suggestion: if you happen to find yourself between tasks and interested, you could review the resources mentioned in the onboarding on coding style. I mention this here as my feedback below will be mostly in tidyverse style (a quick side-by-side sketch follows below).
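
For a quick flavor of the difference, here is the same registry count in both styles (an illustrative sketch only, reusing the `data` and `id_type` names from your report and assuming tidyverse is loaded):

```{r}
#| label: style-comparison

# Base R: contingency table of id types
table(data$id_type)

# Tidyverse: the same counts as a sorted data frame, built up in a pipe
data |> 
  dplyr::count(id_type, sort = TRUE)
```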

  2. Formatting tables: As you likely noticed, "raw" tables from R don't look so good in quarto (at least when they are larger). A simple way to create a formatted table is to add knitr::kable() after the table you want to display. Currently, you use table(), add a screenshot, and manually list the registry names: https://github.com/delwen/screenit-tool-comparison/blob/2ce19a4e64815f73fb728d7c7999353c47f9f674/descriptive-analysis.qmd#L52-L91 Here is an alternate approach, using the existing data in the ctregistries package (you should first install ctregistries), tidyverse, and kable:

```{r}
#| label: alternative-registry-count

# Install packages if not already installed
# install.packages("remotes") # to install packages from github
# remotes::install_github("maia-sh/ctregistries") # to analyse clinical trial registries

library(ctregistries)
library(tidyverse)

# Display registries and full names
ctregistries::registries |> 
  select(registry, databank_full_name) |> 
  knitr::kable()

# Use manually entered registry names, as they differ from ctregistries
registries <- c("ctgov", "umin", "drks", "irct", "chictr", "isrctn", "ctri", "eudract", "actrn", "jrct", "kct", "ntr", "pactr")
# setdiff(tolower(ctregistries::registries$registry), registries)
# setdiff(registries, tolower(ctregistries::registries$registry))

# Display counts of ids per registry
data |> 
  filter(id_type %in% registries) |> 
  count(id_type) |> 
  arrange(desc(n)) |> 
  knitr::kable()
```

*(screenshot: quarto-example — rendered output of the chunk above)*

If you want to try this out, you could see if you can replace the screenshot you use here with code: https://github.com/delwen/screenit-tool-comparison/blob/2ce19a4e64815f73fb728d7c7999353c47f9f674/descriptive-analysis.qmd#L102-L104

Analysis

Your analysis is very insightful and there's a lot (too much) we can write up from it. A few remaining questions jumped out at me.

  1. Pseudo precision and recall analysis: I say pseudo as regset.csv includes only hits and is not a random sample of the entire population, so we can't evaluate true precision and recall. You have the building blocks already. Essentially, we need each *_hit column (for the prediction) and id_type_group (for the actual). id_type_group needs to be a boolean (i.e., TRUE/FALSE), so make "false_positive" and "protocol" FALSE (@delwen: I believe this is correct per past conversations with Peter, but correct me if I'm wrong). Precision: what proportion of positive identifications was actually correct? Recall: what proportion of actual positives was identified correctly? I went ahead and did this, and now think that we can't do this pseudo-approach, but let's check in with Peter about it. (A small by-hand cross-check of these metrics is sketched after the chunk below.)
```{r}
#| label: analysis-precision-recall

# install.packages("yardstick") # to analyze tool performance, if not already installed
library(yardstick)

data_analysis <-
  data |> 

  # `id_type_group` represents the truth/actual
  # Recode to boolean and treat "protocol" as "false_positive"
  mutate(is_trn = if_else(id_type_group == "trn", TRUE, FALSE)) |> 

  # yardstick functions require factors, so recode booleans to factors
  # (put TRUE first: yardstick treats the first factor level as the event by default)
  mutate(across(c(ends_with("_hit"), is_trn), \(x) factor(x, levels = c(TRUE, FALSE)))) |> 

  select(is_trn, ends_with("_hit"))

# Create function to apply all metrics
evaluate_tools <-
  metric_set(accuracy, precision, recall, f_meas, spec, sens)

# Evaluate tools
bind_rows(
  evaluate_tools(data_analysis, truth = is_trn, estimate = sciscore_hit) |> 
    mutate(tool = "sciscore"),
  evaluate_tools(data_analysis, truth = is_trn, estimate = trialidentifier_hit) |>
    mutate(tool = "trialidentifier"),
  evaluate_tools(data_analysis, truth = is_trn, estimate = ctregistries_hit) |>
    mutate(tool = "ctregistries"),
  evaluate_tools(data_analysis, truth = is_trn, estimate = nct_hit) |>
    mutate(tool = "nct")
) |> 
  relocate(tool) |> 
  select(-".estimator") |> 
  tidyr::pivot_wider(names_from = "tool", values_from = ".estimate")
```
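
As a sanity check on the yardstick numbers, here is a minimal by-hand cross-check for one tool (ctregistries), computed directly from the definitions above and reusing `data_analysis` from the chunk above:

```{r}
#| label: precision-recall-by-hand

# precision = TP / (TP + FP); recall = TP / (TP + FN)
data_analysis |> 
  summarise(
    tp = sum(ctregistries_hit == "TRUE" & is_trn == "TRUE"),
    fp = sum(ctregistries_hit == "TRUE" & is_trn == "FALSE"),
    fn = sum(ctregistries_hit == "FALSE" & is_trn == "TRUE"),
    precision = tp / (tp + fp),
    recall = tp / (tp + fn)
  )
```
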
  2. Comparative descriptives: The independent explorations of each tool are great. It would also be helpful to look at them together. For example, this quick comparison already brings insights (e.g., ctregistries misses only 3 and finds 28 that are found by no other tool). Maybe poke around a bit more (one possible follow-up is sketched below).

```{r}
#| label: comparative-descriptives

data |> 
  filter(id_type_group == "trn") |> 
  count(ctregistries_hit, sciscore_hit, trialidentifier_hit, nct_hit)
```

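One possible follow-up (a rough sketch, assuming the *_hit columns are logical TRUE/FALSE) would be to pull out the trns ctregistries misses and the ones only it finds:

```{r}
#| label: ctregistries-follow-up

# trns missed by ctregistries (the 3 mentioned above)
data |> 
  filter(id_type_group == "trn", !ctregistries_hit)

# trns found by ctregistries and by no other tool (the 28 mentioned above)
data |> 
  filter(id_type_group == "trn",
         ctregistries_hit,
         !sciscore_hit, !trialidentifier_hit, !nct_hit)
```
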
3. **False positives**: You could poke around the false positives more. Now that you have the "what," you can think more about the "why." Perhaps no additional programming is needed, just looking at the table; as you see fit. As an end goal, you could write up a short paragraph on the false positives, incorporating also what you learned in the initial analysis (e.g., all from ctregistries) and this second pass (a rough starting sketch for a. and b. below follows after this list).
    a. _Location_: One reason we asked you to capture the location where the "trn" is detected is that we expect more false positives in sections beyond the methods, such as results (e.g., from tables being smushed). It would be interesting to take a look at where false positives are coming from. What sections? Do those sections also have true trns? Why might it be these sections?
    b. _Registries_: Are certain registries' trn patterns driving the false positives? And what type of false positives? Why might it be these patterns?
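
A rough starting sketch for a. and b. (the `location` column name is my assumption for whatever column records where the id was detected, and I'm assuming id_type records which registry's pattern matched; adjust to the actual names in regset.csv):

```{r}
#| label: false-positive-breakdown

false_positives <- data |> 
  filter(id_type_group == "false_positive")

# a. Which sections do false positives come from?
false_positives |> 
  count(location, sort = TRUE) |> 
  knitr::kable()

# b. Which registries' patterns drive the false positives?
false_positives |> 
  count(id_type, sort = TRUE) |> 
  knitr::kable()
```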

Again, amazing job, @elisabascunan !!!
elisabascunan commented 4 months ago

Hi @maia-sh ! Thank you so much for all the feedback. I will review the observations by Monday morning so that we can (hopefully) confirm the meeting with Peter, and if I have any doubts I will ping you on Teams. And I really appreciate the code comments/tips to improve my style and clarity!

delwen commented 4 months ago

@maia-sh, a quick note on local-only data: agreed on pushing the data, with exceptions (size, personal data). In this case, I suggested keeping it local-only because we had information in comments, and we were still iterating. Now that we're further along, and provided @elisabascunan is also ok with it, I would suggest that we simply rename any comments columns that contain names, do a quick check of the info in the comments, and push both files to the repo.

@elisabascunan, if you do these changes, could you make them directly in the google doc, and then re-download the files? Or did you make additional changes to the files locally that are not in the google doc? Thanks!

elisabascunan commented 4 months ago

Hi @maia-sh I have one question about applying the alternative approach for formatting tables that you suggested:

```{r}
#| label: alternative-registry-count

# Install packages if not already installed
install.packages("remotes") # to install packages from github
remotes::install_github("maia-sh/ctregistries") # to analyse clinical trial registries

library(ctregistries)
```

I cannot install the ctregistries package from GitHub, and I think this is because I need access to the repo. Could you check if I have access to it? Other than that, I have no other doubts right now.

@delwen I'm ok with checking the comments column of the file and pushing it to the repo. I will come back to you if I find any problems or to clarify how to do it correctly. Thank you for the suggestion!

delwen commented 4 months ago

Summary:

1. Descriptive analysis: What are the tools finding?

2. 50/50 checks (positive controls): If all tools pull out the same ID, is it indeed a valid ID (any registration id)?

3. 50/50 checks (negative controls): If none of the tools pull out an ID, is there indeed no ID (any registration id)?

4. Performance analysis: How “good” is each tool at detecting TRNs (since this is what they have in common)?

** I seem to remember other tool groups doing this analysis only where the tools disagreed, but I'm not sure (a rough sketch of that subset is below).
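
In case it helps the discussion, one rough sketch of "only where tools disagreed" (assuming the logical *_hit columns in regset.csv) could be:

```{r}
#| label: tool-disagreement

# Rows where not every tool flagged the id, i.e., the tools disagree
data |> 
  filter(!(sciscore_hit & trialidentifier_hit & ctregistries_hit & nct_hit)) |> 
  count(sciscore_hit, trialidentifier_hit, ctregistries_hit, nct_hit)
```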

maia-sh commented 4 months ago

Here are the 3 false negatives from ctregistries:

| Identifier | PMCID | Thoughts on why missed |
| --- | --- | --- |
| NCT0002523 | PMC8798459 | too short |
| NCT2927821 | PMC8745481 | too short |
| NCT01768585 | PMC8024658 | unclear, found when rerun |

NCT01768585: @delwen seems to have figured out that the issue was a space and a difference between tool outputs. Peter will resolve this in the input dataset, so this false negative should disappear.

@elisabascunan : could you please try to track back NCT0002523 and NCT2927821 from the papers they're mentioned in, to their registration (perhaps via another paper)? If you can add links for any steps to make it easier to follow, that would be helpful. Just posting a reply here, please. For example: NCT0002523, on a quick look, I see it indeed in PMC8798459 but in reference to another paper, and I don't find a registration for NCT0002523. So these may actually be false.

elisabascunan commented 2 months ago

Hi @maia-sh

I tracked back the two shorter IDs and found that they were misquoted in the papers we screened: when I found the original papers and compared the IDs, it seems each is missing a "0".

| ID | paper where it is from | quote | registration | comment |
| -- | -- | -- | -- | -- |
| NCT0002523 | https://pubmed.ncbi.nlm.nih.gov/PMC8798459/ | "All studies except the NCT0002523 trial conducted by Bosset et al. (26)" | According to the original paper (Bosset), it is registered at ClinicalTrials.gov under NCT00002523. | The paper quoting this study missed a 0 in the ID, which is why the NCT is shorter and why it does not show up on the registry. |
| NCT2927821 | https://pubmed.ncbi.nlm.nih.gov/PMC8745481/ | "ADS Plus Clinical Trial #NCT-2927821" | When searching for the ADS Plus Clinical study on ClinicalTrials.gov, there is a study with the number NCT02927821. | It seems the study quoted in the paper we screened is missing a 0, which could be why the ID is shorter. |

maia-sh commented 2 months ago

Thanks for looking into this @elisabascunan ! I realize we forgot to touch base on this in the meeting earlier, but I do think this should be mentioned in your report (and the extended results we eventually add) - would you mind adding it in your round of edits? The challenge is that the manual checks were for whether the id found by one or more tool(s) was truly in the paper, which isn't necessarily the same as whether the id found was truly a valid TRN. This is yet another level of complexity, and maybe something to note is that TRNscreener checks whether purported ctgov ids resolve (a rough sketch of such a check is below).
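
For reference, a quick resolve-style check for a single ctgov id could look like the sketch below (the ClinicalTrials.gov API v2 endpoint and the httr call are my assumptions, not how TRNscreener actually does it):

```{r}
#| label: check-ctgov-id-resolves

library(httr)

nct_id <- "NCT00002523"  # the corrected id from the table above
resp <- GET(paste0("https://clinicaltrials.gov/api/v2/studies/", nct_id))
status_code(resp) == 200  # TRUE if the id resolves to a registration
```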