Pull request: SciScore additional analysis

elisabascunan commented 3 months ago

For the SciScore analysis, there are two new tables with:

the total number of ctgov IDs found in the methods section
the total number of ctgov IDs found by ctgov that were in the methods section.

Additionally, the titles of the location ID per tool plots were modified, eliminating the phrase "detected by toolname".

delwen commented 3 months ago

@elisabascunan see below some additional changes needed to the dataset. As a sanity check, please review the corrections again before incorporating them. I think the easiest approach is to make these changes in the Google Sheet and then read the updated dataset into your analysis workflow. That way the changes will be integrated everywhere (but please push back on this if you're seeing an issue I overlooked!).

Rename "TrialIdentifier" to "TRNscreener" in line with the updates in the main paper.
NCT0002523: when I looked into some of the IDs/papers that were not detected by SciScore, I saw that this ID is in the results section of the paper (and not in the methods as indicated in the extraction form), so we should correct this in the extraction form and analysis.
NCT00418041: this was also detected in the notes/metadata of the article. We could potentially reflect this by setting id_in_other_location to TRUE (+ comment), unless you didn't typically consider metadata?
NCT01768585: right now, we have two entries for this trial ID (NCT01768585 and NCT 01768585) for the same paper (PMC8024658). We decided that we will collapse these, since we only have a second row for this ID because ctregistries is cleaning the detected identifier. In reality, the ID as written in the paper was found by TrialIdentifier (or TRNscreener), ctregistries, and NCT naive tool. So concretely, you can merge these two rows and keep as Identifier "NCT01768585" and set sciscore_hit to FALSE, and trialidentifier_hit (or trnscreener_hit), ctregistries_hit, and nct_hit to TRUE. Ideally add a comment explaining that these two entries were merged so we can keep track. Note that this resolves one of the questions we had in this issue: https://github.com/delwen/screenit-tool-comparison/issues/2#issuecomment-1976881390.
Once these changes are made, re-generate the analysis.
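
For illustration only, the merge described for NCT01768585 could be sketched as below. This is a minimal Python sketch, not the project's actual workflow: the `pmcid` and `comment` column names, the `normalize_id` and `merge_duplicate_rows` helpers, and the per-row hit values in the example are assumptions made up for the demonstration. Only the `Identifier` column, the `*_hit` column names, the two spellings of the ID, and the paper PMC8024658 come from the discussion above.

```python
import re

# Hit columns mentioned in the review; "trnscreener_hit" assumes the rename was applied.
HIT_COLUMNS = ["sciscore_hit", "trnscreener_hit", "ctregistries_hit", "nct_hit"]


def normalize_id(raw):
    """Strip whitespace so 'NCT 01768585' and 'NCT01768585' compare equal."""
    return re.sub(r"\s+", "", raw).upper()


def merge_duplicate_rows(rows):
    """Collapse rows sharing (pmcid, normalized Identifier), OR-ing the hit flags."""
    merged = {}
    spellings = {}
    for row in rows:
        key = (row["pmcid"], normalize_id(row["Identifier"]))
        spellings.setdefault(key, []).append(row["Identifier"])
        if key not in merged:
            # First occurrence: copy the row and store the cleaned identifier.
            merged[key] = dict(row, Identifier=key[1])
        else:
            # Later occurrence: a tool counts as a hit if it hit in any row.
            for col in HIT_COLUMNS:
                merged[key][col] = merged[key][col] or row[col]
    for key, row in merged.items():
        if len(spellings[key]) > 1:
            # Record the merge so it can be traced later.
            row["comment"] = "merged entries: " + ", ".join(spellings[key])
    return list(merged.values())


# Hypothetical input: the split of hits across the two rows is invented.
example_rows = [
    {"pmcid": "PMC8024658", "Identifier": "NCT01768585", "sciscore_hit": False,
     "trnscreener_hit": False, "ctregistries_hit": True, "nct_hit": False},
    {"pmcid": "PMC8024658", "Identifier": "NCT 01768585", "sciscore_hit": False,
     "trnscreener_hit": True, "ctregistries_hit": False, "nct_hit": True},
]

result = merge_duplicate_rows(example_rows)
```

With this made-up input, the single merged row keeps "NCT01768585" as its Identifier, leaves `sciscore_hit` FALSE, and sets the other three hit columns to TRUE, which matches the target state described above. The same logic would need to be checked against the real rows before relying on it.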

Thanks, and let me know if this raises any questions!

elisabascunan commented 3 months ago

@delwen thank you for the feedback! I made the changes to the dataset and updated the file regset.csv accordingly. I also committed the newest version of the Quarto report with the correct numbers. Let me know if you need me to add anything else.

elisabascunan commented 1 month ago

Hi @delwen

I just went through most of the comments, and my only doubts concern the last one, regarding shorter NCTs (stated above as a reply). To answer the main questions of this review:

NCT00418041 (metadata): I only checked the PDFs, so I wanted to clarify what you mean by metadata. In this particular case, I think I simply missed it in the file, as it appears on the first page (see attached photo). If that is a section that you consider metadata, then adding this category would be consistent with the process. However, if it is something entirely different and not available from a PDF, then we might need to change this location category.

[Screenshot 2024-05-30 at 09 26 07]

Numbers in Quarto: I completed the manual review, but I will look into the option of replacing the numbers with inline code.
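
As a rough idea of what replacing hand-typed numbers could look like, here is a minimal Python sketch that derives the counts from the dataset and builds the sentence from them. The `id_in_methods` column, the `summarize` helper, and the example rows are hypothetical; the report itself may use a different language and different column names.

```python
def summarize(rows, tool_column):
    """Count IDs located in the methods section and how many of those a tool detected."""
    in_methods = [r for r in rows if r["id_in_methods"]]
    detected = [r for r in in_methods if r[tool_column]]
    return {"n_in_methods": len(in_methods), "n_detected": len(detected)}


# Made-up rows purely for demonstration; not taken from regset.csv.
example_rows = [
    {"Identifier": "EXAMPLE-1", "id_in_methods": True, "sciscore_hit": True},
    {"Identifier": "EXAMPLE-2", "id_in_methods": True, "sciscore_hit": False},
    {"Identifier": "EXAMPLE-3", "id_in_methods": False, "sciscore_hit": False},
]

stats = summarize(example_rows, "sciscore_hit")

# The sentence is assembled from computed values, so it updates with the data.
sentence = (
    f"SciScore detected {stats['n_detected']} of "
    f"{stats['n_in_methods']} IDs reported in the methods section."
)
print(sentence)
```

In a Quarto report, values like these could be referenced through inline expressions instead of being typed by hand, so the text stays in sync whenever the dataset changes.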

I will wait for your comments before pushing the new version. Thank you in advance!

delwen / screenit-tool-comparison

Pull request: SciScore additional analysis #5