epiverse-connect / epiverse-search

Identify data sources for the search engine #2

Closed Bisaloo closed 2 months ago

Bisaloo commented 6 months ago

As described in #1, we have the full list of packages (= potential search results) in https://cran.r-project.org/view=Epidemiology. But this doesn't fully answer the question of where the data describing these packages to the LLM should come from. As far as I can tell, we have several options.

Package Description

All R packages include a description of what they are about, sometimes with references. The length of this description varies widely but averages around 3-4 sentences.

Example for linelist:

Provides tools to help storing and handling case line list data. The 'linelist' class adds a tagging system to classical 'data.frame' objects to identify key epidemiological data such as dates of symptom onset, epidemiological case definition, age, gender or disease outcome. Once tagged, these variables can be seamlessly used in downstream analyses, making data pipelines more robust and reliable.

Example for epicontacts:

A collection of tools for representing epidemiological contact data, composed of case line lists and contacts between cases. Also contains procedures for data handling, interactive graphics, and statistics.
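To illustrate how this field could be extracted, here is a minimal Python sketch that parses the DCF format used by R DESCRIPTION files and joins continuation lines. The helper name `parse_dcf` and the abbreviated `SAMPLE` record are assumptions made for illustration. In R itself, `read.dcf()` already does this job.

```python
def parse_dcf(text: str) -> dict[str, str]:
    """Parse one DCF record (the format of an R DESCRIPTION file).

    Simplified sketch: assumes a single record and 'Field: value' lines,
    with continuation lines starting with whitespace.
    """
    fields: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and current is not None:
            # Continuation of the previous field: join with a single space
            fields[current] += " " + line.strip()
        else:
            # Split at the first colon only, so values may contain colons
            key, _, value = line.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    return fields


# Abbreviated, illustrative record: only two fields of a real DESCRIPTION file
SAMPLE = """Package: epicontacts
Description: A collection of tools for representing epidemiological
    contact data, composed of case line lists and contacts between cases.
    Also contains procedures for data handling, interactive graphics, and
    statistics.
"""

record = parse_dcf(SAMPLE)
print(record["Package"])
print(record["Description"])
```

The wrapped `Description` field comes back as a single line of text, which is probably the form we would want to pass to the LLM.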

Package vignettes

Package vignettes are a longer form of documentation that introduces concepts and usage of the package via literate programming.

Example for linelist:

Examples for epicontacts:

Package manual

The package manual (PDF or HTML (https://github.com/epiverse-connect/epiverse-search/issues/2#issuecomment-2097618994)) contains a list of functions, their purpose, usage, inputs and outputs, with examples. It is also somewhat more standardized by CRAN than the previously mentioned data sources.

Example for linelist: https://cran.r-project.org/web/packages/linelist/linelist.pdf

Examples for epicontacts: https://cran.r-project.org/web/packages/epicontacts/epicontacts.pdf

pkgdown website

The pkgdown website is a one-stop shop for R package documentation, putting in one place the package description, README, vignettes, manual, release notes, etc. Unfortunately, not all packages have a pkgdown website.

Example for linelist: https://epiverse-trace.github.io/linelist/

Example for epicontacts: https://www.repidemicsconsortium.org/epicontacts/

A mix of different sources

It may also be possible to use all the available sources or a mix of them.
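As a rough sketch of what mixing sources could look like, the Python snippet below merges whatever is available for a package into one labelled text document and skips sources that are missing. The names `PackageSources` and `build_document`, the section labels, and the example description text are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PackageSources:
    """Text gathered for one package; any source may be missing."""
    name: str
    description: str = ""
    vignettes: list[str] = field(default_factory=list)
    manual: str = ""


def build_document(pkg: PackageSources) -> str:
    """Join the available sources into one labelled document."""
    parts = [f"Package: {pkg.name}"]
    if pkg.description:
        parts.append(f"Description:\n{pkg.description}")
    # Number the vignettes so each one keeps its own labelled section
    for i, vignette in enumerate(pkg.vignettes, start=1):
        parts.append(f"Vignette {i}:\n{vignette}")
    if pkg.manual:
        parts.append(f"Manual:\n{pkg.manual}")
    return "\n\n".join(parts)


# A package for which only the description has been collected so far
doc = build_document(
    PackageSources(name="epicontacts", description="A collection of tools.")
)
print(doc)
```

Keeping a label on each section would let us later compare how much each source contributes to the search results, or drop one without rebuilding everything.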

chartgerink commented 6 months ago

Optionally, we could explore using the newly released tools::pkg2HTML function (R@4.4.0) to generate the documentation for the relevant packages as input.

Bisaloo commented 4 months ago

@avinashladdha:

Please let me know if it'd be more productive to talk through this. I'm happy to set aside 30 min for the 3 of us to resolve this conversation and start writing the scraper so you have the data you need for the next steps.

avinashladdha commented 4 months ago

Bisaloo commented 3 months ago

Documenting some takeaways from our latest discussion here for transparency: