Optionally, we could explore using the newly released tools::pkg2HTML()
function (R 4.4.0) to generate the documentation for the relevant packages as input.
@avinashladdha:
In your experience, what kind of text would be more informative for an LLM? Technical documentation of each function (e.g., https://epiverse-trace.github.io/linelist/reference/validate_linelist.html) or tutorials / walkthroughs? Or as much content as possible?
What kind of format would you prefer to receive this data in? We can take care of writing the scraper, but it's unclear to me how the data should be pre-processed and stored before handing it over to you.
Please let me know if it'd be more productive to talk through this. I'm happy to set aside 30 min for the 3 of us to resolve this conversation and start writing the scraper so you have the data you need for the next steps.
Given our current use case (natural search for package functionality), tutorials/walkthroughs would be relatively more useful. They can serve as a more detailed version of the technical information for most user searches.
Markdown, JSON or text files would be helpful.
Documenting some takeaways from our latest discussion here for transparency:
├── tool1/ *(replace by actual name)*
│ ├── vignette1.Rmd *(replace by actual name)*
│ ├── vignette2.Rmd
│ └── manual.md
├── tool2/
.
.
.
└── toolN/
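A minimal sketch of how a loader might walk this layout and turn it into JSON-ready records. The function name `collect_documents`, the field names (`package`, `source`, `file`, `text`), and the rule that `manual.md` is the manual while any other `.Rmd` or `.md` file counts as a vignette are assumptions for illustration, not an agreed format.

```python
from pathlib import Path


def collect_documents(root):
    """Return one record per documentation file found under root/<tool>/."""
    records = []
    for tool_dir in sorted(Path(root).iterdir()):
        if not tool_dir.is_dir():
            continue  # ignore stray files at the top level
        for doc in sorted(tool_dir.iterdir()):
            if doc.suffix.lower() not in {".rmd", ".md"}:
                continue  # only vignettes (.Rmd) and the manual (.md)
            records.append({
                "package": tool_dir.name,
                # Assumption: the manual is always stored as manual.md
                "source": "manual" if doc.name == "manual.md" else "vignette",
                "file": doc.name,
                "text": doc.read_text(encoding="utf-8"),
            })
    return records
```

The resulting list can be written out with `json.dump(records, fh)` if JSON turns out to be the preferred hand-over format.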
As described in #1, we have the full list of packages (= potential search results) in https://cran.r-project.org/view=Epidemiology. But this doesn't completely resolve the question of where the data to describe these packages to the LLM comes from. As far as I can tell, we have a couple of different options.
Package Description
All R packages include a description of what they are about, with potential references. This description can be highly variable in size but on average is around 3-4 sentences.
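For reference, this text lives in the `Description` field of each package's DESCRIPTION file, which uses a `Field: value` layout with indented continuation lines. A minimal sketch of extracting it, assuming the file has already been downloaded; the sample text is made up, and `parse_description` is a hypothetical helper that does not cover every edge case of the format.

```python
def parse_description(text):
    """Parse 'Field: value' lines, folding indented continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[0] in " \t" and current is not None:
            # Indented line: continuation of the previous field's value
            fields[current] = (fields[current] + " " + line.strip()).strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            current = key.strip()
            fields[current] = value.strip()
    return fields


# Made-up sample, not taken from a real package
SAMPLE = """Package: examplepkg
Title: An Example Package
Description: Provides tools to do something useful.
    This second sentence sits on a continuation line.
Version: 0.1.0
"""

fields = parse_description(SAMPLE)
print(fields["Description"])
```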
Example for linelist:
Example for epicontacts:
Package vignettes
Package vignettes are a longer form of documentation that introduces concepts and usage of the package via literate programming.
Example for linelist:
Examples for epicontacts:
Package manual
The package manual (PDF or HTML (https://github.com/epiverse-connect/epiverse-search/issues/2#issuecomment-2097618994)) contains a list of functions, their goal, usage, inputs and outputs, with examples. It is also somewhat more standardized by CRAN than the previously mentioned data sources.
Example for linelist: https://cran.r-project.org/web/packages/linelist/linelist.pdf
Examples for epicontacts: https://cran.r-project.org/web/packages/epicontacts/epicontacts.pdf
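Both example links follow the same pattern, so a scraper could derive the manual URL from the package name alone. A small sketch, assuming the pattern seen in these two links holds for every package on the list; `cran_manual_url` is a hypothetical helper name.

```python
def cran_manual_url(package):
    """Build the CRAN PDF manual URL, following the pattern of the links above."""
    return f"https://cran.r-project.org/web/packages/{package}/{package}.pdf"


for name in ["linelist", "epicontacts"]:
    print(cran_manual_url(name))
```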
pkgdown website
The pkgdown website is a one-stop shop for R package documentation, putting in one place the package description, README, vignettes, manual, release notes, etc. Unfortunately, not all packages have a pkgdown website.
Example for linelist: https://epiverse-trace.github.io/linelist/
Example for epicontacts: https://www.repidemicsconsortium.org/epicontacts/
A mix of different sources
It may also be possible to use all the available sources or a mix of them.
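One way to picture the mixed option is a single record per package that carries whichever sources exist and simply omits the rest. The sketch below is only an illustration: `build_package_record`, `to_json_line`, and the field names are assumptions, and the texts passed in the usage lines are placeholders rather than real package documentation.

```python
import json


def build_package_record(package, description=None, vignettes=None, manual=None):
    """Combine the available sources for one package; missing ones are omitted."""
    record = {"package": package}
    if description:
        record["description"] = description
    if vignettes:
        record["vignettes"] = list(vignettes)
    if manual:
        record["manual"] = manual
    return record


def to_json_line(record):
    """Serialize one record as a single JSON line (one package per line)."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


# Placeholder texts; a package without vignettes or a manual still gets a record
rec = build_package_record("tool1", description="placeholder description")
print(to_json_line(rec))
```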