epiverse-connect / epiverse-scraper


Consider using r-universe API for documentation dump #2

Open Bisaloo opened 3 days ago

Bisaloo commented 3 days ago

Rather than using the GitHub API.

https://epiverse-connect.r-universe.dev/api/snapshot/zip?types=docs

Options presented on https://epiverse-connect.r-universe.dev/apis
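
For illustration, here is a minimal Python sketch of fetching that snapshot and listing what it contains. The URL and the `types=docs` parameter come from the link above; the helper names (`snapshot_url`, `fetch_snapshot`, `list_members`) are hypothetical, and the layout of the archive is not assumed, only listed.

```python
import io
import urllib.parse
import urllib.request
import zipfile

# Base URL taken from the link in this issue.
BASE = "https://epiverse-connect.r-universe.dev"


def snapshot_url(types="docs", base=BASE):
    """Build the snapshot URL; 'types' mirrors the query parameter in the link."""
    query = urllib.parse.urlencode({"types": types})
    return f"{base}/api/snapshot/zip?{query}"


def fetch_snapshot(types="docs"):
    """Download the snapshot zip and return its raw bytes (needs network access)."""
    with urllib.request.urlopen(snapshot_url(types)) as response:
        return response.read()


def list_members(zip_bytes):
    """Return the file names stored in a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.namelist()
```

Calling `list_members(fetch_snapshot())` would show which files the docs dump actually ships, which is a quick way to check whether it covers everything the scraper needs.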

chartgerink commented 2 days ago

Would a set of HTML files be okay for you too, @avinashladdha? What @Bisaloo is suggesting definitely seems way more straightforward, if it includes all the information we want 😊

Bisaloo commented 2 days ago

One thing to note is that this endpoint doesn't include vignettes, so we would still need another way to collect them.

Bisaloo commented 1 day ago

One alternative along the same lines would be to download the dump of package source from r-universe and get all the relevant files locally.

Quite similar to the current process but with local operations rather than via GitHub API.
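
A rough sketch of the local step, assuming the source dump yields one `.tar.gz` per package (an assumption, not checked against the actual dump) and relying on the standard R source layout of `<package>/vignettes/` and `<package>/man/`. The function name `relevant_members` and the choice of directories are hypothetical.

```python
import io
import tarfile

# Directories of an R source package that hold documentation (illustrative choice).
WANTED_DIRS = ("vignettes", "man")


def relevant_members(tar_bytes):
    """Return {path: text} for files under vignettes/ and man/ in a package tarball."""
    found = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            parts = member.name.split("/")
            # Source tarballs unpack as <package>/<dir>/<file>.
            if member.isfile() and len(parts) >= 3 and parts[1] in WANTED_DIRS:
                raw = tar.extractfile(member).read()
                found[member.name] = raw.decode("utf-8", errors="replace")
    return found
```

Everything here runs on local bytes, so no GitHub API calls are involved once the dump has been downloaded.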

avinashladdha commented 1 day ago

HTML is okay to ingest instead of .md when calculating embeddings. A couple of points to keep in mind:

  1. Post-processing: HTML would require more post-processing. We would need to find the relevant tags and their associated text. If the text under those elements is unstructured, we need to process it before calculating embeddings. We could use a different model that is better suited to creating embeddings on HTML documents (though from what I have read, that is still not a sure thing).
  2. Dynamic content: Not sure if this would be relevant to us, but if the website uses JavaScript to generate content, getting the final rendered content might be difficult.
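
To give an idea of the post-processing in point 1, here is a minimal sketch using only Python's standard `html.parser`. The set of tags to keep and the names `TextBlocks` and `extract_blocks` are hypothetical choices for illustration, not something we have agreed on. It only sees the static HTML as served, so it does nothing for the JavaScript-rendered case in point 2.

```python
from html.parser import HTMLParser

KEEP = {"p", "li", "h1", "h2", "h3", "pre"}  # illustrative set of text-bearing tags
SKIP = {"script", "style"}                   # content we never want in embeddings


class TextBlocks(HTMLParser):
    """Collect the text found inside KEEP tags, one string per top-level block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._keep_depth = 0
        self._skip_depth = 0
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
        elif tag in KEEP:
            self._keep_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in KEEP and self._keep_depth:
            self._keep_depth -= 1
            if self._keep_depth == 0:
                # Collapse runs of whitespace and store the finished block.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.blocks.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._keep_depth and not self._skip_depth:
            self._buffer.append(data)


def extract_blocks(html):
    """Return a list of cleaned text blocks from an HTML string."""
    parser = TextBlocks()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

This relies on tags being closed properly; pages with unclosed `<p>` or `<li>` tags would need a more forgiving parser.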