astropy / learn-astropy-librarian

The content crawler that supplies Learn Astropy's web search.
BSD 3-Clause "New" or "Revised" License

Initial JupyterBook ingest feature #13

Closed jonathansick closed 3 years ago

jonathansick commented 3 years ago

This is a big PR that essentially gets Astropy Librarian to a state where we can use it to populate guide and tutorial content for the Learn Astropy front end.

The main feature is the new capability to ingest JupyterBooks, which are our primary format for writing guides.

This PR also improves the existing ingest workflow for tutorial notebooks by leveraging Pydantic models. Pydantic is like dataclasses, but with enhanced data validation and extra control over how data is exported and serialized.

The Algolia record model for both tutorials and guides is based on the same base AlgoliaRecord type, though each content type has slight differences in what "optional" attributes are available.
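As a rough illustration of that shared-base layout, here is a minimal sketch. It deliberately swaps in standard-library dataclasses for Pydantic so that it runs without dependencies, which means Pydantic's validation is not shown. Apart from the AlgoliaRecord name, every class and field name below is hypothetical and not taken from the PR.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class AlgoliaRecord:
    """Fields assumed to be common to every content type (hypothetical)."""

    object_id: str
    root_url: str
    title: str
    content: str


@dataclass
class TutorialRecord(AlgoliaRecord):
    """Tutorial records add their own optional attribute (hypothetical)."""

    keywords: Optional[List[str]] = None


@dataclass
class GuideRecord(AlgoliaRecord):
    """Guide records add a different optional attribute (hypothetical)."""

    priority: Optional[int] = None


tutorial = TutorialRecord(
    object_id="tutorial-1",
    root_url="https://example.org/tutorials/demo/",
    title="Demo tutorial",
    content="Section text.",
    keywords=["units"],
)
guide = GuideRecord(
    object_id="guide-1",
    root_url="https://example.org/guides/demo/",
    title="Demo guide",
    content="Section text.",
)

# asdict() stands in for the export step that Pydantic would provide.
tutorial_export = asdict(tutorial)
guide_export = asdict(guide)
```

Both record types can be handled wherever a base AlgoliaRecord is expected, while each export only carries the optional attribute that its own content type defines.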

This work implements a Typer-based CLI to run workflows through subcommands, such as astropylibrarian index tutorial, astropylibrarian index guide, or astropylibrarian delete. The README includes the help output from the CLI.

This PR updates the tutorial index workflow to work with a "new" HTML structure for pages where the semantic <section> element contains each section, rather than a <div class="section"> element. This new structure is appearing in our current learn.astropy.org tutorials. The old structure is still seen in the CCD processing guide. This PR enables the section iterator to handle both.
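One way to picture a section iterator that accepts both structures is the sketch below. It uses the standard-library html.parser module purely for illustration; it is not the PR's implementation, and the SectionCollector and iter_sections names are hypothetical.

```python
from html.parser import HTMLParser


class SectionCollector(HTMLParser):
    """Record every section container, whichever markup style it uses."""

    def __init__(self):
        super().__init__()
        # (style, id) tuples in document order.
        self.sections = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Attribute values can be None for valueless attributes.
        classes = (attr_map.get("class") or "").split()
        if tag == "section":
            # New structure: semantic <section> element.
            self.sections.append(("section-element", attr_map.get("id")))
        elif tag == "div" and "section" in classes:
            # Old structure: <div class="section">.
            self.sections.append(("div-class", attr_map.get("id")))


def iter_sections(html):
    """Return the section containers found in an HTML string."""
    parser = SectionCollector()
    parser.feed(html)
    parser.close()
    return parser.sections


OLD_STYLE = '<div class="section" id="intro"><h1>Intro</h1></div>'
NEW_STYLE = '<section id="intro"><h1>Intro</h1></section>'

old_sections = iter_sections(OLD_STYLE)
new_sections = iter_sections(NEW_STYLE)
```

Matching on either the tag name or the class keeps a single code path for pages built with the old and the new structure.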

Finally, this PR introduces a feature for expiring old records (which can be left behind if we re-index a guide/tutorial and that content no longer has a section, or a section changes title, etc.). This is done by creating a unique index_epoch key for each indexing event. After indexing, we look for records corresponding to that root_url that have a different index_epoch value than the records we just saved; those are the old records to expire.
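The expiry logic can be sketched with plain dictionaries standing in for index records. This is only an illustration under stated assumptions: the PR does not say how index_epoch values are generated, so the UUID below is a guess, the in-memory list replaces the real Algolia index, and the function names and example URLs are hypothetical.

```python
import uuid


def new_index_epoch():
    """Create a unique key for one indexing event (UUID is an assumption)."""
    return uuid.uuid4().hex


def find_expired(records, root_url, current_epoch):
    """Select records for root_url that were not written in the current epoch."""
    return [
        record
        for record in records
        if record["root_url"] == root_url
        and record["index_epoch"] != current_epoch
    ]


old_epoch = new_index_epoch()
current_epoch = new_index_epoch()

# Stand-in for the index contents after re-indexing one guide.
records = [
    {"objectID": "a", "root_url": "https://example.org/guide/", "index_epoch": current_epoch},
    {"objectID": "b", "root_url": "https://example.org/guide/", "index_epoch": old_epoch},
    {"objectID": "c", "root_url": "https://example.org/other/", "index_epoch": old_epoch},
]

expired = find_expired(records, "https://example.org/guide/", current_epoch)
expired_ids = [record["objectID"] for record in expired]
```

Record "c" is left alone even though its epoch is old, because it belongs to a different root_url that was not part of this indexing event.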

jonathansick commented 3 years ago

Thanks @adrn !