The website is small (fewer than 50 pages of content), so we expect minimal to no disruption to existing retrieval quality.
Technical Details
airflow/dags/ingestion/ask-astro-load-cosmos-docs.py: incremental ingestion that runs periodically
airflow/dags/ingestion/ask-astro-load.py: added extraction and ingestion of the Cosmos docs to the bulk load
airflow/include/tasks/extract/cosmos_docs.py: main file that handles extraction from the Cosmos website
Note: only the main body of each page is extracted, to minimize noise, so the parsing logic is tailored to this data source (e.g. it looks for the article tag in the HTML body)
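As a rough illustration of the "main body only" idea in the note above, here is a minimal sketch that keeps only text found inside article tags. It is not the code in cosmos_docs.py: the class and function names (ArticleTextExtractor, extract_article_text) are hypothetical, and it uses the standard library's html.parser purely to stay self-contained, whereas the real module may use a different HTML parser.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Hypothetical helper: collects text that appears inside <article> only."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0  # nesting depth of open <article> tags
        self._skip = 0   # depth of <script>/<style> tags inside the article
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._depth += 1
        elif tag in ("script", "style") and self._depth:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "article" and self._depth:
            self._depth -= 1
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep text only when inside an article and outside script/style.
        if self._depth and not self._skip:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def extract_article_text(html: str) -> str:
    """Return the text inside <article>, ignoring nav, footer, and scripts."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = (
        "<html><body><nav>Menu</nav>"
        "<article><h1>Title</h1><p>Body text.</p></article>"
        "<footer>Footer</footer></body></html>"
    )
    print(extract_article_text(page))
```

Navigation, footer, and script text are dropped, so only the page's main content would reach the ingestion step. A page with no article tag yields an empty string, which a caller could use to skip that page.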
Tests
This is what the dataframe looks like after extraction (the website is very small, so there is not much content overall)
df_dump.csv
The Airflow UI confirms that ingestion works for the bulk load
The Airflow UI confirms that incremental ingestion runs successfully
Retrieval Quality Evaluation
Existing questions show no quality degradation
New questions specific to Cosmos are answered correctly (see the CSV below; note that the ordering of the references in this CSV is not the same as in the actual result)
cosmos_ingest_1.csv
closes #277