astronomer / ask-astro

An end-to-end LLM reference implementation providing a Q&A interface for Airflow and Astronomer
https://ask.astronomer.io/
Apache License 2.0

Ingest rst does not follow included references for extract #107

Closed mpgreg closed 9 months ago

mpgreg commented 10 months ago

extract_github_rst() does not follow includes or references to other rst docs. As a result, much of the Airflow docs content is either not ingested or cannot be linked to the correct page.

https://github.com/astronomer/ask-astro/blob/c45487c7f12a9424dbe885580c687e35e30b7de4/airflow/dags/ingestion/ask-astro-load-github.py#L46C10-L46C10

Need to ingest from a scrape of the Airflow docs HTML pages instead: https://airflow.apache.org/docs/

mpgreg commented 10 months ago

Also need code to recursively walk the docs pages and extract sub-pages too. Need HTML splitter code to split on h2 headings.
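
A rough sketch of what the walk-and-split step might look like, using only the Python standard library. This is not the ask-astro implementation: the names `H2Splitter`, `split_on_h2`, and `walk_docs`, and the `fetch` callable that maps a URL to its HTML text, are all hypothetical. It walks pages with an explicit stack rather than literal recursion, and it does not filter out `<script>` or `<style>` content, which a real ingest would probably want to do.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class H2Splitter(HTMLParser):
    """Collect page text into sections, starting a new section at each <h2>."""

    def __init__(self):
        super().__init__()
        # The first section holds any text that appears before the first <h2>.
        self.sections = [{"heading": "", "text": []}]
        self.links = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.sections.append({"heading": "", "text": []})
            self._in_h2 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.sections[-1]["heading"] += data
        elif data.strip():
            self.sections[-1]["text"].append(data.strip())


def split_on_h2(html):
    """Return (sections, links); each section is a (heading, text) tuple."""
    parser = H2Splitter()
    parser.feed(html)
    parser.close()
    sections = [(s["heading"].strip(), " ".join(s["text"])) for s in parser.sections]
    # Drop the leading section if the page had nothing before its first <h2>.
    return [s for s in sections if s[0] or s[1]], parser.links


def walk_docs(base_url, fetch):
    """Visit every page reachable under base_url and split each one on <h2>."""
    seen, stack, docs = set(), [base_url], {}
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        sections, links = split_on_h2(fetch(url))
        docs[url] = sections
        for href in links:
            # Resolve relative links and drop #fragments so pages are visited once.
            target = urldefrag(urljoin(url, href)).url
            if target.startswith(base_url) and target not in seen:
                stack.append(target)
    return docs
```

In practice `fetch` could wrap an HTTP client call that returns the response body as text; passing it in as a parameter keeps the walker testable against an in-memory dict of pages. Restricting the walk to URLs that start with `base_url` is one simple way to stay inside the docs site.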

sunank200 commented 10 months ago

Note:

pankajastro commented 9 months ago

@sunank200 @mpgreg AFAIK we generate the HTML from the rst docs. Since we are ingesting the HTML docs, why do we need rst too? Or am I missing something here?

mpgreg commented 9 months ago

Yes, this issue was meant to be closed if/when we change to html ingest.

pankajastro commented 9 months ago

> Yes, this issue was meant to be closed if/when we change to html ingest.

cc: @sunank200 @phanikumv

phanikumv commented 9 months ago

Closing as discussed with Pankaj and Ankit in the sprint planning call.