astronomer / ask-astro

An end-to-end LLM reference implementation providing a Q&A interface for Airflow and Astronomer
https://ask.astronomer.io/
Apache License 2.0
196 stars, 47 forks

Need to consolidate to a single HTML ingest #164

Status: Closed (mpgreg closed this issue 10 months ago)

mpgreg commented 12 months ago

Please describe the feature you'd like to see

Multiple extract functions use almost identical HTML extract logic.

Describe the solution you'd like

We should consolidate these into a single function if possible and use dynamic task mapping, as the GitHub extract does.
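To illustrate what a single shared extract function could look like, here is a minimal sketch using only the Python standard library. The names `LinkCollector` and `extract_links` are hypothetical and are not taken from the ask-astro code base; the project's actual extract functions may fetch and parse pages quite differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag seen while parsing."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str, base_url: str) -> list[str]:
    """Return absolute, de-duplicated links that live under base_url.

    Hypothetical shared helper: every docs source would call this one
    function instead of carrying its own near-identical copy.
    """
    parser = LinkCollector()
    parser.feed(html)
    seen: set[str] = set()
    links: list[str] = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        # Keep only links inside the docs site, preserving page order.
        if absolute.startswith(base_url) and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

In this sketch the function only depends on the page HTML and a base URL, so the same code could in principle be called once per docs source rather than duplicated.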

Are there any alternatives to this feature?

Additional context

Acceptance Criteria

Note:

sunank200 commented 11 months ago

Discussed with @pankajastro - a single HTML extractor makes sense.

pankajastro commented 11 months ago

Just adding more here: a single extractor makes sense, but we still need a thin layer over it for the different sources, because different sources need different cleanup approaches. For example, for the Astro SDK I'm excluding docs URLs that contain "autoapi", "genindex.html", "py-modindex.html", ".md", or ".py", but for the provider I'm excluding "_api", "_modules", "_sources", "changelog.html", "genindex.html", "py-modindex.html", and "#". So at least a different task per source makes sense to me, and if we keep a different DAG per source it may be easier to run, i.e. we can upsert only the source we want.
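The "thin layer" idea can be sketched as per-source configuration passed to one shared filter. This is only an illustration: the exclusion strings are copied from the comment above, but the source keys (`astro_sdk`, `provider`), the function name `filter_urls`, and the use of plain substring matching are assumptions, not the project's actual implementation.

```python
# Per-source URL fragments to skip, copied from the comment above.
# The keys "astro_sdk" and "provider" are hypothetical labels.
EXCLUDE_PATTERNS: dict[str, tuple[str, ...]] = {
    "astro_sdk": ("autoapi", "genindex.html", "py-modindex.html", ".md", ".py"),
    "provider": (
        "_api",
        "_modules",
        "_sources",
        "changelog.html",
        "genindex.html",
        "py-modindex.html",
        "#",
    ),
}


def filter_urls(urls: list[str], source: str) -> list[str]:
    """Drop every URL containing one of the source's excluded fragments.

    Assumes substring matching, so a pattern like ".py" also matches
    anywhere in the URL, not only at the end.
    """
    patterns = EXCLUDE_PATTERNS[source]
    return [url for url in urls if not any(p in url for p in patterns)]


urls = [
    "https://docs.example.com/guides/index.html",
    "https://docs.example.com/autoapi/index.html",
    "https://docs.example.com/_modules/x.html",
    "https://docs.example.com/page.html#section",
]
sdk_urls = filter_urls(urls, "astro_sdk")
provider_urls = filter_urls(urls, "provider")
```

With this shape, adding a new docs source would mean adding one dictionary entry rather than another extract function, and each source could still run as its own task or DAG.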

phanikumv commented 10 months ago

Created a draft PR for this one.

pankajastro commented 10 months ago

Just marked the PR as ready for review; would appreciate a review.