astronomer / ask-astro

An end-to-end LLM reference implementation providing a Q&A interface for Airflow and Astronomer
https://ask.astronomer.io/
Apache License 2.0
192 stars, 47 forks

Add ingest for stack overflow #126

Closed mpgreg closed 9 months ago

mpgreg commented 10 months ago

Please describe the feature you'd like to see

The current ingest only includes Stack Overflow content from a point-in-time capture (dated mid-September 2023) of the archive at https://archive.org/details/stackexchange. Initial discussion raised questions about Stack Overflow's terms of use.

Further discussion suggests the general argument is that the content is under a CC license, which requires attribution, and that if you fine-tune a model on Stack Overflow data, attribution gets tricky. I guess for RAG it's fine? https://news.ycombinator.com/item?id=35647257

Describe the solution you'd like

Add an extract function that uses the Stack Exchange API to pull the latest Stack Overflow posts.
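For illustration, a minimal sketch of what such an extract step could look like. It is not the code from the linked commit. It assumes the public `/2.3/questions` endpoint with the built-in `withbody` filter. The helper names `build_questions_url` and `to_document`, and the choice of the `airflow` tag, are hypothetical. The `link` field is kept on each document so that answers can cite the original post.

```python
from urllib.parse import urlencode

# Assumption: version 2.3 of the public Stack Exchange API.
API_ROOT = "https://api.stackexchange.com/2.3"


def build_questions_url(tag, from_date, page=1, page_size=100):
    """Build a request URL for questions with `tag` created at or after `from_date`.

    `from_date` is a Unix timestamp in seconds; the API caps `pagesize` at 100.
    """
    params = {
        "site": "stackoverflow",
        "tagged": tag,
        "fromdate": int(from_date),
        "order": "asc",
        "sort": "creation",
        "page": page,
        "pagesize": page_size,
        "filter": "withbody",  # built-in filter that adds the post body
    }
    return f"{API_ROOT}/questions?{urlencode(params)}"


def to_document(item):
    """Reduce one API item to the fields a retrieval index would need."""
    return {
        "question_id": item["question_id"],
        "title": item["title"],
        "body": item.get("body", ""),
        "link": item["link"],  # kept for attribution
        "creation_date": item["creation_date"],
    }
```

Passing the cutoff date of the archive capture as `from_date` would limit the request to posts newer than the snapshot.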

Are there any alternatives to this feature?

Stay with the archive only. But that means retrieval will degrade a bit over time as the snapshot ages.


mpgreg commented 10 months ago

Fix is at https://github.com/astronomer/ask-astro/commit/47933b52dc8da905fdbb4fe1627d35f1254a98a7 and ready for PR.

mpgreg commented 10 months ago

#132 adds logic for the Stack Exchange API extract but should be left open until we can figure out why it only grabs 200 posts. It appears to be an API rate limit of some sort, but the documentation is unclear.
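The cause of the 200-post cap is not established in this thread. One thing worth ruling out is pagination and throttling: the API returns at most 100 items per page, and its response wrapper carries `has_more`, `quota_remaining`, and an optional `backoff` field (seconds to wait before the next request). Below is a minimal sketch of a loop that honors those fields. `fetch_all` and the injected `fetch_page` callable are hypothetical names, not part of the repository.

```python
import time


def fetch_all(fetch_page, max_pages=1000, sleep=time.sleep):
    """Collect items from every page until the API reports no more.

    `fetch_page(page)` is a caller-supplied function that returns the decoded
    JSON wrapper for one page (a dict with `items`, `has_more`, and so on).
    """
    items = []
    page = 1
    while page <= max_pages:
        payload = fetch_page(page)
        items.extend(payload.get("items", []))

        if not payload.get("has_more"):
            break  # last page reached
        if payload.get("quota_remaining", 1) <= 0:
            break  # request quota exhausted; stop instead of erroring

        backoff = payload.get("backoff")
        if backoff:
            sleep(backoff)  # the API asked us to wait this many seconds
        page += 1
    return items
```

Logging `quota_remaining` and `has_more` from the last response would show whether the run stopped because of the quota, a missing page increment, or a real end of results.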

Lee-W commented 10 months ago

https://api.stackexchange.com/docs/types/post
https://api.stackexchange.com/docs/comments

sunank200 commented 10 months ago

Note:

vatsrahul1001 commented 9 months ago

@Lee-W Can we initiate QA with the test bot if it's pointed to the correct db?