refactor(airflow): refactor extract_astro_blogs method

astronomer / ask-astro

An end-to-end LLM reference implementation providing a Q&A interface for Airflow and Astronomer

https://ask.astronomer.io/

Apache License 2.0

192 stars 47 forks source link

refactor(airflow): refactor extract_astro_blogs method #176

Closed Lee-W closed 9 months ago

Lee-W commented 10 months ago

When parsing links, parse the article tag to get the date to filter. Break the loop if any dates are filtered in an iteration.

closes: #116

cloudflare-workers-and-pages[bot] commented 10 months ago

Deploying with Cloudflare Pages

Latest commit:	`273e6dc`
Status:	✅ Deploy successful!
Preview URL:	https://a2fd9648.ask-astro.pages.dev
Branch Preview URL:	https://refactor-blog-extraction.ask-astro.pages.dev

View logs

Lee-W commented 10 months ago

I ran these function and compare the results. Let me run the tests on DAG as well

Lee-W commented 10 months ago

@sunank200 I've tested with local DAGs

sunank200 commented 10 months ago

@sunank200 I've tested with local DAGs

@Lee-W For each of these changes in ingestion. We need to do the following:

Ingest the data to the dev database with a change in ingestion code.
Request @vatsrahul1001 to check if the retrieval performance as degraded or not.

Lee-W commented 10 months ago

Ingest the data to the dev database with a change in ingestion code.

May I know where is the dev database and dev airflow environment we can ingest? also what is the change you mentioned? do you mean the code change in this PR?

mpgreg commented 10 months ago

One thing to think about here... The current code uses source-specific extract. The extract function is specific to astro blogs. Alternatively the code at https://github.com/mpgreg/ask-astro/blob/main/airflow/dags/ingestion/ask-astro-load-html.py has one extract function which scrapes any webpage given some parameters. I think it will generalize pretty well but have not really tested. The main problem for astro blogs is there is no generalizable way to specify a cutoff date.

Perhaps that is okay. We can check with Juliana if there is some specific reason we don't want to include older blogs.

Lee-W commented 10 months ago

One thing to think about here... The current code uses source-specific extract. The extract function is specific to astro blogs. Alternatively the code at https://github.com/mpgreg/ask-astro/blob/main/airflow/dags/ingestion/ask-astro-load-html.py has one extract function which scrapes any webpage given some parameters. I think it will generalize pretty well but have not really tested. The main problem for astro blogs is there is no generalizable way to specify a cutoff date.

Perhaps that is okay. We can check with Juliana if there is some specific reason we don't want to include older blogs.

I guess the reason behind it might be the same as what we did on StackOverflow? But I'm ok with it as well. I'll hold the work on this PR. Till we have a decision

sunank200 commented 10 months ago

One thing to think about here... The current code uses source-specific extract. The extract function is specific to astro blogs. Alternatively the code at https://github.com/mpgreg/ask-astro/blob/main/airflow/dags/ingestion/ask-astro-load-html.py has one extract function which scrapes any webpage given some parameters. I think it will generalize pretty well but have not really tested. The main problem for astro blogs is there is no generalizable way to specify a cutoff date. Perhaps that is okay. We can check with Juliana if there is some specific reason we don't want to include older blogs.

I guess the reason behind it might be the same as what we did on StackOverflow? But I'm ok with it as well. I'll hold the work on this PR. Till we have a decision

@Lee-W can you setup a call for this please and lets have a decision on this? After that this change should be tested with fresh ingestion on dev database.

Lee-W commented 10 months ago

One thing to think about here... The current code uses source-specific extract. The extract function is specific to astro blogs. Alternatively the code at https://github.com/mpgreg/ask-astro/blob/main/airflow/dags/ingestion/ask-astro-load-html.py has one extract function which scrapes any webpage given some parameters. I think it will generalize pretty well but have not really tested. The main problem for astro blogs is there is no generalizable way to specify a cutoff date. Perhaps that is okay. We can check with Juliana if there is some specific reason we don't want to include older blogs.

I guess the reason behind it might be the same as what we did on StackOverflow? But I'm ok with it as well. I'll hold the work on this PR. Till we have a decision

@Lee-W can you setup a call for this please and lets have a decision on this? After that this change should be tested with fresh ingestion on dev database.

I just sent a message to the channel for discussion. But IMHO, this might not even be directly part of the PR.

Lee-W commented 10 months ago

as discussed earlier, we still need this cutoff date. will proceeded testing on this one

Lee-W commented 10 months ago

I changed the host and pointed to the dev WEAVIATE. @vatsrahul1001 Could you please help with the testing? Thanks!

Lee-W commented 10 months ago

@vatsrahul1001 We no longer need to test it on cloud. Thanks!

@sunank200 I've tested it and send you the result. Could you please take a look? Thanks!