Closed Lee-W closed 9 months ago
Latest commit: |
273e6dc
|
Status: | ✅ Deploy successful! |
Preview URL: | https://a2fd9648.ask-astro.pages.dev |
Branch Preview URL: | https://refactor-blog-extraction.ask-astro.pages.dev |
I ran these function and compare the results. Let me run the tests on DAG as well
@sunank200 I've tested with local DAGs
@sunank200 I've tested with local DAGs
@Lee-W For each of these changes in ingestion. We need to do the following:
Ingest the data to the dev database with a change in ingestion code.
May I know where is the dev database and dev airflow environment we can ingest? also what is the change you mentioned? do you mean the code change in this PR?
One thing to think about here... The current code uses source-specific extract. The extract function is specific to astro blogs. Alternatively the code at https://github.com/mpgreg/ask-astro/blob/main/airflow/dags/ingestion/ask-astro-load-html.py has one extract function which scrapes any webpage given some parameters. I think it will generalize pretty well but have not really tested. The main problem for astro blogs is there is no generalizable way to specify a cutoff date.
Perhaps that is okay. We can check with Juliana if there is some specific reason we don't want to include older blogs.
One thing to think about here... The current code uses source-specific extract. The extract function is specific to astro blogs. Alternatively the code at https://github.com/mpgreg/ask-astro/blob/main/airflow/dags/ingestion/ask-astro-load-html.py has one extract function which scrapes any webpage given some parameters. I think it will generalize pretty well but have not really tested. The main problem for astro blogs is there is no generalizable way to specify a cutoff date.
Perhaps that is okay. We can check with Juliana if there is some specific reason we don't want to include older blogs.
I guess the reason behind it might be the same as what we did on StackOverflow? But I'm ok with it as well. I'll hold the work on this PR. Till we have a decision
One thing to think about here... The current code uses source-specific extract. The extract function is specific to astro blogs. Alternatively the code at https://github.com/mpgreg/ask-astro/blob/main/airflow/dags/ingestion/ask-astro-load-html.py has one extract function which scrapes any webpage given some parameters. I think it will generalize pretty well but have not really tested. The main problem for astro blogs is there is no generalizable way to specify a cutoff date. Perhaps that is okay. We can check with Juliana if there is some specific reason we don't want to include older blogs.
I guess the reason behind it might be the same as what we did on StackOverflow? But I'm ok with it as well. I'll hold the work on this PR. Till we have a decision
@Lee-W can you setup a call for this please and lets have a decision on this? After that this change should be tested with fresh ingestion on dev database.
One thing to think about here... The current code uses source-specific extract. The extract function is specific to astro blogs. Alternatively the code at https://github.com/mpgreg/ask-astro/blob/main/airflow/dags/ingestion/ask-astro-load-html.py has one extract function which scrapes any webpage given some parameters. I think it will generalize pretty well but have not really tested. The main problem for astro blogs is there is no generalizable way to specify a cutoff date. Perhaps that is okay. We can check with Juliana if there is some specific reason we don't want to include older blogs.
I guess the reason behind it might be the same as what we did on StackOverflow? But I'm ok with it as well. I'll hold the work on this PR. Till we have a decision
@Lee-W can you setup a call for this please and lets have a decision on this? After that this change should be tested with fresh ingestion on dev database.
I just sent a message to the channel for discussion. But IMHO, this might not even be directly part of the PR.
as discussed earlier, we still need this cutoff date. will proceeded testing on this one
I changed the host and pointed to the dev WEAVIATE. @vatsrahul1001 Could you please help with the testing? Thanks!
@vatsrahul1001 We no longer need to test it on cloud. Thanks!
@sunank200 I've tested it and send you the result. Could you please take a look? Thanks!
When parsing links, parse the article tag to get the date to filter. Break the loop if any dates are filtered in an iteration.
closes: #116