Page Indexing - Githubissues

Softdev1 commented 1 year ago

https://flywheeloutput.com/ needs to be indexed properly, and for IQ-GPT as well.

Yadheedhya06 commented 1 year ago

Scrape Flywheel and store the data in the DB, then use it later to answer queries with IQ GPT. The code should be scalable so we can scrape other Substack sites in the future.

kesar commented 1 year ago

Consider using LangChain as part of the whole stack: https://python.langchain.com/en/latest/modules/indexes/document_loaders.html

https://flywheeloutput.com/sitemap.xml

https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/sitemap.html

s-1-n-t-h commented 1 year ago

Scraped all Flywheel articles. Found that bs4 returns better, cleaner content from the sites than the LangChain document loader.

observed at: https://github.com/EveripediaNetwork/iq-search/blob/main/Notebook%20to%20update%20supabase/scraping%20flywheel/scraping%20test%20for%20flywheel.ipynb

scraping notebook using bs4: https://github.com/EveripediaNetwork/iq-search/blob/main/Notebook%20to%20update%20supabase/scraping%20flywheel/scraping%20whole%20flywheel%20articles.ipynb

dataset: https://github.com/EveripediaNetwork/iq-search/blob/main/Notebook%20to%20update%20supabase/scraping%20flywheel/data/flywheel.csv

Softdev1 commented 1 year ago

@Yadheedhya06 Please update this page with the current progress of this issue 👍

s-1-n-t-h commented 1 year ago

csv for converting to embeddings: https://github.com/EveripediaNetwork/iq-search/blob/main/Notebook%20to%20update%20supabase/pre%20processing%20articles/flywheel_data.csv

kesar commented 1 year ago

This task has been in progress for 2 weeks. Let's split it into small deliverables to keep track of progress 👍🏻

Yadheedhya06 commented 1 year ago

We had already made progress in Jupyter notebooks, but we have now refactored the code into modules and completed the indexing work: https://github.com/EveripediaNetwork/iq-search-ingestor. What's left is updating the database with the scraped content.

Yadheedhya06 commented 1 year ago

Finished the scraping task for Flywheel using sitemaps, and made it scalable so other sites can be scraped in future integrations.

This detects the URLs that were edited or created on a given day - https://github.com/EveripediaNetwork/iq-search-ingestor/blob/main/src/dataProcessing/fetcher/detectNewUrls.py
This is the content scraper for a URL - https://github.com/EveripediaNetwork/iq-search-ingestor/blob/main/src/dataProcessing/fetcher/contentScraper.py
This returns all the URLs listed in the sitemap URL of any website - https://github.com/EveripediaNetwork/iq-search-ingestor/blob/main/src/dataProcessing/fetcher/scrapeSitemapXmlUrls.py
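
To illustrate the sitemap-based approach described above, here is a minimal sketch. It is not the code in iq-search-ingestor. The function names `parse_sitemap` and `urls_modified_on`, the sample XML, and the assumption that the sitemap carries `<lastmod>` entries are all made up for this example. It parses an inline string rather than fetching a real sitemap.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample data, standing in for a downloaded sitemap.xml.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/p/first-post</loc><lastmod>2023-05-01</lastmod></url>
  <url><loc>https://example.com/p/second-post</loc><lastmod>2023-05-02T08:30:00+00:00</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return (url, lastmod date or None) pairs from a sitemap XML string."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.findall("sm:url", SITEMAP_NS):
        loc = node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = node.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        # lastmod may be a plain date or a full timestamp;
        # in both cases the first 10 characters are YYYY-MM-DD.
        modified = date.fromisoformat(lastmod[:10]) if lastmod else None
        entries.append((loc, modified))
    return entries


def urls_modified_on(entries, day):
    """Keep only the URLs whose lastmod falls on the given day."""
    return [url for url, modified in entries if modified == day]


entries = parse_sitemap(SAMPLE)
new_urls = urls_modified_on(entries, date(2023, 5, 2))
```

A daily cron run could call `urls_modified_on(entries, date.today())` and pass the result to the content scraper. Entries without `<lastmod>` are never reported as new in this sketch, so a real implementation would need a fallback for them.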

@kesar please review, and if it's good we can close this issue. The DB update task was created as issue #1222.

kesar commented 1 year ago

@Softdev1 split tasks, close this one and assign to devs pls

> @kesar please review and if it's good we can close this issue. Updating DB task made in issue https://github.com/EveripediaNetwork/issues/issues/1222

I don't see a very granular split of tasks, and I don't see any task that yaswanth is working on. I think there are many tasks that need to be done to get this working in production that are not in 1222 (which is again pretty generic and big).

For example: a task for cron, a task for updates, a task to convert to embeddings, a task for Flywheel, etc.

Also, #1222 is in ready to work, and I think you are already working on it (based on your dailies), so it should be in progress.

To see progress, we shouldn't create tasks that stay in the kanban for more than 2-3 days. If a task is in the kanban for longer, it means we did a bad job of splitting issues (like this one, which has been there for 2 weeks) 👍🏻

Yadheedhya06 commented 1 year ago

@Softdev1 @kesar Maybe only code refactoring is left. Embedding generation and DB updates are already done. Cron integration is in progress and will be done by today (Friday).

DB update - https://github.com/EveripediaNetwork/iq-search-ingestor/blob/main/src/utils/updateSupabase.py
Embedding generation - https://github.com/EveripediaNetwork/iq-search-ingestor/blob/main/src/utils/generateEmbedding.py

We still have to integrate these modules in main according to the workflow; it won't take much time, I guess - https://github.com/EveripediaNetwork/iq-search-ingestor/blob/main/src/main.py

@s-1-n-t-h can work on code refactoring, improving the internal structure of the code, and bringing the present scripts in line with PEP 8. In parallel, I will work on cron job integration and on documentation/README for the splitting mechanisms and the query-answering mechanisms of IQ SEARCH.
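
The integration of the fetcher, embedding, and DB-update modules mentioned above could look roughly like the sketch below. This is an assumption about the workflow, not the contents of main.py: `run_ingestion` and its four callable parameters are hypothetical names, and the steps are replaced with stand-ins so the example runs without network or API access.

```python
from typing import Callable, Iterable, List


def run_ingestion(
    detect_new_urls: Callable[[], Iterable[str]],
    scrape_content: Callable[[str], str],
    generate_embedding: Callable[[str], List[float]],
    update_db: Callable[[str, str, List[float]], None],
) -> List[str]:
    """Run one ingestion pass and return the URLs that were stored."""
    stored = []
    for url in detect_new_urls():
        text = scrape_content(url)
        if not text.strip():
            # Skip pages where the scraper extracted nothing useful.
            continue
        update_db(url, text, generate_embedding(text))
        stored.append(url)
    return stored


# Stand-in steps (hypothetical): a dict plays the role of the database,
# and the "embedding" is just the text length.
fake_db = {}
stored = run_ingestion(
    detect_new_urls=lambda: ["https://example.com/a", "https://example.com/empty"],
    scrape_content=lambda url: "" if url.endswith("empty") else f"text of {url}",
    generate_embedding=lambda text: [float(len(text))],
    update_db=lambda url, text, emb: fake_db.__setitem__(url, (text, emb)),
)
```

Passing the steps in as callables keeps each module independently testable, and a cron job only needs to call `run_ingestion` once per run.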

kesar commented 1 year ago

Is Flywheel already in our DB?

Yadheedhya06 commented 1 year ago

We are thinking of discarding the old tables in the DB. Those tables don't have the latest wikis, and we didn't update wikis that were edited after the tables were created. So we can discard the old tables and insert the wikis fresh, this time with the Flywheel data as well. In parallel, we will integrate the cron job with the ingestor repo so that we don't miss any wiki edits or uploads.
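
One way to keep a table current when a cron job re-ingests edited pages is an upsert keyed on the URL, sketched below. This is only an illustration: it uses an in-memory SQLite database as a stand-in for the Supabase (Postgres) tables, and the `articles` table, its columns, and `upsert_article` are hypothetical names.

```python
import sqlite3


def upsert_article(conn, url, content):
    """Insert a row for url, or overwrite its content if the url already exists."""
    conn.execute(
        "INSERT INTO articles (url, content) VALUES (?, ?) "
        "ON CONFLICT(url) DO UPDATE SET content = excluded.content",
        (url, content),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (url TEXT PRIMARY KEY, content TEXT NOT NULL)")

# The second call simulates the cron job picking up an edit to the same page.
upsert_article(conn, "https://example.com/a", "first version")
upsert_article(conn, "https://example.com/a", "edited version")
upsert_article(conn, "https://example.com/b", "another article")

rows = conn.execute("SELECT url, content FROM articles ORDER BY url").fetchall()
```

Because repeated runs overwrite rather than duplicate rows, the same job can handle both the initial fresh insert and later edits. Postgres accepts the same `ON CONFLICT ... DO UPDATE` form.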

kesar commented 1 year ago

Why are there no issues for all the stuff you are mentioning?

Yadheedhya06 commented 1 year ago

We thought we should wait until the ingestor repo is ready with the workflow and automation to update the DB with fresh wikis and Flywheel content. Even if we discard the DB and re-insert the data, we would miss the edits and new Flywheel articles made in the meantime. The cron job and the ingestor workflow are now ready, so we created #1243 and are working on re-inserting the wikis into the DB and updating the pg function.

kesar commented 1 year ago

Cool, yeah, we can wait, but what I meant is to create the issues so we can see what's left and the progress 👍

No need to do everything at the same time, but without proper tasks this is too chaotic and delivery takes longer than expected.

For example, we talked about getting Flywheel in 2 or 3 weeks ago already, and you are 2 people working full-time on this.

Yadheedhya06 commented 1 year ago

Yes, I think we are good now and almost ready with automatic DB updates; the DB will be ready by tomorrow morning. Though we will later have to create an issue for code refactoring in the ingestor repo without disturbing the workflow automation.

EveripediaNetwork / issues

Page Indexing #1177