dodeeric / langchain-ai-assistant-with-hybrid-rag

This is an LLM chatbot coded with LangChain. The web interface is coded with Streamlit. It implements a hybrid RAG (keyword and semantic search) and chat memory.
https://bmae-ragai-webapp.azurewebsites.net
GNU General Public License v3.0

scrape all bmae urls with admin interface once refactoring of json file names done #83

Open dodeeric opened 2 weeks ago

dodeeric commented 2 weeks ago

get the urls:

  1. commons categories: easy to list (list of commons-Category json files)
  2. balat & belgica: xxxx-urls-dsX.txt
  3. europeana: retrieve url field for each json file
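A minimal Python sketch of the "retrieve url field for each json file" step, in case it ends up in the admin interface rather than in the shell. It assumes each JSON file holds an array of objects with a `url` key (which is what `jq '.[].url'` expects); the function name `collect_urls` and its parameters are hypothetical.

```python
import json
from pathlib import Path


def collect_urls(json_dir, pattern="*.json"):
    """Return the `url` field of every item in every matching JSON file."""
    urls = []
    # Sort the paths so the output order is stable from run to run.
    for path in sorted(Path(json_dir).glob(pattern)):
        with path.open(encoding="utf-8") as f:
            items = json.load(f)
        # Assumption: each file is a JSON array of objects.
        for item in items:
            url = item.get("url")
            if url:  # skip items with a missing or empty url
                urls.append(url)
    return urls
```

For the europeana files this could be called as `collect_urls(".", "www.europeana*.json")`; `len(...)` of the result plays the role of `wc -l`.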

dodeeric commented 2 weeks ago

jq '.[].url' commons-Category_Engravings_by_Dodeeric-swp.json

dodeeric commented 2 weeks ago

$ jq '.[].url' *.json

$ jq '.[].url' *.json | wc -l
1985


A) for commons: scrape by Category

==> list all the categories:

ls | grep commons-Category | cut -c18-170 | cut -d"." -f1 | sed "s/-swp//g" > ../../list-commons-categories.res
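The same file-name-to-category transformation can be sketched in Python. The prefix `commons-Category_` is 17 characters long, which is why the pipeline above starts cutting at column 18. The function name `category_from_filename` is hypothetical, and the sketch drops the upper bound of 170 characters used by `cut`.

```python
from pathlib import Path

PREFIX = "commons-Category_"  # 17 characters, so the category starts at column 18


def category_from_filename(filename):
    """Return the Commons category encoded in a JSON file name, or None."""
    name = Path(filename).name
    if not name.startswith(PREFIX):
        return None                   # like: grep commons-Category
    name = name[len(PREFIX):]         # like: cut -c18-
    name = name.split(".", 1)[0]      # like: cut -d"." -f1
    return name.replace("-swp", "")   # like: sed "s/-swp//g"


print(category_from_filename("commons-Category_Engravings_by_Dodeeric-swp.json"))
# prints: Engravings_by_Dodeeric
```

As with `cut -d"." -f1`, a category name that itself contains a dot would be truncated at that dot.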

B) for all the other urls: scrape by URLs

list of urls per type:

==> jq '.[].url' balat.json | cut -d'"' -f2 ===> 230 urls
==> jq '.[].url' belgica.json | cut -d'"' -f2 ===> 52 urls
==> jq '.[].url' www.europeana*.json | cut -d'"' -f2 ===> 234 urls

dodeeric commented 1 week ago

$ jq '.[].url' balat.json | wc -l
230
$ jq '.[].url' belgica.json | wc -l
52
$ jq '.[].url' www.euro*.json | wc -l
234
$ jq '.[].url' commons*.json | wc -l
1467

===

Total: 1983 (1985 in the admin list because there are two opac files: we can omit them)
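A quick check of the arithmetic, using only the counts reported above (the dictionary keys are just labels for the four sources):

```python
# URL counts per source, as reported by the jq | wc -l commands above.
counts = {"balat": 230, "belgica": 52, "europeana": 234, "commons": 1467}

total = sum(counts.values())
admin_total = 1985  # number of web pages shown in the admin list

print(total)                # prints: 1983
print(admin_total - total)  # prints: 2 (the two opac files)
```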

dodeeric commented 2 days ago

Number of JSON files: 343

Number of web pages: 1985

Number of PDF files: 1

Number of PDF pages: 2

Number of web and pdf pages: 1987

The Chroma vector DB is located on myvm2.edocloud.be:8000.

DB size: 49800.0 KB

dodeeric commented 2 days ago

-rw-rw-r-- 1 dodeeric dodeeric  4340 Jun 28 14:27 list-commons-categories.res ==> 104 categories
-rw-rw-r-- 1 dodeeric dodeeric 15856 Jun 28 14:32 list-europeana-urls.res
-rw-rw-r-- 1 dodeeric dodeeric  9355 Jun 28 14:31 list-irpa-urls.res
-rw-rw-r-- 1 dodeeric dodeeric  2600 Jun 28 14:31 list-kbr-urls.res