cisnlp / GlotWeb

GlotWeb: Web Indexing for Low-Resource Languages -- under construction.
https://cis-lmu-glotweb.hf.space
Creative Commons Zero v1.0 Universal

Add to links active/inactive flag #9

Open chaoSefat opened 1 month ago

chaoSefat commented 1 month ago
  1. Remove self links from the links to scrape (all_links) and build a dictionary of the scraped links. Add active: 1 to each and also save the engines.
  2. Convert all_data to a dictionary keyed by 'link' for fast lookup. Then, for each entry: if 'common-crawl' is in engines, save 'cc-text' and the engines, and record whether the link is active (active: 1 if it is among the active links, otherwise active: 0); else, scrape it with trafilatura and filter.

Then get the info for each entry: check whether its category is commoncrawl and whether it has a snippet.
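
Step 2 above could be sketched roughly as follows. This is only an illustration of the lookup-and-flag idea, not the GlotWeb implementation: the keys 'link', 'engines', 'cc-text' and 'active' come from the notes above, while the function names, the 'needs_scrape' key, the active_links set and the sample entries are assumptions made for the example. The trafilatura scrape itself is left out; entries that need it are only marked.

```python
def index_by_link(all_data):
    """Re-key a list of entry dicts by their 'link' field for O(1) lookup."""
    return {entry["link"]: entry for entry in all_data}


def flag_entries(all_data, active_links):
    """Return a link-keyed dict with an 'active' flag on Common Crawl entries.

    Entries without 'common-crawl' in their engines get the hypothetical
    'needs_scrape' key so that a later step can fetch and filter them.
    """
    flagged = {}
    for link, entry in index_by_link(all_data).items():
        record = {"link": link, "engines": entry.get("engines", [])}
        if "common-crawl" in record["engines"]:
            record["cc-text"] = entry.get("cc-text", "")
            record["active"] = 1 if link in active_links else 0
        else:
            record["needs_scrape"] = True
        flagged[link] = record
    return flagged


# Hypothetical sample data, for illustration only.
sample = [
    {"link": "https://example.org/a", "engines": ["common-crawl"], "cc-text": "text a"},
    {"link": "https://example.org/b", "engines": ["common-crawl"], "cc-text": "text b"},
    {"link": "https://example.org/c", "engines": ["other-engine"]},
]
flagged = flag_entries(sample, active_links={"https://example.org/a"})
print(flagged["https://example.org/a"]["active"])
print(flagged["https://example.org/b"]["active"])
```

Keying by link means each later check is a single dictionary access instead of a scan over the whole list.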

chaoSefat commented 1 month ago
import requests


def check_urls(url_list):
    """Return one {"link": url, "active": 0 or 1} dict per URL."""
    result = []

    for url in url_list:
        try:
            response = requests.get(url, timeout=5)
            # Only a 200 response (after redirects) counts as active.
            active = 1 if response.status_code == 200 else 0
        except requests.RequestException:
            # Timeouts, connection errors, and invalid URLs count as inactive.
            active = 0
        result.append({"link": url, "active": active})

    return result