chatnoir-eu / web-content-extraction-benchmark

Web Content Extraction Benchmark
Apache License 2.0

Dataset Download Failed #2

Open Aaawahe opened 10 months ago

Aaawahe commented 10 months ago

Hi, I am very impressed with the work you have done, particularly the organization of the dataset you provided in your GitHub repository.

I attempted to download the dataset using Git LFS, but unfortunately, I encountered the following message: "This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access."

I was wondering if there might be an alternative way for me to access the dataset or if there is any possibility of expanding the data quota. I believe your dataset is crucial for further exploration and analysis in future research.

azinekrami commented 3 months ago

Hi, I have the same problem and couldn't access the dataset. I would really appreciate it if you could tell me how to download these files.

LeDilam commented 3 months ago

I tried to download the project using "git clone", but it fails while downloading the pages of the dataset. However, navigating through the folders on the GitHub website and downloading the tar.gz archives one by one works.

azinekrami commented 3 months ago

I did the same thing, but the size of the downloaded file is less than 1K and it cannot be unzipped.

LeDilam commented 3 months ago

"I did the same thing, but the size of the downloaded file is less than 1K and it cannot be unzipped."

It did work for me. I don't know exactly how I did it (I know it was two manual downloads from the GitHub page), and I cannot reproduce it. I still have both complete archives and their decompressed versions.
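
A downloaded file of less than 1K that cannot be unzipped is consistent with a Git LFS pointer file, the small text stub that Git stores in place of the real archive, rather than the archive itself. That this is what happened here is an assumption, not something confirmed in this thread. The sketch below checks for it by looking for the version line that LFS pointer files start with. The function names `parse_lfs_pointer` and `is_lfs_pointer` are hypothetical helpers written for this illustration, and they are not part of Git LFS or of this repository.

```python
from typing import Dict, Optional

# Git LFS pointer files are short text files whose first line is this version marker.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(data: bytes) -> Optional[Dict[str, str]]:
    """Return the pointer's key/value fields, or None if data is not an LFS pointer."""
    if not data.startswith(LFS_POINTER_PREFIX):
        return None
    fields = {}
    # Each pointer line has the form "<key> <value>", e.g. "size 12345".
    for line in data.decode("utf-8", errors="replace").splitlines():
        key, _, value = line.partition(" ")
        if key:
            fields[key] = value
    return fields


def is_lfs_pointer(path: str) -> bool:
    """Check whether the file at path looks like an LFS pointer (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(1024)  # pointer files are tiny, so the first 1 KB is enough
    return parse_lfs_pointer(head) is not None
```

If `is_lfs_pointer` returns True for a downloaded tar.gz file, the file holds only the `oid` and `size` of the real archive, which would explain why it cannot be decompressed. The archive content itself still has to come from LFS storage, so this check diagnoses the symptom but does not work around the bandwidth quota.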