euagendas / semeval_8_2022_ia_downloader

Internet Archive downloader for Task 8 at SemEval

Can't download dataset #4

Open Lukecn1 opened 2 years ago

Lukecn1 commented 2 years ago

However, I am having trouble downloading the data: many of the links no longer work and therefore cannot be scraped.

This is true even for sample_data.csv, where a large percentage of the pairs are missing one or both articles.

Are you able to share the evaluation dataset privately?

computermacgyver commented 2 years ago

Hi @Lukecn1. Unfortunately, copyright law prevents us from sharing the news articles directly :disappointed: Most articles are available on the Internet Archive, and the code should automatically try to download them from there. The sample data was created early in the project, before we started ensuring that articles were on the Internet Archive. So although the sample data may be missing articles, most of the actual articles used in the SemEval competition should be available.
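For anyone who wants to check by hand whether a given link has an archived copy, here is a minimal sketch using the Internet Archive's public Wayback availability endpoint. It is not the downloader's own code, and the helper names `availability_query` and `closest_snapshot` are made up for illustration. The network call itself is left out, so the snippet only builds the query URL and parses a response body you have already fetched.

```python
import json
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint.
WAYBACK_API = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the query URL asking whether `url` has an archived snapshot.

    `timestamp` is an optional YYYYMMDD string; the API then returns the
    snapshot closest to that date.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{WAYBACK_API}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return the archived snapshot URL from an API response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

Fetching the URL returned by `availability_query(...)` with any HTTP client and passing the body to `closest_snapshot(...)` gives either a snapshot URL or `None`, which is a quick way to tell a dead link apart from a problem in your own setup.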

Lukecn1 commented 2 years ago

That's fair; I hadn't considered the copyright aspect.

However, I experienced the same issue when scraping the evaluation dataset.

I will try again from scratch and see if maybe it's an issue on my end.

intifa233 commented 2 years ago

Hi, this question may be stupid; I am just a beginner at Python. I created a new environment and successfully installed the requirements.txt. I also installed the downloader with "pip install semeval_8_2022_ia_downloader". When I ran "python -m semeval_8_2022_ia_downloader.cli --links_file=input.csv --dump_dir=output_dir", it said "FileNotFoundError: [Errno 2] No such file or directory: 'input.csv'". Could you please tell me what I should do? Thank you!

computermacgyver commented 2 years ago

Welcome @intifa233. All questions are good ones. I'm opening a separate issue to discuss this. Please see #5
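As general background on the `FileNotFoundError` quoted above: Python raises it when the path given does not exist relative to the current working directory. Assuming `input.csv` in that command is a placeholder for your own links file, a small pre-flight check like the sketch below can confirm the path before running the downloader. The helper name `resolve_links_file` is hypothetical and not part of the package.

```python
from pathlib import Path


def resolve_links_file(name):
    """Return the absolute path of the links file, or raise a clear error.

    Relative names are resolved against the current working directory,
    which is also how the command-line argument would be interpreted.
    """
    path = Path(name).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(
            f"links file not found: {path} (current directory: {Path.cwd()})"
        )
    return path
```

Calling `resolve_links_file("input.csv")` from the directory where you run the command either returns the full path or reports exactly where Python looked.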