ClimateMisinformation / Scrapers

Web scrapers

Scrape script `breibart-defense.py` is failing with no such file or directory #13

Closed: ebrucucen closed this issue 3 years ago

ebrucucen commented 3 years ago

[x] What is the trigger: Running the breibart-defense script causes the failure with the error message below

python3 ./scrape-scripts/breibart-defense.py 

[x] What is the error message:

Traceback (most recent call last):
  File "breibart-defense.py", line 72, in <module>
    pandas.DataFrame(articles).to_csv('data/breibart-defense.csv', index = False)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/generic.py", line 3170, in to_csv
    formatter.save()
  File "/usr/local/lib/python3.8/site-packages/pandas/io/formats/csvs.py", line 185, in save
    f, handles = get_handle(
  File "/usr/local/lib/python3.8/site-packages/pandas/io/common.py", line 493, in get_handle
    f = open(path_or_buf, mode, encoding=encoding, errors=errors, newline="")
FileNotFoundError: [Errno 2] No such file or directory: 'data/breibart-defense.csv'

[x] What is the expected behaviour: the `data/breibart-defense.csv` file to be populated with the links and articles

alexn11 commented 3 years ago

These scripts should be launched from the root of the project, where the 'data' folder is located; I think that's the origin of the error message.
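
For example, something like this at the end of the script would make it independent of the working directory. This is only a minimal sketch, assuming `scrape-scripts/` sits one level below the repo root and that `articles` is the list of records the script already builds (placeholder here):

```python
from pathlib import Path

import pandas

# Resolve the repository root from the script location rather than from the
# current working directory (assumes scrape-scripts/ sits one level below the root).
REPO_ROOT = Path(__file__).resolve().parent.parent
OUT_DIR = REPO_ROOT / "data"
OUT_DIR.mkdir(parents=True, exist_ok=True)  # create data/ if it is missing

# Placeholder for the list of records the script already builds while scraping.
articles = [{"url": "https://example.com/article", "text": "..."}]

pandas.DataFrame(articles).to_csv(OUT_DIR / "breibart-defense.csv", index=False)
```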

ebrucucen commented 3 years ago

> These scripts should be launched from the root of the project, where the 'data' folder is located; I think that's the origin of the error message.

Fair comment Alex, I think a consistent approach across all scripts would definitely be better. Carbon_sense and cato-institute do output to their current directories:

df.to_csv("carbon_sense.csv", index=False)

Taking your suggestion into account, we need an execution script to guide us (especially newbies like me), so I will close this one and create another issue for a run script, if you agree? Something like the sketch below could be a starting point.
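
This is just a rough idea, not tested: a hypothetical `run_scrapers.py` at the repo root that runs every scraper from there, so relative paths like `data/` resolve the same way for all scripts:

```python
import subprocess
import sys
from pathlib import Path

# Run every scraper from the repository root so relative paths like data/
# resolve the same way for all scripts.
REPO_ROOT = Path(__file__).resolve().parent
SCRIPTS_DIR = REPO_ROOT / "scrape-scripts"

for script in sorted(SCRIPTS_DIR.glob("*.py")):
    print(f"Running {script.name} ...")
    subprocess.run([sys.executable, str(script)], cwd=REPO_ROOT, check=True)
```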

alexn11 commented 3 years ago

Sure, why not. But we should also consider that these scripts were meant more for one-off scraping than for regular use. If we do want to rescrape regularly (not sure about that), we would need to rewrite the scripts so that they don't redownload everything again. Setting up a consistent approach would be great if we need to do more scraping.
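
If we do go that way, the simplest thing might be to skip URLs that are already in the CSV and only append new rows. A rough sketch of the idea, where the `url` column and the two helper functions are stand-ins for whatever the scraper actually does, not the current code:

```python
from pathlib import Path

import pandas

OUT_FILE = Path("data/breibart-defense.csv")


def list_article_urls():
    # Stand-in for however the scraper currently discovers article links.
    return ["https://example.com/article-1", "https://example.com/article-2"]


def scrape_article(url):
    # Stand-in for the actual download/parse step.
    return {"url": url, "text": "..."}


# URLs already saved in previous runs, so we only fetch what is new.
seen = set()
if OUT_FILE.exists():
    seen = set(pandas.read_csv(OUT_FILE)["url"])

new_articles = [scrape_article(u) for u in list_article_urls() if u not in seen]

if new_articles:
    OUT_FILE.parent.mkdir(parents=True, exist_ok=True)
    pandas.DataFrame(new_articles).to_csv(
        OUT_FILE, mode="a", header=not OUT_FILE.exists(), index=False
    )
```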

ebrucucen commented 3 years ago

I see your point, so would a regular scraping setup be the next step, for when we want "live" data consumption and to track the "new" news about climate misinformation?

alexn11 commented 3 years ago

Well, that's a good point. I think it would be good to discuss this at the next meetup. I missed the last one unfortunately, so someone else might be better able to say what the current priorities are, but seeing all the activity I understand that people are labelling the data. That is in line with the idea of having a working product as soon as possible and then seeing how to improve it (this idea seemed to have good support during the last meetup I attended, including from me).

That being said, it wouldn't hurt to think about the longer term and try to make things (such as the earlier parts of the pipeline) work smoothly. So maybe we want to start setting up our scraping scripts so that they can run on a regular basis. Maybe we should discuss this on the Slack and see what people think about it.