ClimateMisinformation / Scrapers

Web scrapers

Archived June 2022

We will focus on building bots to detect greenwashing.

Climate Battle Cards

Description of directories

scrape-scripts

Web scrapers. See column-description.csv for a description of the columns.

Put your scraped data in the data directory.

Once everything has been scraped, run the data-merger.py script to merge all the data into one file (normally done only once).
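The merge step can be pictured with a minimal sketch. It is not the data-merger.py script itself, whose contents are not shown here. It assumes every scraped CSV already shares the same header row, and the function name merge_csvs is hypothetical.

```python
import csv


def merge_csvs(input_paths, output_path):
    """Concatenate CSV files that share a header into one file.

    Hypothetical helper: assumes all inputs use identical column names.
    Returns the number of data rows written.
    """
    header = None
    writer = None
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        for path in input_paths:
            with open(path, newline="", encoding="utf-8") as in_file:
                reader = csv.DictReader(in_file)
                if reader.fieldnames is None:
                    continue  # empty file, nothing to merge
                if writer is None:
                    # The first non-empty file defines the output header.
                    header = reader.fieldnames
                    writer = csv.DictWriter(out_file, fieldnames=header)
                    writer.writeheader()
                elif reader.fieldnames != header:
                    raise ValueError(
                        f"{path} has unexpected columns: {reader.fieldnames}"
                    )
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A call would pass the list of per-source CSV paths under the data directory and an output path of your choice. Failing on a header mismatch is a deliberate choice in this sketch: it surfaces files that still need normalizing instead of silently misaligning columns.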

normalizers

Scripts by Alex to normalize data according to the agreed columns (see column-description.csv).
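As a rough illustration of what normalizing to agreed columns involves, here is a minimal sketch. The column names and the mapping are hypothetical placeholders, not the contents of column-description.csv, and normalize_row is not a function from this repository.

```python
# Hypothetical agreed column names; the real ones are in column-description.csv.
AGREED_COLUMNS = ["url", "title", "article"]

# Hypothetical mapping from one source's column names to the agreed names.
COLUMN_MAP = {"link": "url", "headline": "title", "body": "article"}


def normalize_row(row, column_map, agreed_columns):
    """Rename source-specific keys and keep exactly the agreed columns.

    Missing columns become empty strings; unknown columns are dropped.
    """
    renamed = {column_map.get(key, key): value for key, value in row.items()}
    return {
        column: (renamed.get(column) or "").strip() for column in agreed_columns
    }
```

For example, a scraped row with the keys headline and link would come back with the keys url, title and article, with article left empty.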

data

CSVs of scraped data, organised by data source.

format_data_for_doccano

Doccano has very specific data requirements, especially when it comes to the size of the CSV files.

This Python script:

text_preprocessing

Implementation of text cleaning and document embedding
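For orientation, a text cleaning step of this kind might look like the sketch below. It is an assumption about typical cleaning for scraped articles (HTML entities, tags, URLs, whitespace, case), not the code in this directory, and clean_text is a hypothetical name.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")       # leftover HTML tags
_URL_RE = re.compile(r"https?://\S+")  # bare links in the article body
_WS_RE = re.compile(r"\s+")            # runs of any whitespace


def clean_text(raw):
    """Return a lowercased, tag-free, single-spaced version of raw text."""
    text = html.unescape(raw)      # e.g. "&amp;" becomes "&"
    text = _TAG_RE.sub(" ", text)  # drop tags but keep their inner text
    text = _URL_RE.sub(" ", text)  # drop URLs
    text = _WS_RE.sub(" ", text)   # collapse whitespace, including non-breaking spaces
    return text.strip().lower()
```

Lowercasing and URL removal discard information, so whether they are appropriate depends on the embedding technique applied afterwards.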

The document embedding techniques implemented:

model_evaluation

Evaluation of models and document embedding techniques

How to contribute

You can contribute to this repository by solving an issue or by suggesting improvements to the code, documentation and project organisation. Look for issues labeled "good-first-issue".

Before you start working on a change, please discuss it with the owners of this repository via GitHub Issues or our Slack channel.

If you are working on an existing issue, please assign it to yourself in GitHub Issues so that others can see what you are doing and duplicated work is avoided. When you finish working on an issue, unassign yourself.

To start contributing, follow the steps below:

1. Clone the repo.
2. Create a branch using git checkout -b feature-branch.
3. Make the required changes.
4. Commit and push your changes using the commands below:
       git add --all
       git commit -m "your commit message"
       git push origin feature-branch
5. Go to the repository on GitHub.
6. Create a pull request against the master branch.
7. Add a suitable title and description to the pull request, and reference the issue number in the description if the pull request relates to an issue logged in Issues.
8. You're done. Wait for your code to be reviewed and merged.

For more information, read the CONTRIBUTING guide.

Git workflow

We do not yet need to deploy to a production environment many times per day, so we use the GitHub flow strategy to merge changes. It is described at https://guides.github.com/introduction/flow/. The only GitHub flow rule is that anything in the master branch is deployable. Each change is reviewed on a feature branch, then merged into master.