gaza-reporters / gaza-reporters.github.io

Finish scraping/process CPJ data #8

Closed · zstumgoren closed 9 months ago

zstumgoren commented 9 months ago

The CPJ Database website has an underlying API that makes it relatively easy to grab their data.

I've written the first part of a Python script that scrapes all the data in their database. The code is in the bna-dre repo and is named scrape_cpj.py.

The data retrieved from the API is saved as JSON files, one per page, with each page containing 20 records.
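At a high level, the paging logic looks something like this (a minimal sketch, not the actual scrape_cpj.py; it assumes the requests library and omits the query filters the real call uses):

import json
import requests

# Endpoint taken from the API call shared later in this thread
BASE_URL = "https://datamanager.cpj.org/api/datamanager/reports/entries"

def fetch_page(page_num, page_size=20):
    params = {"pageNum": page_num, "pageSize": page_size}
    resp = requests.get(BASE_URL, params=params)
    resp.raise_for_status()
    return resp.json()

# The first response reports the total number of pages via "pageCount"
first_page = fetch_page(1)
page_count = first_page["pageCount"]

with open("page_1.json", "w") as f:
    json.dump(first_page, f, indent=2)

# Save one JSON file per remaining page
for num in range(2, page_count + 1):
    with open(f"page_{num}.json", "w") as f:
        json.dump(fetch_page(num), f, indent=2)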

The data I just harvested (as of Jan 18, 2024 @ 3:30pm Pacific) can be found in this Dropbox zipfile.

See the CPJ JSON Files section below for an example.

Note that this is only partial work to get you started. To transform this into usable data for analysis, and potentially for display on a website, you'll need to perform the following tasks:

CPJ JSON Files

{
    "rowCount": 2284,
    "pageNum": 115,
    "pageSize": "20",
    "pageCount": 115,
    "data": [
        {
            "organizations": "Freelance",
            "fullName": "Zoreslav Zamoysky",
            "location": "Bucha",
            "status": "Killed",
            "typeOfDeath": "Dangerous Assignment",
            "charges": null,
            "startDisplay": "March 5 - March 15, 2022",
            "mtpage": "https://cpj.org/data/people/zoreslav-zamoysky/",
            "country": "Ukraine",
            "type": "Journalist",
            "motiveConfirmed": "Confirmed"
        },
<<< SNIPPED FILE AFTER FIRST RECORD >>>
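Once you have the per-page files, stitching them back together could look something like this (a sketch; it assumes filenames like page_1.json, page_2.json, as in the paging sketch above):

import json
from pathlib import Path

# Collect the "data" records from every per-page file, in page order
records = []
page_files = sorted(Path(".").glob("page_*.json"),
                    key=lambda p: int(p.stem.split("_")[1]))
for path in page_files:
    with open(path) as f:
        records.extend(json.load(f)["data"])

print(len(records))  # should match the rowCount field (2284 at scrape time)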
r1ngs commented 9 months ago

Thanks Serdar!

@irenecasado wrote this code:

import json
from pathlib import Path

import pandas as pd

# Provide the full path to the file or use Path
file_path = Path("./entries.json")

# Read the JSON data from the file
with open(file_path) as f:
    json_data = json.load(f)

# "data" is the key holding the records we want to keep
data_values = json_data.get("data", [])

# Convert the records to a pandas DataFrame and write out a CSV
df = pd.DataFrame(data_values)
df.to_csv("reporters.csv", index=False)

and we ran it on the scraped file and got a CSV of all journalists killed around the world!

Then I wrote the code (which Irene corrected!) to turn it into a dataframe containing only the killings in Gaza:

import pandas as pd

# Load the full CSV and keep only entries from the Gaza conflict region
df = pd.read_csv('reporters.csv')
filtered_df = df[df["country"] == 'Israel and the Occupied Palestinian Territory']
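Continuing from the snippet above, saving and sanity-checking the filtered result could look like this (the output filename is just an example):

# Persist the Gaza-only subset and do a quick sanity check
filtered_df.to_csv("gaza_reporters.csv", index=False)
print(len(filtered_df), "records for Israel and the Occupied Palestinian Territory")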

zstumgoren commented 9 months ago

@irenecasado @r1ngs Can you all drop code into a Python module or Jupyter notebook and commit/push to this repo? That way we're centralizing all the code and it's readily available for anyone on the team (or me) to review/use.

Next week in class we'll learn about Git version control and how to push/pull code from GitHub to your local machines. In the meantime, feel free to just manually upload the additional code you mentioned in the last comment. Thanks!

r1ngs commented 9 months ago

I've added that now - https://github.com/r1ngs/bna-dre/blob/main/scrape_IRE

zstumgoren commented 9 months ago

@r1ngs The scrape_IRE file doesn't appear to have any code. Can you add the code and also give the file the appropriate extension (either .py or .ipynb, as appropriate)? Using the proper extension signals the nature of the file and has the added benefit that GitHub renders Jupyter notebooks visually. Ping back when you've had a chance to do that and I'll take another look. Thanks!

luyi-eve commented 9 months ago

@zstumgoren Hi, I just tried to create a new .ipynb file for Irene's code - https://github.com/r1ngs/bna-dre/blob/main/Scrape_Irene_updated.ipynb - let us know if that works. Thanks!

zstumgoren commented 9 months ago

@luyi-eve @irenecasado So to be clear, you mean you manually downloaded all the data from a single API call rather than paging through it? The link in the script doesn't appear to work when I test in a web browser, so hoping you can provide a working example of the link and confirm the strategy being used.

irenecasado commented 9 months ago

Hey @zstumgoren, I just updated the link in the code that Eve provided. Just in case, here it is again:

https://datamanager.cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=3000&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27,%27Unconfirmed%27))&in(type,%27Journalist%27,%27Media%20Worker%27)&ge(year,1992)&le(year,2024)

I modified the pageSize to 3000 to get all the entries in the database and created a JSON file. Then I converted the JSON file into a CSV following the steps in the code. Does that make sense? Let me know if you have any questions.
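In code, the single-call approach looks roughly like this (a sketch; it assumes the requests library and uses the URL above):

import json
import requests

# Pull everything in one API call by bumping pageSize past the total row count
url = (
    "https://datamanager.cpj.org/api/datamanager/reports/entries"
    "?distinct(personId)"
    "&includes=organizations,fullName,location,status,typeOfDeath,"
    "charges,startDisplay,mtpage,country,type,motiveConfirmed"
    "&sort=fullName&pageNum=1&pageSize=3000"
    "&in(status,%27Killed%27)"
    "&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27,%27Unconfirmed%27))"
    "&in(type,%27Journalist%27,%27Media%20Worker%27)"
    "&ge(year,1992)&le(year,2024)"
)

resp = requests.get(url)
resp.raise_for_status()

# Save the full payload to a single JSON file
with open("entries.json", "w") as f:
    json.dump(resp.json(), f, indent=2)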

zstumgoren commented 9 months ago

Yep, the key step that I didn't see in the code was the modification of the GET params to pull everything in one API call. Nicely done! 👍

zstumgoren commented 9 months ago

@irenecasado You might want to also update the pageSize param to 1 instead of 18, or try removing it entirely, to guard against any potential data loss. Also, as a data integrity check, please cross-reference the overall record count in the JSON (i.e., count the rows) against the site's figure, and eyeball the first few and last few records against the entries on the first and last pages of the site. That'll help ensure we got all the records. Can you post back here with confirmation once that's verified?

zstumgoren commented 9 months ago

@irenecasado Apologies, I meant to say the pageNum param above; I misread the API call - it's already 1, not 18 (that ampersand fooled my old eyes). So that's good. If we do the data integrity checks mentioned above, we can close the loop on this ticket. Thanks!
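Something like this would cover the checks (a minimal sketch, assuming the single entries.json file from the one-call approach):

import json

with open("entries.json") as f:
    payload = json.load(f)

records = payload["data"]

# 1. Cross-reference the API's reported total against the records returned
print("rowCount:", payload["rowCount"], "| records returned:", len(records))

# 2. Eyeball the first and last few records against the first and
#    last pages of the CPJ site
for record in records[:3] + records[-3:]:
    print(record["fullName"], record["country"], record["startDisplay"])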

irenecasado commented 9 months ago

Hey @zstumgoren I have done the data integrity checks and everything looks good! We can close this ticket :)

zstumgoren commented 9 months ago

@irenecasado Awesome. Nicely done!