Update file names for existing transcript data

DamonCharlesRoberts commented 1 year ago

Go through each transcript and make sure that each PDF contains the transcript of a hearing for only one nominee. If a PDF covers additional nominees, split it and create a new PDF for each of them.
Name each PDF (which should be one per nomination hearing) with the following convention:
- last-name_month_day_year.pdf
- e.g., brown_10_22_2003.pdf
Place each PDF in a folder based on whether the nominee is POC or White and whether they are Male or Female.
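
A minimal sketch of the naming and filing convention above, in case it helps to script the renaming. The function names and the folder names (`poc_female`, `white_male`, and so on) are hypothetical, since the folder names are not specified here, and zero-padding the month and day is an assumption.

```python
from datetime import date
from pathlib import Path


def transcript_filename(last_name: str, hearing_date: date) -> str:
    """Build a name of the form last-name_month_day_year.pdf."""
    # Lowercase the surname; month and day are zero-padded (an assumption).
    return f"{last_name.strip().lower()}_{hearing_date:%m_%d_%Y}.pdf"


def destination_folder(root: Path, is_poc: bool, is_female: bool) -> Path:
    """Pick the folder for a nominee from the two groupings."""
    # Hypothetical folder names; adjust to whatever the repo actually uses.
    race = "poc" if is_poc else "white"
    gender = "female" if is_female else "male"
    return root / f"{race}_{gender}"


if __name__ == "__main__":
    name = transcript_filename("Brown", date(2003, 10, 22))
    # The is_poc / is_female values here are placeholders for illustration.
    folder = destination_folder(Path("transcripts"), is_poc=True, is_female=True)
    print(folder / name)
```

Building the name from a `date` object rather than typing it by hand keeps the separators consistent across files.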

madelinemader commented 11 months ago

For the transcripts that we need to split, what information exactly is needed? For example, the transcript “bryant_09-26-2006” includes the hearing for Vanessa Bryant and Michael Wallace. But the way the hearing is organized, Bryant and Wallace are presented, then they give statements, then there are witness statements, then the written questions and answers. So for this one, do I just need to pull the section for Bryant from “statements of the nominees”, or is other information needed?

DamonCharlesRoberts commented 11 months ago

Hmmm, good question. Some of these are a lot more messy than I had realized in the past. So I think this is forcing us to make a decision here.

In terms of interruptions, we ONLY want the text from the transcripts of the hearing -- what was said in real-time during the hearing.

I do think that there is something interesting to be gleaned from the written remarks and the questions the nominees were asked, with regard to the second question in our project: "Is the topic of the questions and answers between nominees based on their gender and racial/ethnic identity different?" I think we could get some useful insights there by considering the full text, which I am realizing we had used in the past for a lot of these.

So, my initial thought is to just keep everything. Our model for interruptions doesn't standardize; it just takes raw counts of interruptions, which can't occur in the written statements, so keeping them won't influence our results there. And if we keep the material that comes with the transcripts, then we can say that the whole confirmation process, not just the hearings, differs between male and female and between non-POC and POC nominees. Our discussion of the topic models would then need to be broadened to the whole confirmation process rather than just the hearing, but yeah.
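
To illustrate the point about raw counts, here is a small sketch with made-up numbers (none of these figures come from our data). Appending written material that contains no interruptions leaves the raw count unchanged, whereas a standardized measure such as interruptions per word would shift.

```python
def interruption_rate(interruptions: int, words: int) -> float:
    """Standardized measure: interruptions per word of text."""
    return interruptions / words


# Hypothetical figures for a single nominee's transcript.
spoken_interruptions = 12   # interruptions during the live hearing
spoken_words = 8000         # words in the spoken portion
written_interruptions = 0   # written Q&A cannot contain interruptions
written_words = 4000        # words in the written portion

# Raw counts: identical with or without the written material.
raw_hearing_only = spoken_interruptions
raw_full_text = spoken_interruptions + written_interruptions

# Standardized rates: the extra words dilute the rate.
rate_hearing_only = interruption_rate(spoken_interruptions, spoken_words)
rate_full_text = interruption_rate(raw_full_text, spoken_words + written_words)

if __name__ == "__main__":
    print(raw_hearing_only, raw_full_text)
    print(rate_hearing_only, rate_full_text)
```

If we ever move to a standardized measure, the written portions would need to be excluded from the denominator.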

What say you @madelinemader and @tylerpgarrett?

madelinemader commented 11 months ago

But is the relevant portion only the spoken statements of the nominees? In the above example, there are two nominees, and both of their spoken statements come before the written statements and before the spoken witness statements but after the spoken statements from the senators. So for the purposes of splitting the transcripts for the two nominees in this session, should I just pull the "spoken statements of the nominees" portion?

DamonCharlesRoberts commented 11 months ago

Yeah, so the most relevant part is the spoken statements. If we can't split things up very cleanly between nominees, we should focus on splitting the spoken statements as best we can.

DamonCharlesRoberts / judiciary_nominations

Update file names for existing transcript data #26