dlab-trainings / social-data-carpentry-2015

Inaugural hackathon for Social Software Carpentry

Courtlistener and commoncrawl as potential big open source repos for text #15

Open rdhyee opened 9 years ago

rdhyee commented 9 years ago

Re a comment by @davclark on email ("If folks have already-developed datasets that are amenable to a range of text processing, please let me know!"):

mlissner commented 9 years ago

Thanks for the shout out @rdhyee. Not sure the context here, but if folks have questions or need help with CourtListener data, I'm happy to help.

davclark commented 9 years ago

@mlissner you're showing up just at the right moment. First of all, we already have good, open text data with obvious analyses that would have broad appeal. See #11 and #13 for progress on congressional records.

Second, to clarify (and as you can see in the above issues), we have textual data that is open. What we need are textual data sources with curricula and documentation around them that are already in a format amenable to loading into a data frame in R, etc.

So, @mlissner do you have that? Or @rdhyee?

davclark commented 9 years ago

Also, thanks @rdhyee for getting this issue out of an email on a google event! :+1:

mlissner commented 9 years ago

Well, our data is available as JSON, XML, or a couple other more esoteric formats. Does that qualify?
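
As a rough illustration of what "loading JSON into a data frame" looks like in practice, here is a minimal Python/pandas sketch. The records and field names below are invented for the example, not CourtListener's actual schema; real field names would need to be checked against its API documentation.

```python
import json
import pandas as pd

# Hypothetical sample of opinion records -- NOT CourtListener's real schema,
# just stand-in fields to show the JSON -> data frame step.
raw = """[
  {"id": 1, "court": "scotus", "date_filed": "1973-01-22", "text": "..."},
  {"id": 2, "court": "ca9",    "date_filed": "1984-06-11", "text": "..."}
]"""

records = json.loads(raw)
df = pd.DataFrame(records)  # one row per opinion, one column per field
print(df[["id", "court", "date_filed"]])
```

The equivalent in R would be a `jsonlite::fromJSON` call producing a data frame, which is the shape the curriculum needs the data in.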

davclark commented 9 years ago

Man, it's hard to explain what I mean. But thanks for sticking with me.

I need a documented workflow in which it is not much work for beginners to get into tidy format in R. Metadata should be clear, and it should be easy to generate term-document matrices using the tm package and so on.
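
For concreteness, a term-document matrix is just documents-as-rows, terms-as-columns, counts-as-entries. The thread is about R's tm package (`DocumentTermMatrix`), but the same structure can be sketched in a few lines of stdlib Python; the toy documents below are invented:

```python
from collections import Counter

# Toy documents standing in for congressional-record speeches (invented text).
docs = [
    "the committee will come to order",
    "order in the committee please",
]

# Vocabulary: every distinct term across all documents, in sorted order.
vocab = sorted({term for doc in docs for term in doc.split()})

# Term-document matrix: rows are documents, columns follow `vocab`,
# entries are raw term counts -- the same layout tm's DocumentTermMatrix holds.
counts = [Counter(doc.split()) for doc in docs]
tdm = [[c[term] for term in vocab] for c in counts]

for doc, row in zip(docs, tdm):
    print(row, "<-", doc)
```

Getting from raw JSON/XML to this matrix with clear metadata is the step the workflow documentation needs to cover.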

For the congressional record: I have data in JSON and XML, including a flexible command-line script, Python bindings, and an API wrapper I wrote to hit the REST endpoint directly (also in Python). We also have the Wikipedia page that highlights interesting moments in the congressional record, as well as a nice two-party ideology split that should be observable in the data.

I want something that takes less work than working with what I already have!

rochelleterman commented 9 years ago

I thought we settled on the sunlight data we got via the ipython notebook that Dillon and I wrote?

mlissner commented 9 years ago

Well, I don't follow all that, but let me know if I can help (and if you didn't already decide to go with the sunlight data).

davclark commented 9 years ago

@rochelleterman I'm done talking about this on the issue for today. Are you committed to converting the file to the tabular format you'd need for tm? I probably don't have time until September or so.

rdhyee commented 9 years ago

Thanks @mlissner for jumping into this discussion. About CommonCrawl data, I would guess there's not "curricula and documentation around them, that are already in a format that is amenable to loading into a data frame in R".

rochelleterman commented 9 years ago

It's already converted. Check out "data.csv" in the repo. Sorry if I contributed to this confusion. Anyway, thanks @mlissner and @rdhyee for the heads up about CourtListener data. Looks like a fantastic resource that I will definitely keep in mind when constructing text-related curriculum, including for Data Carpentry.

brianwc commented 9 years ago

We're moving this repo to a different GitHub account soon, but you can see our Supreme Court JSON looking pretty at @brianwc/bulk_scotus
