bigscience-workshop / lam

Libraries, Archives and Museums (LAM)
Apache License 2.0

Add dataset: archives_parlementaires_revolution_francaise #33

Open cakiki opened 2 years ago

cakiki commented 2 years ago

A URL for this dataset

https://frenchrevdata.github.io/

Dataset description

The Archives parlementaires is a chronologically-ordered edited collection of sources on the French Revolution. It was conceived in the mid-19th century as a project to produce a definitive record of parliamentary deliberations and also includes letters, reports, speeches, and other first-hand accounts from a great variety of published and archival sources. FRDA currently contains the AP volumes covering the years 1787-1794, which can be searched using ARTFL's PhiloLogic 4 open source software platform. The texts have been marked up using TEI so that speakers, places, dates, and terms in the published index can be easily found. Users can see either scanned images of the AP pages or just the texts.

The code and data available in our GitHub repo are derived from work done under the auspices of Stanford’s French Revolution Digital Archive. The FRDA project scanned, OCRed, and encoded the first 82 (of 102) volumes of the Archives parlementaires (AP), the record of speeches and deliberations from French Revolutionary constitutional and legislative assemblies. These volumes only cover the first five years of the French Revolution, from the Cahiers des États Généraux of 1789 until 4 January 1794. FRDA, a collaboration between faculty and students in the humanities at Stanford University, Stanford Libraries, and the Bibliothèque nationale de France (BnF), presented the AP through a user interface permitting basic keyword and chronological searching.

FRDA received praise from the community of scholars working on the Revolution, but developments in digital humanities methods, researcher requests, and newly available data motivate the original researchers to expand from the FRDA foundation.

The data available on this site is the product of data cleaning performed by ARTFL (The Project for American and French Research on the Treasury of the French Language) at the University of Chicago. As a result, these XML files contain fewer OCR errors and more consistent markup than the materials currently searchable through the FRDA interface.

Further Development

Work is currently underway to disambiguate names within the XML corpus, linking each name to an individual. Many of these individuals (the parliamentarian deputies) are associated with biographical metadata in a database developed by the Service de la Bibliothèque et des Archives de l’Assemblée nationale. We anticipate building an interface to allow scholars to query the AP data using the biographical parameters in that database, but the database itself will not be included in the downloads available here.

Some work is required to disambiguate between the different landing pages.

Dataset modality

Text

Dataset licence

No response

Other licence

No response

How can you access this data

Other

Confirm the dataset has an open licence

Contact details for data custodian

ssussman at stanford.edu

cakiki commented 2 years ago

Processed version exists here: https://github.com/CogentMentat/FRevNCA_CuratedData, but not from the data creators.

cakiki commented 2 years ago

@davanstrien The data itself will be TEI XML; what sort of loading script do you reckon we should create for that?

davanstrien commented 2 years ago

@cakiki thanks for suggesting! I think for TEI it depends a little bit on how many of the fields we want to extract. If we want to keep more true to the original we probably want a custom datasets script to parse the XML into a nice structure. One potential issue with that is it can be quite slow. Depending on how much that becomes an issue we could also look into creating a 'preloaded' version. I.e. parse the data and save it in a format that datasets can load very quickly.
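As a rough illustration of the "parse once, save in a fast format" idea, here is a minimal sketch using only the Python standard library. The element layout (`div`, `date`, `sp`, `speaker`, `p`) and the sample fragment are assumptions for illustration, not the actual markup of the AP files, and `parse_speeches` and `write_jsonl` are hypothetical helper names. The JSON Lines output is the kind of file that `datasets` can read with its generic `json` loader.

```python
import json
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; whether the AP files declare it is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment for illustration; real AP markup may differ.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="session">
      <date when="1789-06-17">17 juin 1789</date>
      <sp><speaker>M. Bailly</speaker><p>Messieurs, la séance est ouverte.</p></sp>
      <sp><speaker>M. Mirabeau</speaker><p>Je demande la parole.</p><p>La motion est urgente.</p></sp>
    </div>
  </body></text>
</TEI>"""


def parse_speeches(xml_text):
    """Return one record per <sp> element: date, speaker, and joined paragraphs."""
    root = ET.fromstring(xml_text)
    records = []
    for div in root.iterfind(".//tei:div", TEI_NS):
        date_el = div.find("tei:date", TEI_NS)
        date = date_el.get("when") if date_el is not None else None
        for sp in div.iterfind("tei:sp", TEI_NS):
            speaker = sp.findtext("tei:speaker", default="", namespaces=TEI_NS)
            # itertext() also collects text nested inside inline markup.
            paragraphs = [
                "".join(p.itertext()).strip() for p in sp.iterfind("tei:p", TEI_NS)
            ]
            records.append(
                {"date": date, "speaker": speaker.strip(), "text": "\n".join(paragraphs)}
            )
    return records


def write_jsonl(records, path):
    """Write the 'preloaded' version: one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The slow XML parsing would then happen once, offline, and users would only ever load the flat file.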

cakiki commented 2 years ago

@davanstrien Ah good thinking; maybe two configs then? One for people who only care about text, and another (closer to original) structured version?

(This issue will probably come up more often so we might want to have guidelines)
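One way the two-config idea could look, sketched in plain Python rather than as a real `datasets` loading script. `ConfigSketch`, `CONFIGS`, and `generate_examples` are hypothetical names, and the record fields (`date`, `speaker`, `text`) are assumed. The generator yields `(key, example)` pairs, which is the shape a `datasets` builder's `_generate_examples` method produces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical stand-in for a named dataset configuration."""
    name: str
    description: str
    project: Callable[[dict], dict]  # maps a parsed record to the exposed example


def text_only(record: dict) -> dict:
    # For users who only care about the text.
    return {"text": record["text"]}


def structured(record: dict) -> dict:
    # Closer to the original: keep every parsed field.
    return dict(record)


CONFIGS: Dict[str, ConfigSketch] = {
    c.name: c
    for c in (
        ConfigSketch("text", "Plain text only", text_only),
        ConfigSketch("structured", "Text plus parsed TEI fields", structured),
    )
}


def generate_examples(
    records: Iterable[dict], config_name: str
) -> Iterator[Tuple[int, dict]]:
    """Yield (key, example) pairs for the chosen configuration."""
    config = CONFIGS[config_name]
    for idx, record in enumerate(records):
        yield idx, config.project(record)
```

Both configurations share one parsing step and differ only in the projection, so adding a further variant later would mean adding one more entry to `CONFIGS`.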

cakiki commented 2 years ago

self-assign

kmcdono2 commented 2 years ago

@cakiki I can help document this dataset if useful.

cakiki commented 2 years ago

@kmcdono2 That would be great! (I'm also happy to sit this one out if you'd like to take ownership of the issue)

kmcdono2 commented 2 years ago

@cakiki I'm not here to step on your toes, but absolutely happy to help. This dataset from the PNAS paper is a subset of the larger FRDA AP dataset available from Stanford. Perhaps useful to load both of them! I could start a new ticket for the other dataset (which covers a longer period, but has not been curated to the same extent as this one). But I would suggest renaming this one to highlight that it only covers 1789-91.

kmcdono2 commented 2 years ago

PS - the FRDA Images dataset would be GREAT to add as well! I can ask about access to that in bulk.

cakiki commented 2 years ago

I've already opened an issue for the photo dataset here: https://github.com/bigscience-workshop/lam/issues/34 :smiley: Feel free to assign yourself there!

I think it would be great to have the PNAS subset (this paper is how I came to discover the dataset!), but I wonder if it would make more sense to have one main dataset with multiple named configurations, one for every version. WDYT?

cakiki commented 2 years ago

Initial commit here: https://huggingface.co/datasets/biglam/archives_parlementaires_revolution_francaise

TODO: yaml header and documentation.

@kmcdono2 Feel free to start documenting! Will start working on loading scripts soon.

I still want to check the processed version from the PNAS paper: https://github.com/CogentMentat/FRevNCA_CuratedData