Open cakiki opened 2 years ago
Processed version exists here: https://github.com/CogentMentat/FRevNCA_CuratedData, but not from the data creators.
@davanstrien The data itself will be TEI XML; what sort of loading script do you reckon we should create for that?
@cakiki thanks for suggesting this! I think for TEI it depends a little on how many of the fields we want to extract. If we want to stay closer to the original, we probably want a custom datasets script that parses the XML into a nice structure. One potential issue is that parsing can be quite slow. Depending on how much of a problem that becomes, we could also look into creating a 'preloaded' version, i.e. parse the data once and save it in a format that datasets can load very quickly.
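To make the parse-once-then-preload idea concrete, here is a minimal sketch using only the Python standard library. It assumes the speeches are encoded with the standard TEI `sp`, `speaker` and `p` elements inside `div` elements. That is a guess about the files, not something checked against them, and the sample document, field names and function names below are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; the element layout below is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Tiny made-up document, standing in for one of the real TEI files.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="session" n="1789-05-05">
      <sp who="#speaker_a"><speaker>Speaker A</speaker>
        <p>First paragraph.</p><p>Second paragraph.</p></sp>
      <sp who="#speaker_b"><speaker>Speaker B</speaker>
        <p>A reply.</p></sp>
    </div>
  </body></text>
</TEI>"""


def parse_speeches(xml_string):
    """Turn each <sp> element into a flat dictionary."""
    root = ET.fromstring(xml_string)
    records = []
    for div in root.iterfind(".//tei:div", TEI_NS):
        for sp in div.iterfind("tei:sp", TEI_NS):
            # Collapse internal whitespace in every paragraph.
            paragraphs = [
                " ".join("".join(p.itertext()).split())
                for p in sp.iterfind("tei:p", TEI_NS)
            ]
            records.append({
                "session": div.get("n", ""),
                "speaker_id": sp.get("who", ""),
                "speaker": sp.findtext("tei:speaker", default="", namespaces=TEI_NS),
                "text": "\n".join(paragraphs),
            })
    return records


def to_jsonl(records):
    """Serialise the records as JSON Lines, one speech per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


if __name__ == "__main__":
    print(to_jsonl(parse_speeches(SAMPLE_TEI)))
```

Writing that output to a `.jsonl` file would give a 'preloaded' copy that the generic JSON loader in datasets can read without touching the XML again; whether JSON Lines or another format such as Parquet is the better choice here is still open.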
@davanstrien Ah good thinking; maybe two configs then? One for people who only care about text, and another (closer to original) structured version?
(This issue will probably come up more often so we might want to have guidelines)
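A rough sketch of what the two configs could look like, independent of any particular loader: the same parsed record is projected to a text-only example or kept whole. The config names `text` and `structured`, the record fields and the function name are all hypothetical; in a datasets loading script they would presumably map onto named builder configs.

```python
# Hypothetical config names; nothing here has been agreed on yet.
CONFIG_NAMES = ("text", "structured")

# Made-up record, standing in for one parsed speech.
SAMPLE_RECORD = {
    "session": "1789-05-05",
    "speaker_id": "#speaker_a",
    "speaker": "Speaker A",
    "text": "First paragraph.\nSecond paragraph.",
}


def build_example(record, config_name):
    """Project a parsed record onto the fields of the chosen config."""
    if config_name == "text":
        # For people who only care about the text.
        return {"text": record["text"]}
    if config_name == "structured":
        # Closer to the original: keep every extracted field.
        return dict(record)
    raise ValueError(
        f"Unknown config {config_name!r}; expected one of {CONFIG_NAMES}"
    )


if __name__ == "__main__":
    for name in CONFIG_NAMES:
        print(name, build_example(SAMPLE_RECORD, name))
```

Keeping the projection in one function means both configs share a single parsing pass, so the structured version cannot drift away from the text-only one.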
@cakiki I can help document this dataset if useful.
@kmcdono2 That would be great! (I'm also happy to sit this one out if you'd like to take ownership of the issue)
@cakiki I'm not here to step on your toes, but I'm absolutely happy to help. This dataset from the PNAS paper is a subset of the larger FRDA AP dataset available from Stanford. Perhaps it would be useful to load both of them! I could start a new ticket for the other dataset (which covers a longer period, but has not been curated to the same extent as this one). I would suggest renaming this issue to highlight that it only covers 1789-91.
PS - the FRDA Images dataset would be GREAT to add as well! I can ask about access to that in bulk.
I've already opened an issue for the photo dataset here: https://github.com/bigscience-workshop/lam/issues/34 :smiley: Feel free to assign yourself there!
I think it would be great to have the PNAS subset (This paper is how I've come to discover the dataset!), but I wonder if it would make more sense to have one main dataset with multiple named configurations, one for every version. WDYT?
Initial commit here: https://huggingface.co/datasets/biglam/archives_parlementaires_revolution_francaise
TODO: yaml header and documentation.
@kmcdono2 Feel free to start documenting! Will start working on loading scripts soon.
Still want to check the processed version of the PNAS paper: https://github.com/CogentMentat/FRevNCA_CuratedData
A URL for this dataset
https://frenchrevdata.github.io/
Dataset description
Some work required to disambiguate between the different landing pages.
Dataset modality
Text
Dataset licence
No response
Other licence
No response
How can you access this data
Other
Confirm the dataset has an open licence
Contact details for data custodian
ssussman at stanford.edu