HTR-United / htr-united

Ground Truth Resources for the HTR of patrimonial documents
https://htr-united.github.io
Creative Commons Zero v1.0 Universal

Contact Kim Pham - JCRS 2020 #59

Closed PonteIneptique closed 1 year ago

PonteIneptique commented 2 years ago

Kim Pham has published an awesome dataset:

It might be worth contacting her.

PonteIneptique commented 2 years ago

OK, found her on GitHub too, but still no email address: @kimpham54

PonteIneptique commented 2 years ago

Dear Kim, in case you get a notification and read this: we are trying to get in touch to ask whether you would be willing to document the wonderful dataset that is on Zenodo through https://htr-united.github.io/document-your-data.html ? :)

kimpham54 commented 1 year ago

hi @PonteIneptique, sorry I just saw this message!

Yes, that should be fine, but is it alright if you fill out a Terms of Use form?

kimpham54 commented 1 year ago

also feel free to get in touch with my github username at gmail

PonteIneptique commented 1 year ago

Hi @kimpham54! We are not specifically looking to use your dataset (at least I am not, or not right now), but we are trying to catalog as much open data as possible for HTR United ( https://htr-united.github.io/catalog.html ), which is why we are contacting the people behind datasets :) I would be interested in having access to the dataset, but mostly to compute metrics (number of characters, lines, regions, files) to make the cataloging "better". I could definitely sign a Terms of Use for this purpose :)

If you have time, would you be willing to fill out a form to document this dataset? There is a very simple one here: https://htr-united.github.io/document-your-data.html

kimpham54 commented 1 year ago

htr-united.zip

Attached are the metadata files plus the terms of use form that you can include in the data files. Feel free to also sign the terms of use. Thank you!

PonteIneptique commented 1 year ago

Thank you! We'll most likely publish only the open dataset ( https://doi.org/10.5281/zenodo.4242885 ), as it is the only one that contains only ground truth. However, would it be OK with you if I changed:

description: Training and validation set. Transcribed records available upon request.

to

description: Training and validation set. Transcribed records ( https://doi.org/10.5281/zenodo.4150880 ) available upon request: access to the transcribed dataset is granted upon filling out a terms of use form: https://specialcollections.du.edu/cad/form/termsOfUse. Contact author for more details.

We could also add part of the form you sent, specifically:

The transcribed corpus of records from the Jewish Consumptive Relief Society contains data that include individually identifiable health information, among other sensitive information regarding persons and people. All individuals for whom records are provided have been deceased for at least 70 years, but were they still living today, these records would be recognized as being protected health information under the US Health Insurance Portability and Accountability Act of 1996 (HIPAA). While HIPAA and other privacy laws no longer apply to these individuals, in providing these data the University of Denver wishes to foster research practices that express the utmost respect for the human beings whose lives are represented, at least in some part, in these collections. In addition, we ask researchers respect the lives of these individuals’ ancestors and their communities.

To foster practices that honor patients, staff, nurses and physicians connected with the JCRS Sanitorium, as well as their families, ancestors and communities, we ask that researchers disclose their intended use of the collection for review by our Advisory Board (see reverse). This Board is comprised of ethicists, historians, librarians, attorneys, physicians, and members of the Jewish community.

In addition, we ask researchers agree to conduct their work under the following set of principles:

- I affirm the role of JCRS patients and staff as data creators and will avoid exploiting and/or dehumanizing them by treating them simply as data.
- My research will, when possible and appropriate, account for the contexts surrounding the JCRS subjects as data arise.
- My work will recognize that all data and datasets are shaped by decisions about how histories are recorded, remembered, and valued.
- If the nature of my work is such that I am sharing the life stories and/or narratives of individuals in these data, and I can do so with no potential harm to their reputation or that of their ancestors, I will honor them by naming them.
- If the nature of my work is such that I am exploring large-scale patterns in the dataset, and naming individuals serves no specific research purpose, I will anonymize and/or redact names within the data.
- If I am publishing the results of research conducted with these data, I will, if possible and appropriate, include a note of recognition and/or gratitude in my publication. We suggest a version of: “This work was made possible in part by the patients, staff, nurses, physicians, and community of the Jewish Consumptive Relief Society (JCRS). The people who lived, worked, and died at the JCRS sought to relieve human suffering. I am grateful to them.”

?

PonteIneptique commented 1 year ago

I computed the following metrics on the public dataset:

volume:
    - {count: 36027, metric: "lines"}
    - {count: 2660, metric: "files"}
    - {count: 4254, metric: "regions"}
    - {count: 3494619, metric: "characters"}
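
For anyone who wants to reproduce counts like these, here is a minimal sketch. It assumes the ground truth is stored as PAGE XML, one file per page; the thread does not say which format or tool was actually used, so this is not necessarily how the numbers above were obtained. The function names `page_metrics` and `corpus_metrics` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag: str) -> str:
    """Strip the XML namespace, e.g. '{ns}TextLine' -> 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def page_metrics(root: ET.Element) -> dict:
    """Count regions, lines and line-level characters in one page.

    Assumption: PAGE XML layout, i.e. TextRegion > TextLine > TextEquiv > Unicode.
    """
    counts = {"regions": 0, "lines": 0, "characters": 0}
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "TextRegion":
            counts["regions"] += 1
        elif name == "TextLine":
            counts["lines"] += 1
            # Only count the transcription attached directly to the line,
            # so region-level copies of the text are not counted twice.
            for child in elem:
                if local_name(child.tag) != "TextEquiv":
                    continue
                for uni in child:
                    if local_name(uni.tag) == "Unicode":
                        counts["characters"] += len(uni.text or "")
    return counts


def corpus_metrics(folder: str) -> list:
    """Aggregate the per-page counts over every .xml file under `folder`."""
    total = {"lines": 0, "files": 0, "regions": 0, "characters": 0}
    for path in sorted(Path(folder).rglob("*.xml")):
        page = page_metrics(ET.parse(path).getroot())
        total["files"] += 1
        for metric, count in page.items():
            total[metric] += count
    # Same shape as the `volume` entries above: a list of count/metric pairs.
    return [{"count": count, "metric": metric} for metric, count in total.items()]
```

Calling `corpus_metrics("path/to/dataset")` (a placeholder path) returns a list of `{"count": ..., "metric": ...}` dictionaries that can be dumped as the `volume` block. Character counts depend on choices such as whether whitespace is included, so results may differ from the figures above.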

kimpham54 commented 1 year ago

Sure, the changes sound good. Thanks

alix-tz commented 1 year ago

Thank you very much @kimpham54 ! :)