As a researcher, I want a book dataset that is based on the content and doesn't include irrelevant fields from zotero so that I can focus on the data specific to this project.

kmcelwee commented 3 years ago

Dev Notes

[x] #269
[x] create a new manage command similar to other data exports (reference_data in books and intervention_data in interventions); should extend the reference data command for reuse, similar to the way the intervention data export does
[x] export should be based on instance model, since that is what the reference data links to; should export every instance that is cited in a derrida work (i.e., Instance.objects.filter(cited_in__isnull=False), same as the existing zotero export) — but we'll need to check this filter at some point, because according to #246 this doesn't include all the books in the reference data export
[x] adapt or refer to logic in as_zotero_item method on Instance since the metadata we want to export should be similar; remove that method when we're done (or rename/reuse any code that's helpful)
[x] Ensure that integer slug work is tracked somewhere. Link to conversation
[x] Read Zotero code and fix gaps
[x] Do str(creator.person), not str(creator) (commit 3054e0d8d793783449748dcffa48e083634c4f24)
[x] Address titles/151. Link to conversation

export should include the following fields (preliminary list):

id : instance.get_uri(), should match book id in reference export
item_type: instance.item_type
work_title : work.primary_title
work_short_title : work.short_title
alternate_title: instance.alternate_title
work_year: work.year
copyright_year: instance.copyright_year
print_date: instance.print_date, taking into account the print date day/month/year known fields
work_authors: work.authors
publisher: instance.publisher
pub_place: instance.pub_place
is_extant
is_annotated
is_translation
has_dedication
has_insertions
copy
dimensions
work_uri
work_authors
work_subjects
languages
journal_title: instance.journal.name
book_title: instance.collected_in.display_title
book_title_uri: instance.collected_in.
start_page
end_page
has_digital_edition
uri [finding aid url: these almost certainly don't resolve anymore! can we get help from PUL to get catalog links for the same items?]
zotero_id (for compatibility with previous dataset)

books have a work/instance structure to handle multiple different editions and translations of the same work (including at least one case where there are multiple copies of the exact same edition, for which we use the copy field to distinguish)

item type can be book, book section, or journal; a book section should belong to a book (instance) in the database, and some publication metadata should be pulled from that book record (see the zotero data for an example)

I would be open to two exports, one for works and one for instances (editions? copies? books) if you think that would simplify things any and not be too much trouble for people to work with.

Questions from Zotero code

Do we want to collapse alternate_title and primary_title field?

template['title'] = self.alternate_title or self.work.primary_title

Nope

Why did we not want publication place for journal articles? Should we mimic this?

if self.pub_place.count() and not self.item_type == 'Journal Article':
   template['place'] = '; '.join([place.name for place in self.pub_place.all()])

This may have been because Zotero didn't handle journal articles

I have added a field contributors to handle all non-author creators. Is that appropriate?
- authorized_name (the default for __str__) for these creators is odd, examples below. Should I just use lastname_first instead? or would we want to preserve this information?

do str(creator.person), not str(creator)

Gerhardt, T. Editor Die philosophische Schriften (1890)
de Gandillac, Maurice Translator Encyclopédie (1966)
Barande, Ilse Translator Oeuvres complètes de Karl Abraham (1966)
Couturat, Louis Editor Opuscules et fragments inédits de Leibniz (1903)
Manheim, Ralph Translator Philosophie der Symbolischen Formen (1953)
Vaughan, Charles Edwyn Editor The Political Writings of Jean-Jacques Rousseau (1915)
Chaix-Ruy, Jules Translator Oeuvres choisies de Vico (1946)
Macquarrie, John Translator Being and Time (1962);Robinson, Edward Translator Being and Time (1962)
Gagnebin, Bernard Editor Dialogues (n.d.);Raymond, Marcel Editor Dialogues (n.d.)
David, Maxime Translator Dialogues sur la religion naturelle (1964)
Gibelin, Jean Translator Encyclopédie (1952)
Ruwet, Nicolas Translator Essais de linguistique générale (1963)
Kahn, Gilbert Translator Introduction à la métaphysique (1958)
Robert, Marthe Translator Journal (n.d.)
Camille, Georgette Translator L'Écriture chinoise considérée comme art poétique (1937)
Derrida, Jacques Translator L'origine de la géométrie (1962 A)
Derrida, Jacques Translator L'origine de la géométrie (1962 B)
Bianquis, Geneviève Translator La Naissance de la tragédie (1949)
Hyppolite, Jean Translator La Phénoménologie de l'esprit (1947)
Emile Chambry Translator La République (1956)
Hildenbrand, Hans Translator Le jeu comme symbole du monde (1960);Lindenberg, Alex Translator Le jeu comme symbole du monde (1960)
Gibelin, Jean Translator Leçons sur la philosophie de la religion (1832)
Bachelard, Suzanne Translator Logique formelle et logique transcendantale (1957 A)
Emile Chambry Translator Phèdre (1938)
Robin, Léon Translator Phèdre (1961)
Marc-Antoine Léonard de Malpeines Translator The Divine legation of Moses (1744)

kmcelwee commented 3 years ago

Questions

finding aid url: these almost certainly don't resolve anymore! can we get help from PUL to get catalog links for the same items?

Double checked, and get 200s for all finding aid links. These look fine to me? Here are a few:

http://findingaids.princeton.edu/collections/RBD1/c10455
http://findingaids.princeton.edu/collections/RBD1/c10433
http://findingaids.princeton.edu/collections/RBD1/c9157
http://findingaids.princeton.edu/collections/RBD1/c9354
http://findingaids.princeton.edu/collections/RBD1/c8361

Dimensions column is empty, work_uri only has one value. Were these never really used?
I've used the filename derrida-instance-data.csv/json. Should I call it book-data? Something else?
Proposed unknown date format: create a function that puts the datetime into YYYY-MM-DD format, substituting question marks for any part of the date that is unknown (e.g. ????-09-23, 1945-??-??).
I'm filtering by not in [None, '', []] to remove nulls and empty strings from the JSON. In the PR we merged, I just did a boolean (if ref[field]). There were no booleans in that CSV, but should I go back and fix that? Do we want empty strings removed as well or just nulls?

adapt or refer to logic in as_zotero_item method on Instance since the metadata we want to export should be similar

Not entirely sure how that method would help us here. Let me know if I'm missing something. We already removed it in #269. Here's a copy of the method: https://github.com/Princeton-CDH/derrida-django/pull/270/files

rlskoeser commented 3 years ago

I'm relieved the finding aid urls resolve. However, those are redirects and we should put the new urls into the dataset. They may not maintain the redirects indefinitely, and we want the data publication to be as durable as possible since we're not planning to touch this. It looks like we can just do a regex to get the new version, but let's make sure they all resolve if we do that.

I don't remember about dimensions, I guess it was never used! Agree we should drop it.

That filename doesn't sound ideal/obvious. What filename did we use for the zotero export? Can we reuse or adapt that?

Print date will never have unknown year. Please use YYYY, YYYY-MM, and YYYY-MM-DD. Sorry there isn't an existing method for this already!
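
A minimal sketch of that formatting, not the project's actual code. The function name `format_print_date` and the `month_known` / `day_known` parameters are hypothetical stand-ins for the "print date day/month known" flags on Instance. It assumes the year is always known and that an unknown month implies an unknown day.

```python
import datetime


def format_print_date(date, month_known=True, day_known=True):
    """Format a print date as YYYY, YYYY-MM, or YYYY-MM-DD.

    Hypothetical helper: the flag names are assumptions, not the
    actual model fields. The year is assumed to always be known.
    """
    if date is None:
        return None
    if not month_known:
        # without a month the day is meaningless, so export year only
        return "%04d" % date.year
    if not day_known:
        return "%04d-%02d" % (date.year, date.month)
    return "%04d-%02d-%02d" % (date.year, date.month, date.day)
```

For example, `format_print_date(datetime.date(1962, 5, 1), day_known=False)` gives `1962-05`.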

We should be consistent in our empty variable filtering. Can we put it in the base data export class somewhere and use the same logic everywhere? I'd prefer to omit empty strings.
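
One way the shared filtering could look, as a rough sketch only. `remove_empty` and `EMPTY_VALUES` are hypothetical names, and where this would live in the base export class is an assumption. The point is that it drops nulls, empty strings, and empty lists but keeps `False` and `0`, which a plain truthiness check (`if ref[field]`) would lose.

```python
# Values treated as "empty" for export purposes (assumed list,
# matching the `not in [None, '', []]` check discussed above).
EMPTY_VALUES = (None, "", [])


def remove_empty(record):
    """Return a copy of a record dict without empty values.

    Hypothetical helper; False and 0 are kept because they are
    meaningful values for boolean and numeric fields.
    """
    return {
        key: value
        for key, value in record.items()
        if value not in EMPTY_VALUES
    }
```

With this, a record such as `{"is_extant": False, "copy": "", "languages": []}` keeps only `is_extant`.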

We don't have to use the zotero method, but I think it would be good to compare it with your logic. I suspect we may be missing some things — it looks like there could be instance creators other than work authors (I wondered about that but it wasn't obvious when I glanced at the code); I think there are probably some others.

rlskoeser commented 3 years ago

Book data export looks good to me.

Checked both json and csv, and looked at a variety of record types — books, book sections, journal articles; also looked for variants that were marked as translations, have insertions, etc. Also checked records with multiple authors, contributors.
