Princeton-CDH / derrida-django

Derrida's Margins - Python/Django web application
https://derridas-margins.princeton.edu
Apache License 2.0

Generate data set readme and datapackage #277

Closed kmcelwee closed 2 years ago

kmcelwee commented 3 years ago

Data cleaning

Notes

kmcelwee commented 3 years ago

Questions

| URI | Title | Short Title |
| --- | --- | --- |
| https://derridas-margins.princeton.edu/titles/155/ | Lettre à M. de Saint-Germain | Lettre à M. de Saint-Germain du 26 février 1770 |
| https://derridas-margins.princeton.edu/titles/153/ | Lettre au prince de Würtemberg | Lettre au prince de Würtemberg du 10 novembre 1663 |
| https://derridas-margins.princeton.edu/titles/124/ | Oeuvres complètes | Oeuvres complètes de Karl Abraham |
| https://derridas-margins.princeton.edu/titles/158/ | Oeuvres Complètes, tome II | Oeuvres complètes de J.-J. Rousseau, vol. II |
| https://derridas-margins.princeton.edu/titles/177/ | Œuvres complètes, tome I | Oeuvres complètes de J.-J. Rousseau, vol. I |
| https://derridas-margins.princeton.edu/titles/230/ | Œuvres complètes, tome III | Oeuvres complètes de J.-J. Rousseau, vol. III |
| https://derridas-margins.princeton.edu/titles/199/ | Œuvres complètes, vol. VI | Oeuvres complètes de Franz Kafka |

| reference_id | value_count |
| --- | --- |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/26a/ | 3 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/57a/ | 3 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/70b/ | 3 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/395b/ | 3 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/50c/ | 3 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/29b/ | 3 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/27c/ | 2 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/39d/ | 2 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/29d/ | 2 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/29c/ | 2 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/129b/ | 2 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/57b/ | 2 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/197a/ | 2 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/85b/ | 2 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/395a/ | 2 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/132a/ | 2 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/404c/ | 2 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/52e/ | 2 |
| https://derridas-margins.princeton.edu/references/de-la-grammatologie/26b/ | 2 |

rlskoeser commented 3 years ago

@kmcelwee so glad you are flagging all of these!

  • Brackets on pages? Should I explain that?

Please remind me what this is! 😆

KM: Here are some examples of values in our "pages" column for annotations.csv 🥴: p. 45 [572] p.257 Back flyleaf 1 verso.


  • references.csv or de-la-grammatologie_references.csv?

Let's simplify and just call it references.

KM: ✅ added to todo list


Can we come up with a more meaningful name for the books? I think instances won't make sense outside our system.

KM: library.csv?


Yeah, drop it if it's redundant.

KM: ✅ added to todo list


  • Short titles aren't necessarily shorter (?!)

Weird! I wonder if we're generating them wrong. From the examples you gave, I think we either drop the field or relabel it so the field name is more accurate. But if all that information is included in other fields, let's drop it.

KM: I don't think the short_title field adds information. I'm going to add that to our todo list.
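For anyone who wants to see how widespread this is before dropping the field, here is a minimal sketch in plain Python. The first three rows are copied from the table earlier in this issue; the last row is hypothetical, and the keys `title` and `short_title` are assumed column names, not confirmed against the actual export.

```python
# Sample rows copied from the table earlier in this issue, plus one
# hypothetical row (the last) where the short title really is shorter.
# The keys "title" and "short_title" are assumed column names.
ROWS = [
    {"title": "Lettre à M. de Saint-Germain",
     "short_title": "Lettre à M. de Saint-Germain du 26 février 1770"},
    {"title": "Oeuvres complètes",
     "short_title": "Oeuvres complètes de Karl Abraham"},
    {"title": "Œuvres complètes, vol. VI",
     "short_title": "Oeuvres complètes de Franz Kafka"},
    {"title": "A hypothetical long title", "short_title": "Hypothetical"},
]


def not_shorter(rows, title_key="title", short_key="short_title"):
    """Return the rows whose short title is at least as long as the title."""
    return [r for r in rows if len(r[short_key]) >= len(r[title_key])]


flagged = not_shorter(ROWS)
print(f"{len(flagged)} of {len(ROWS)} short titles are not shorter")
```

The same function could be fed rows from `csv.DictReader` once the real column names are known.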


  • Reference IDs aren't unique. Should they be?

Wow! This is interesting, and a good thing to catch. I think the team must have entered multiple copies of the same reference when they weren't sure which copy of a book to link it to. I don't think we should try to change that now, but just reflect the research that was done. It seems like the simplest thing would be to just allow it not to be unique and make sure we note this in the documentation. (Combination of reference id & title id should be unique, though, if it matters).

I see that the search results for these links return only a single reference. That probably means the indexing wasn't configured properly to handle multiple versions of the same reference, which is probably one of the reasons this never got caught. I think it's ok to live with that; let's just try to document clearly & briefly what the situation is.

KM: Added documentation to-do above ✅
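A rough sketch of how the uniqueness claim above could be checked: count rows per reference ID, then per (reference ID, title ID) pair. The two reference IDs and their counts come from the table in this issue; the `title_id` values and both field names are assumptions made up for illustration.

```python
from collections import Counter

BASE = "https://derridas-margins.princeton.edu/references/de-la-grammatologie/"

# reference_id values are taken from this issue; title_id values are
# hypothetical placeholders, and both key names are assumed.
ROWS = [
    {"reference_id": BASE + "26a/", "title_id": "title-a"},
    {"reference_id": BASE + "26a/", "title_id": "title-b"},
    {"reference_id": BASE + "26a/", "title_id": "title-c"},
    {"reference_id": BASE + "27c/", "title_id": "title-a"},
    {"reference_id": BASE + "27c/", "title_id": "title-b"},
]


def duplicate_keys(rows, key_fields):
    """Count rows per key tuple and keep only keys seen more than once."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


dup_ids = duplicate_keys(ROWS, ["reference_id"])
dup_pairs = duplicate_keys(ROWS, ["reference_id", "title_id"])
print("repeated reference ids:", len(dup_ids))
print("repeated (reference, title) pairs:", len(dup_pairs))
```

If the second count is non-zero on the real export, the composite key is not unique either, and that would be worth noting in the documentation too.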


  • Two books are in annotations but not in instances

This is surprising to me, because team members were only supposed to document annotations that related to references in de la Grammatologie. I guess we should revise the queryset filter on the instance export to include them — seems like the simplest solution to me, but I'm open to suggestions.

KM: Added to todo list ✅
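Once the queryset filter is revised, a quick cross-check between the two exports might look like the sketch below. It is only an illustration: the sample rows are hypothetical, and the key names `instance_id` and `id` are assumptions about how annotations point at books.

```python
def missing_instances(annotations, instances, fk="instance_id", pk="id"):
    """Return sorted ids that annotations reference but instances lack."""
    known = {row[pk] for row in instances}
    referenced = {row[fk] for row in annotations}
    return sorted(referenced - known)


# Hypothetical sample data; real rows would come from the CSV exports.
INSTANCES = [{"id": "book-1"}, {"id": "book-2"}]
ANNOTATIONS = [
    {"instance_id": "book-1"},
    {"instance_id": "book-3"},
    {"instance_id": "book-4"},
    {"instance_id": "book-3"},
]

orphans = missing_instances(ANNOTATIONS, INSTANCES)
print("books referenced only in annotations:", orphans)
```

An empty list after the export change would confirm that every annotated book is included.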

kmcelwee commented 3 years ago

@rlskoeser I edited my own responses in your comment, because I thought it would be more clear. Let me know if I should never do this again haha 😄

rlskoeser commented 3 years ago

Wow, that is a pretty non-obvious way to reply! 😆 🤯 (I know threaded replies can get a bit much, but yeah, please don't do that in future 🙂 )

I like library.csv !

I'll have to look into the page numbers to be sure (and if it's not obvious and there isn't any project documentation about it, then maybe we won't be sure!). My guess is that brackets are to indicate the page number is not actually printed. Maybe make me an Asana task?

kmcelwee commented 3 years ago

@rlskoeser I think I've taken care of almost everything. I have some questions inline, and I'm sure you'll have comments. Here are my questions / takeaways from this PR. Looking forward to your feedback when I return!

kmcelwee commented 2 years ago

Further back-and-forth edits will take place with PRDS, but w/r/t development, this is done