CorrelAid / kn_stolpersteine_survey


Issues regarding the data scheme #14

Closed jstet closed 1 year ago

jstet commented 2 years ago
  1. "geburtsjahr" shows up twice

  2. "geburtsmonat" instead of "geburtsmonate" would be grammatically correct

  3. Address:
     a. We could separate the address into street and street number?
     b. We should highlight that street names should not be abbreviated (e.g. Seestraße instead of Seestr.)

  4. Family:
     a. "Verwandtschaftsbeziehung" or "Verwandtschaftsart" instead of "Verwanschaftsgrad"
     b. Use predefined relationship types (have you maybe already done that?) and add them to the scheme
     c. We could use gender-neutral names for relationship types, e.g. EhepartnerIn instead of separating Ehemann and Ehefrau, or Geschwister instead of Bruder and Schwester
     d. Maybe we could add an ID for every victim and then use it to refer to victims in the family attribute. The URL would be a nice unique identifier, for example (edit: nvm, there are URLs for families)

  5. add gender a) did we decide on using binary categories?

  6. write the variables with the first character in uppercase?

  7. academic title instead of doctor title? (although there are only doctor titles)

  8. Did we forget the stages of flight? See this document

  9. Stations: a) add type of station? b) "Ort" instead of "haftort", because more general.

  10. Adresse: a) add PLZ?

Sry I'm adding stuff while cleaning and transforming the scraped data, this list escalated quickly :S
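A minimal sketch of what point 3 could look like while cleaning the scraped data, assuming addresses arrive as a single string such as "Seestr. 12". The function names, the regular expression, and the abbreviation table are hypothetical and not part of the repository.

```python
import re
from typing import Optional, Tuple

# Hypothetical abbreviation table; extend it as more cases show up in the data.
ABBREVIATIONS = {"str.": "straße", "Str.": "Straße"}

# Street name, whitespace, then a house number with an optional letter suffix.
ADDRESS_RE = re.compile(r"^\s*(?P<street>.+?)\s+(?P<number>\d+\s*[a-zA-Z]?)\s*$")


def expand_street(street: str) -> str:
    """Replace a trailing abbreviation (e.g. 'Seestr.') with the full form."""
    for abbr, full in ABBREVIATIONS.items():
        if street.endswith(abbr):
            return street[: -len(abbr)] + full
    return street


def split_address(address: str) -> Tuple[str, Optional[str]]:
    """Split an address into (street, house number); number is None if absent."""
    match = ADDRESS_RE.match(address)
    if match is None:
        return expand_street(address.strip()), None
    return expand_street(match["street"]), match["number"].replace(" ", "")


print(split_address("Seestr. 12"))      # ('Seestraße', '12')
print(split_address("Obere Laube 3a"))  # ('Obere Laube', '3a')
```

Addresses without a trailing number are returned with `None` as the house number, so they can be reviewed by hand.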

vmfelso commented 2 years ago

Hey Jonas, a lot of separate stuff in this issue, I'll try not to miss anything ;) The good news is the data scheme was just written down by me by hand so nothing big to change. I'll try to list things in order and by subject now:

Typos, German mistakes etc: 1, 2, 4a, 6, 9b: sounds good, thank you -- I've fixed them in the README.md and jinja template [DONE]

Address questions: 3, 10: We can separate these out, I think we're getting them from a different data source -> separate issue

Rest of 4 (family membership): in the app we do have preset categories. Although they are gendered, I think it is fine to prefill the person but not the relationship if we don't know the gender. We uniquely identify people by a combination of URL + first name + last name. I think having a way to select who the relation is would be nice, and I will look into this -> separate issue
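The combination of URL + first name + last name mentioned here could be sketched as a composite key like this. The function name and the normalization steps (trimming whitespace, dropping a trailing slash, case-folding names) are assumptions for illustration, not what the app actually does.

```python
from typing import Tuple


def person_key(url: str, vorname: str, nachname: str) -> Tuple[str, str, str]:
    """Build a composite key from URL, first name, and last name.

    The normalization is an assumption: it makes keys robust against
    stray whitespace, a trailing slash, and differences in letter case.
    """
    return (
        url.strip().rstrip("/"),
        vorname.strip().casefold(),
        nachname.strip().casefold(),
    )


# Hypothetical example values.
key_a = person_key("https://example.org/familie-x/", " Anna", "Muster")
key_b = person_key("https://example.org/familie-x", "anna", "MUSTER ")
print(key_a == key_b)  # True
```

Because several family members can share one URL, the names are what keep the key distinct within a family.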

5: can add this. I believe we will keep it binary for historical reasons and because gender is mostly possible to identify [DONE, I wasn't sure how to word this in German but gave it a try.]

7: I don't really have enough context of what other academic titles (MSc, Diplom Ing?) there are, but if I remember correctly there was a nun (or other religious person) in the database with an extra title so maybe we could be more inclusive about that. What do you think @JStet ? [-> let's keep discussion here]

8: Yes, this is such a big one to miss! Thank you for noticing it. -> separate issue

9a: I guess the idea is we will know from our document, but I'm not sure how far we will get with that. What do you think @JStet ? [-> let's keep discussion here]

vmfelso commented 2 years ago

Ideas in call:

jstet commented 2 years ago

I will edit the README to correct the German spelling

jstet commented 2 years ago

5af12d0e3847c05d60de414689aa2cd911858d83 : edited data scheme in README to improve spelling

* Gender -> Geschlecht
* Strasse -> Straße
* Strassennummer -> Hausnummer
* Anderenamen -> Andere_Namen
* Durchort -> Durchgangsprt
* Tod_in_haft -> Tod_in_Gefangenschaft
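If already-collected records need to follow these renames, a mapping like the one below would be one way to do it. This is only a sketch: the `RENAMES` dict and `rename_fields` helper are hypothetical, and the names are copied as written in this comment.

```python
from typing import Any, Dict

# Old field name -> new field name, copied as written in the comment above.
RENAMES = {
    "Gender": "Geschlecht",
    "Strasse": "Straße",
    "Strassennummer": "Hausnummer",
    "Anderenamen": "Andere_Namen",
    "Durchort": "Durchgangsprt",
    "Tod_in_haft": "Tod_in_Gefangenschaft",
}


def rename_fields(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with renamed keys; unknown keys stay as-is."""
    return {RENAMES.get(key, key): value for key, value in record.items()}


# Hypothetical record for illustration.
old = {"Gender": "w", "Strasse": "Seestraße", "Vorname": "Anna"}
print(rename_fields(old))
# {'Geschlecht': 'w', 'Straße': 'Seestraße', 'Vorname': 'Anna'}
```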

Stationen:

jstet commented 2 years ago

Ideas in call:

* Academic titles to drop down (Prof Dr, Dr,  possibility of extending it)

Doktor -> Akademischer_Titel

["Dr.", "Prof."]
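A drop-down restricts the field to a fixed set of values, and the same restriction could be checked on the data side. The sketch below assumes the two values listed above; the function name and the choice to treat empty input as "no title" are hypothetical.

```python
from typing import Optional

# Allowed values for Akademischer_Titel, taken from the list above.
# Extend this tuple if further titles are agreed on.
ALLOWED_TITLES = ("Dr.", "Prof.")


def validate_title(value: Optional[str]) -> Optional[str]:
    """Return the cleaned title, None for 'no title', or raise on unknown values."""
    if value is None or value.strip() == "":
        return None
    title = value.strip()
    if title not in ALLOWED_TITLES:
        raise ValueError(f"Unknown academic title: {title!r}")
    return title


print(validate_title(" Dr. "))  # Dr.
print(validate_title(""))       # None
```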

jstet commented 2 years ago

24421e4e0e39b3333f6952a2b8364c9a0e715282

Geburtsdatum -> Geburtstag

5ebba9102d0bf3f89bfda51542c963aeca994ac3

Todesdatum -> Todestag

jstet commented 2 years ago

d1dd40d4cbdb66933b50d0f318a399d0c282c5de 52b188267bc41769d9388cfa8a1130ee6f19ee7e

Using booleans for "Überlebt" and "Erfolg" instead of ja/nein to keep it consistent
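For existing ja/nein values, the conversion could look roughly like this. It is a sketch: the helper name is hypothetical, and mapping empty values to `None` (unknown) is an assumption rather than something decided in this thread.

```python
from typing import Optional


def ja_nein_to_bool(value: object) -> Optional[bool]:
    """Convert 'ja'/'nein' (any case, surrounding whitespace allowed) to a boolean.

    Values that are already booleans pass through unchanged; None and
    empty strings become None. Anything else raises a ValueError.
    """
    if isinstance(value, bool):
        return value
    if value is None:
        return None
    text = str(value).strip().casefold()
    if text == "":
        return None
    if text == "ja":
        return True
    if text == "nein":
        return False
    raise ValueError(f"Expected 'ja' or 'nein', got {value!r}")


# Hypothetical record for illustration.
record = {"Überlebt": "Ja", "Erfolg": "nein "}
print({key: ja_nein_to_bool(value) for key, value in record.items()})
# {'Überlebt': True, 'Erfolg': False}
```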

jstet commented 1 year ago

Regarding Flucht: What does "AF" mean again?

vmfelso commented 1 year ago

Hm, good you asked because I'm slightly confused too. I think "AF" was Anfang and "ED" Ende, but only because of how AF is always before ED.

jstet commented 1 year ago

5182bb1b3bc4a7150049dc6377ff15b484a7ca66

Added "Verlegt" to data (date the Stolperstein was laid)

@vmfelso do you agree with this?

jstet commented 1 year ago

6e8fe6f00fd6385305c1a6049916c4d230dd9768 and f00157e6f249068b1132677c14afa52464a021b2

Added ID and Verlegejahr. Removed Hausnummer and Straße from data. Will prepopulate the database based on this. Mostly uses scraped values; I will add the rest manually.

@vmfelso do u agree?

vmfelso commented 1 year ago

Thank you for making this point about the numeric ID, right now for the backend we have a random ObjectId.

I do think we might have a problem since Verlegejahr, Hausnummer and Straße are not unique, for example in the case of a family with stones placed next to each other. Would it make sense to also add birth year (or, because of the edge case of twins, Geburtsname + Vorname)?
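The uniqueness concern can be checked directly against the data. The sketch below counts how often each candidate key occurs; the helper name and the two example records are hypothetical, chosen to mirror the family case described above.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def find_duplicate_keys(
    records: List[Dict[str, object]], key_fields: Sequence[str]
) -> List[Tuple[object, ...]]:
    """Return every key (tuple of field values) that occurs more than once."""
    counts = Counter(
        tuple(record.get(field) for field in key_fields) for record in records
    )
    return [key for key, n in counts.items() if n > 1]


# Hypothetical family: two stones laid in the same year at the same address.
family = [
    {"Verlegejahr": 2010, "Straße": "Seestraße", "Hausnummer": "1",
     "Geburtsname": "Muster", "Vorname": "Anna"},
    {"Verlegejahr": 2010, "Straße": "Seestraße", "Hausnummer": "1",
     "Geburtsname": "Muster", "Vorname": "Max"},
]

weak_key = ("Verlegejahr", "Straße", "Hausnummer")
strong_key = weak_key + ("Geburtsname", "Vorname")

print(find_duplicate_keys(family, weak_key))    # [(2010, 'Seestraße', '1')]
print(find_duplicate_keys(family, strong_key))  # []
```

Running the same check on the full dataset would show whether Geburtsname + Vorname is enough or whether further fields are needed.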

jstet commented 1 year ago

current version: https://github.com/CorrelAid/stolpersteine-kn/blob/main/data/final/cleaned.csv

jstet commented 1 year ago

how about we also assign IDs to the stationen?
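One possible scheme, purely as a sketch: derive each station ID from the person's ID plus the station's position in that person's list. The field name `Station_ID`, the `<person>-<index>` format, and the example values are all hypothetical.

```python
from typing import Any, Dict, List


def assign_station_ids(
    person_id: int, stationen: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Return copies of the stations with a 'Station_ID' of the form '<person>-<n>'.

    Numbering starts at 1 and follows the order of the input list.
    """
    return [
        {**station, "Station_ID": f"{person_id}-{index}"}
        for index, station in enumerate(stationen, start=1)
    ]


# Hypothetical stations for the person with ID 7.
print(assign_station_ids(7, [{"Ort": "Ort A"}, {"Ort": "Ort B"}]))
# [{'Ort': 'Ort A', 'Station_ID': '7-1'}, {'Ort': 'Ort B', 'Station_ID': '7-2'}]
```

IDs built this way change if stations are reordered or inserted later, so a separate counter or random ID may be preferable if the list is expected to change.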