C2DH / jdh-notebook

A collection of Jupyter notebooks for the Journal of Digital History
https://journalofdigitalhistory.org
GNU Affero General Public License v3.0

Technical review: The secularisation of future expectations in practise #149

Open inactinique opened 5 months ago

inactinique commented 5 months ago

PID: jYcpqGfdXPra

https://github.com/jdh-observer/jYcpqGfdXPra

eliselavy commented 4 months ago

@inactinique no hermeneutics? https://journalofdigitalhistory.org/en/notebook-viewer/JTJGcHJveHktZ2l0aHVidXNlcmNvbnRlbnQlMkZzcmJkdHMlMkZ0aGVfc2VjdWxhcmlzYXRpb25fb2ZfZnV0dXJlX2V4cGVjdGF0aW9ucyUyRm1haW4lMkZ0aGUtc2VjdWxhcmlzYXRpb24tb2YtZnV0dXJlLWV4cGVjdGF0aW9ucy1hbm9ueW1vdXMuaXB5bmI=?layer=narrative&lh=688&pidx=70&s=bib

eliselavy commented 3 months ago

Problem: nbconvert fails with a JSON error:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Test the file PCEEC_metadata.json with https://jsonlint.com/ (screenshot, 2024-04-11 at 14:17:58).
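
The same check can be run locally instead of through jsonlint.com. "Expecting value: line 1 column 1 (char 0)" is what Python's `json` module reports when it finds no JSON value at the very start of the input, for example because the file is empty or holds something other than JSON. Whether that is the cause here has not been confirmed. The sketch below uses a hypothetical helper, `check_json_file`, which is not part of nbconvert or of the notebook:

```python
import json
from pathlib import Path


def check_json_file(path):
    """Return (ok, message) describing whether the file at `path` parses as JSON."""
    raw = Path(path).read_bytes()
    if not raw.strip():
        # An empty (or whitespace-only) file is a common cause of
        # "Expecting value: line 1 column 1 (char 0)".
        return False, "file is empty"
    try:
        # "utf-8-sig" also accepts a file that starts with a byte-order mark.
        json.loads(raw.decode("utf-8-sig"))
    except UnicodeDecodeError as exc:
        return False, f"not valid UTF-8: {exc}"
    except json.JSONDecodeError as exc:
        return False, f"{exc.msg} at line {exc.lineno}, column {exc.colno}"
    return True, "valid JSON"
```

Calling `check_json_file("PCEEC_metadata.json")` from the directory that contains the file would then show whether the file is empty, badly encoded, or malformed at a specific position.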

eliselavy commented 3 months ago

    def search_corpus(corpus):
        queries = [
            "god", "jesus", "jhesus", "jhesu", "christ", "lord", "almighty",
            "father", "saint", "sant", "dicu", "dieu", "mary", "mery", "merry"
        ]
        hits = []
        window = 50
        for i, text in enumerate(texts):
            tokenpos = 0
            for line in text:
                tokens = line.split(" ")
                print(len(tokens))
                for t, token in enumerate(tokens):
                    for query in queries:
                        if query in token.lower():
                            index = tokenpos + t
                            hit = {
                                "query": query,
                                "letter_id": "SBC%s" % (i + 1),
                                "hit_id": "SBC%s.%s_%s" % (i + 1, index, index),
                                "left": " ".join(tokens[max(0, t - window):t]),
                                "hit": token,
                                "right": " ".join(tokens[t + 1:min(len(tokens), t + window)])
                            }
                            hits.append(hit)
                tokenpos += t
        return hits

    corpus = json.load("./script/PCEEC_metadata_from_author.json", encoding="utf-8")
    hits = search_corpus(corpus)

    df_hits = pd.DataFrame.from_dict(hits)
    df_hits_with_meta = pd.merge(df_hits, df_social, on="letter_id", how="inner")
    df_hits_with_meta.to_csv("dataset_paper.csv", sep="\t")


Reformatted as:

    import json
    import pandas as pd

    def search_corpus(corpus):
        queries = [
            "god", "jesus", "jhesus", "jhesu", "christ", "lord", "almighty",
            "father", "saint", "sant", "dicu", "dieu", "mary", "mery", "merry"
        ]
        hits = []
        window = 50  # size of the context window, in tokens
        for i, text in enumerate(corpus):
            tokenpos = 0  # running token offset within the current letter
            for line in text:
                tokens = line.split(" ")
                for t, token in enumerate(tokens):
                    for query in queries:
                        # Substring match, so "lord" also matches "lordship".
                        if query in token.lower():
                            index = tokenpos + t
                            hit = {
                                "query": query,
                                "letter_id": "SBC%s" % (i + 1),
                                "hit_id": "SBC%s.%s_%s" % (i + 1, index, index),
                                "left": " ".join(tokens[max(0, t - window):t]),
                                "hit": token,
                                "right": " ".join(tokens[t + 1:min(len(tokens), t + window)])
                            }
                            hits.append(hit)
                # Advance by the number of tokens on this line.
                tokenpos += len(tokens)
        return hits

    # json.load expects an open file object, not a path.
    with open("./script/PCEEC_metadata_from_author.json", "r", encoding="utf-8") as f:
        corpus = json.load(f)

    hits = search_corpus(corpus)

    # df_social is not defined in this snippet.
    df_hits = pd.DataFrame.from_dict(hits)
    df_hits_with_meta = pd.merge(df_hits, df_social, on="letter_id", how="inner")
    df_hits_with_meta.to_csv("dataset_paper.csv", sep="\t")
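
One of the changes between the two versions is `tokenpos += t` becoming `tokenpos += len(tokens)`. After the inner loop, `t` holds the index of the last token, which is one less than the number of tokens, so the first version falls one position further behind with every line. The sketch below isolates that bookkeeping. `line_offsets` and the sample lines are illustrative only and are not taken from the notebook or the corpus:

```python
def line_offsets(lines, buggy=False):
    """Return the running token offset at the start of each line."""
    offsets = []
    tokenpos = 0
    for line in lines:
        offsets.append(tokenpos)
        tokens = line.split(" ")
        if buggy:
            # First version: adds the index of the last token (len - 1).
            tokenpos += len(tokens) - 1
        else:
            # Reformatted version: adds the number of tokens on the line.
            tokenpos += len(tokens)
    return offsets


sample_lines = ["my lord god", "jesus christ", "amen"]  # made-up input
print(line_offsets(sample_lines))              # [0, 3, 5]
print(line_offsets(sample_lines, buggy=True))  # [0, 2, 3]
```

With the first version, "god" (line 1, index 2) and "jesus" (line 2, index 0) would both receive position 2, so two different hits in the same letter could end up with the same `hit_id`.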



- `df_hits` is empty.
- The `letter_id` values in the JSON file are not in the format "SB_C_%s" % (i + 1), but rather in the format "CROMWEL_00%".
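
A quick way to confirm the second point is to compare the identifiers the script generates with those in the metadata before merging. If the two sets do not overlap, an inner merge on `letter_id` returns no rows. The sketch below is only an illustration: `compare_ids` is a hypothetical helper, and the `CROMWEL_...` values are invented placeholders rather than real identifiers from the file:

```python
def compare_ids(generated_ids, metadata_ids):
    """Report which identifiers are shared and which appear on one side only."""
    generated, metadata = set(generated_ids), set(metadata_ids)
    return {
        "shared": sorted(generated & metadata),
        "only_generated": sorted(generated - metadata),
        "only_metadata": sorted(metadata - generated),
    }


# IDs built the way the bullet above describes, for two letters.
generated = ["SB_C_%s" % (i + 1) for i in range(2)]
# Hypothetical metadata IDs, invented for this example.
metadata = ["CROMWEL_001", "CROMWEL_002"]

report = compare_ids(generated, metadata)
print(report["shared"])  # []
```

In the notebook, the second argument would presumably be `df_social["letter_id"]`, assuming that frame carries the identifiers from the JSON file.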

> Author needs to be contacted

eliselavy commented 1 month ago

Anchor to be corrected (see screenshot, 2024-06-04 at 11:51:28). This revealed the bug tracked at https://github.com/C2DH/journal-of-digital-history/issues/639