Open inactinique opened 5 months ago
Problem: nbconvert fails with a JSON error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Tested the file PCEEC_metadata.json with https://jsonlint.com/
Also tested with the initial JSON file from the author
The two files have the same content, but not in the same order
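The "same content, different order" observation can be checked programmatically. The following is a minimal sketch, not the method used in this issue: `normalize` and `same_content` are hypothetical helper names, and the sample records are invented for illustration.

```python
import json


def normalize(value):
    """Return a copy of a parsed JSON value with every list sorted canonically."""
    if isinstance(value, dict):
        # Dict equality already ignores key order; only the values need normalizing.
        return {key: normalize(item) for key, item in value.items()}
    if isinstance(value, list):
        # Sort list items by their canonical JSON text so list order is ignored.
        return sorted(
            (normalize(item) for item in value),
            key=lambda item: json.dumps(item, sort_keys=True),
        )
    return value


def same_content(a, b):
    """True when two parsed JSON documents differ only in ordering."""
    return normalize(a) == normalize(b)


# Invented sample data, standing in for the two metadata files.
original = json.loads('[{"letter_id": "A", "year": 1500}, {"letter_id": "B", "year": 1501}]')
reordered = json.loads('[{"year": 1501, "letter_id": "B"}, {"letter_id": "A", "year": 1500}]')
changed = json.loads('[{"letter_id": "A", "year": 1500}, {"letter_id": "B", "year": 1502}]')

print(same_content(original, reordered))
print(same_content(original, changed))
```

Note that the snippet in this issue builds `letter_id` from the position of each record (`i + 1`), so a reordered file can still change the results even when the content is otherwise identical.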
The problem is in this snippet of code:
    with open('./script/PCEEC_metadata_from_author.json', 'w') as outfile:

    def search_corpus(corpus):
        queries = [
            "god", "jesus", "jhesus", "jhesu", "christ", "lord", "almighty",
            "father", "saint", "sant", "dicu", "dieu", "mary", "mery", "merry"
        ]
        hits = []
        window = 50
        for i, text in enumerate(texts):
            tokenpos = 0
            for line in text:
                tokens = line.split(" ")
                print(len(tokens))
                for t, token in enumerate(tokens):
                    for query in queries:
                        if query in token.lower():
                            index = tokenpos + t
                            hit = {
                                "query": query,
                                "letter_id": "SBC%s" % (i + 1),
                                "hit_id": "SBC%s.%s_%s" % (i + 1, index, index),
                                "left": " ".join(tokens[max(0, t - window):t]),
                                "hit": token,
                                "right": " ".join(tokens[t + 1:min(len(tokens), t + window)])
                            }
                            hits.append(hit)
                tokenpos += t
        return hits

    corpus = json.load("./script/PCEEC_metadata_from_author.json", encoding="utf-8")
    hits = search_corpus(corpus)

    df_hits = pd.DataFrame.from_dict(hits)
    df_hits_with_meta = pd.merge(df_hits, df_social, on="letter_id", how="inner")
    df_hits_with_meta.to_csv("dataset_paper.csv", sep="\t")
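One possible cause of the `JSONDecodeError` above, not confirmed in this issue, is that the JSON file is opened in `'w'` mode before it is read: write mode truncates the file, and parsing an empty file produces exactly this message. The sketch below reproduces that on a throwaway temporary file (the path and sample record are invented). It also shows that `json.load` expects an open file object rather than a path string.

```python
import json
import os
import tempfile

# Hypothetical temporary file, standing in for the metadata JSON.
path = os.path.join(tempfile.mkdtemp(), "metadata.json")

# Write a valid JSON document first.
with open(path, "w", encoding="utf-8") as outfile:
    json.dump([{"letter_id": "example"}], outfile)

# Re-opening in "w" mode truncates the file, even if nothing is written.
with open(path, "w", encoding="utf-8") as outfile:
    pass

# Reading the now-empty file fails with the error reported in this issue.
try:
    with open(path, "r", encoding="utf-8") as infile:
        json.load(infile)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = str(exc)

print(error_message)  # Expecting value: line 1 column 1 (char 0)

# json.load needs a file object; passing a path string fails differently.
try:
    json.load(path)
    path_error = None
except AttributeError as exc:
    path_error = type(exc).__name__

print(path_error)  # AttributeError
```

If this is what happens in the notebook, opening the file with `"r"` and passing the file object to `json.load` would avoid both failures.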
Reformatted as:
    import json
    import pandas as pd

    def search_corpus(corpus):
        queries = ["god", "jesus", "jhesus", "jhesu", "christ", "lord", "almighty",
                   "father", "saint", "sant", "dicu", "dieu", "mary", "mery", "merry"]
        hits = []
        window = 50
        for i, text in enumerate(corpus):
            tokenpos = 0
            for line in text:
                tokens = line.split(" ")
                for t, token in enumerate(tokens):
                    for query in queries:
                        if query in token.lower():
                            index = tokenpos + t
                            hit = {
                                "query": query,
                                "letter_id": "SBC%s" % (i + 1),
                                "hit_id": "SBC%s.%s_%s" % (i + 1, index, index),
                                "left": " ".join(tokens[max(0, t - window):t]),
                                "hit": token,
                                "right": " ".join(tokens[t + 1:min(len(tokens), t + window)])
                            }
                            hits.append(hit)
                tokenpos += len(tokens)
        return hits

    with open("./script/PCEEC_metadata_from_author.json", "r", encoding="utf-8") as f:
        corpus = json.load(f)

    hits = search_corpus(corpus)

    df_hits = pd.DataFrame.from_dict(hits)
    df_hits_with_meta = pd.merge(df_hits, df_social, on="letter_id", how="inner")
    df_hits_with_meta.to_csv("dataset_paper.csv", sep="\t")
- df_hits is empty
- The 'letter_id' in the JSON file is not in the format "SB_C_%s"%(i+1), but rather in the format "CROMWEL_00%".
> Author needs to be contacted
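The mismatch between the generated and stored `letter_id` values can be made visible before merging. This is a minimal sketch under stated assumptions: `key_overlap` is a hypothetical helper, and the metadata ids below are invented placeholders that only imitate the prefix reported above, not values taken from the real file.

```python
def key_overlap(hit_ids, metadata_ids):
    """Report which join keys are shared and which exist on one side only."""
    hit_ids, metadata_ids = set(hit_ids), set(metadata_ids)
    return {
        "shared": sorted(hit_ids & metadata_ids),
        "only_in_hits": sorted(hit_ids - metadata_ids),
        "only_in_metadata": sorted(metadata_ids - hit_ids),
    }


# Ids built the same way as in search_corpus.
generated_ids = ["SBC%s" % (i + 1) for i in range(3)]

# Hypothetical placeholder ids; the real values in the JSON file may differ.
metadata_ids = ["CROMWEL_001", "CROMWEL_002", "CROMWEL_003"]

report = key_overlap(generated_ids, metadata_ids)
print(report["shared"])        # []
print(report["only_in_hits"])  # ['SBC1', 'SBC2', 'SBC3']
```

With no shared keys, an inner merge on `letter_id` returns no rows. In pandas, a comparable check is `pd.merge(..., how="outer", indicator=True)`, which labels each row with the side it came from.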
Anchor to correct, see:
revealed to bug https://github.com/C2DH/journal-of-digital-history/issues/639
PID: jYcpqGfdXPra
https://github.com/jdh-observer/jYcpqGfdXPra