Quick Overview List your top 3 tasks or objectives for this week:
Task 1: Describe the types of studies/information that each of these platforms covers ("Central", "CINAHL", "ClinicalTrials.gov", "Embase", "Google Scholar", "MEDLINE", "PsycINFO", "PubMed", "Scopus", "Web of Science", "World Health Organization").
Task 2: Conduct an initial search for literature related to neuroarchitecture using these platforms and add any relevant papers to Zotero.
Task 3: Summarize how you generally approach reading a scientific article.
Task 4: Answer some questions about "Covidence".
Task 5: Create a brief summary of the key takeaways of the presentation and the "Covidence" platform.
Weekly Accomplishments
What tasks did you complete this week? (Include links) 1. How different platforms cover "neuroarchitecture":
a. "BioMed Central" is an open-access publisher in the field of biomedicine. I searched and obtained 27 papers on "neuroarchitecture".
b. "CINAHL" is a database of nursing and allied health literature that usually requires an "EBSCOhost" login, but GT has access to "CINAHL Complete". I searched and obtained 8 results on "neuroarchitecture".
c. "ClinicalTrials.gov" is a publicly accessible database of privately and publicly funded clinical studies conducted around the world. The related NLM literature search tools returned 237 papers on "neuroarchitecture" in "PubMed", 1,194 in the "PubMed Central" database, and one result in the "NLM Catalog".
d. "Embase" is an online literature database focusing on the fields of biomedicine and pharmacy, operated by "Elsevier". It is one of the world's largest biomedical databases and is particularly important in drug research and development, clinical medicine and drug safety research. However, I do not have access to "Embase", but I do have access to Science Direct, and I found 637 results for “neuroarchitecture” from 2008 to 2024.
e. "Google Scholar" is a free and open search engine for scholarly literature across many disciplines. A search for "neuroarchitecture" returned 484 results for 2024, 1,190 for 2023, and 2,690 for 2020.
f. "MEDLINE" is a bibliographic database of life science and biomedical information, and GT has access to it. Searching for "neuroarchitecture" from 2014 to 2024 returned 89 academic journal articles.
g. "PsycINFO" is a bibliographic database of psychology and behavioral science literature, also searchable via "EBSCOhost". With access granted, there are 38 results for "neuroarchitecture" in academic journals, 4 in books, and 2 in dissertations for 2014-2024.
h. "Scopus" is a large abstract and citation database covering peer-reviewed literature from various disciplines. It is part of "Elsevier", along with “Embase” and “Science Direct”. A search for “neuroarchitecture” returned 283 results.
i. "Web of Science" is a comprehensive academic literature database and citation indexing platform operated by "Clarivate Analytics" that covers multiple disciplines such as natural sciences, social sciences, humanities, arts, and engineering. A search was conducted and 291 results were found for “neuroarchitecture”.
j. The "WHO Data Collection" is a comprehensive collection of data and information on global health, a dataset and information integration platform that supports public health research, policymaking, and data-driven decision-making. No results were found for "neuroarchitecture". 2. Some answers about "Covidence":
a. Covidence is a cloud-based platform designed to simplify the literature review process, including title and abstract screening.
b. Why use it: first, it lets a team collaboratively manage unlimited references and PDFs, which greatly improves reading and summarizing efficiency; second, it lets you customize the screening process, full-text reviews, critical appraisals, and data extraction; last, it keeps a detailed record of each step of the review.
c. It includes:
1. Systematic literature reviews (SLRs): an SLR provides reliable results by organizing a body of evidence to answer specific research questions. First, relevant questions are searched; then the search results are screened against previously specified inclusion and exclusion criteria to determine whether they meet the requirements; finally, the results are summarized and presented.
2. Rapid reviews: quickly conduct a systematic review based on pre-defined questions, using pre-specified search and screening methods.
3. Umbrella reviews or overviews of reviews: combine multiple screenings or evaluations, more research questions, and comprehensive research results.
4. Scoping reviews: screen according to the scope of a topic rather than a specific, narrow question.
5. Literature reviews or narrative reviews.
d. Understand how specific the problem to be studied is. If the screening conditions are particularly clear, use an SLR; if the subject is broader, there is less literature, or the research types are very diverse, use a scoping review; if there are a large number of relevant studies, a literature review or narrative review may suffice. Rapid reviews are also very useful: they not only search and summarize quickly but also give us a preliminary understanding of the problem under study so that we can delve deeper.
3. Create a brief summary of the key takeaways of the "Covidence 101" presentation and the platform
First, accept the invitation from Dr. Haas and create your individual account. Start a new review by selecting the appropriate settings, particularly focusing on choosing extraction tools and setting the eligibility criteria under the review settings. Next, import your references (often selecting the "screen" option). Proceed with the "Title and Abstract Screening" step, where you choose to include or exclude references. Then move on to the full-text review and extraction stages, where you set up the Data Extraction and Quality Assessment templates. Finally, complete the process by exporting your results, including the PRISMA diagram. The platform's "Upcoming webinars" and "Knowledge Base" are helpful resources. 4. Summarize how to read a scientific article.
To read an academic paper efficiently: read actively and in three passes: first, skim the paper briefly; second, try to understand the content, focusing on the structure, charts, and data; third, read while taking notes and thinking critically. When quickly understanding a paper, focus on the abstract, the purpose and method overview in the introduction, and the conclusion to grasp the main findings of the research. Distinguishing between different types of articles (such as methods, review, commentary, etc.) can help clarify the focus of each type and further understand its intent. Maintain a friendly attitude toward every paper you read, while keeping a critical mindset and thinking about how to further develop research in your own field and expertise. When reading, analyze charts and tables one by one to understand the data sources, identify the experimental groups and variables, and pay attention to the legends and titles.
The reading order is: first read the abstract to quickly understand the research, results, and interpretation; then read the introduction to understand the motivation for the research; then quickly browse the charts and data in the results section; and finally read the discussion and conclusion to understand the significance of the results. The methods section can be left until last, to assess the replicability of the research and the appropriateness of the methods.
Before reading, you can use a scientific dictionary (www.AccessScience.com) to look up unfamiliar terms and keep a notebook (mind maps). 5. 203 papers were selected from various databases and uploaded to Zotero.
Tasks are still ongoing? Current status? 1. Papers related to "neuroarchitecture" indexed in the "Web of Science" database have not yet been added to Zotero. 2. I have only understood the basic concepts of "Covidence" and how to use it, and have not started practical training. 3. My work on "neuroarchitecture" is still at the conceptual level; starting next week, I will continue to learn from the 203 papers collected this week. Challenges and Learning
What was the biggest challenge you faced this week? How did you address it? 1. When learning about the various databases, it was awkward to visit each website separately to search for terms related to "neuroarchitecture". In the end, I found that going through the GT library and searching the corresponding database directly is simple, direct, and efficient. 2. I am unfamiliar with the functions of "Covidence" and lack practical training, so my knowledge stays at the conceptual level drawn from the "Upcoming webinars" and "Knowledge Base".
What's one new thing you learned or skill you improved? 1. Learned about the various platforms related to "neuroarchitecture", quickly collected relevant papers from the databases, and uploaded them to Zotero. 2. Learned how to read a scientific paper quickly and how to use "Covidence".
Did you attend any team meetings? Key takeaways? I attended the first Neuroarchitecture Zoom meeting on Wednesday (9/11), resolved my question about the "emotion mode", and learned how to read papers and how to write a literature review.
Reflection and Planning
Quick Overview List your top 3 tasks or objectives for this week:
Weekly Accomplishments
What tasks did you complete this week? (Include links) 1. Drew a mind map of "Exploring Methodological Approaches of Experimental Studies in the Field of Neuroarchitecture: A Systematic Review" (Mind Map week 4.pdf). 2. Used a system map to sort out my own interests and search terms (my interests of Neuroarchitecture.pdf).
Tasks are still ongoing?
After reading this week's literature review, I gradually understood the structure of a literature review and found that my focus within "neuroarchitecture" is the neuro-urban system. I also learned more about the application of data-driven methods in this field, such as EEG, fMRI, VR, and other bodily signals. Challenges and Learning
What was the biggest challenge you faced this week? How did you address it? 1. Refining the search around my topics of interest within "neuroarchitecture". 2. Coordinating team communication, summarizing search terms, and preparing the work-progress slides.
What's one new thing you learned or skill you improved?
I attended the Neuroarchitecture Zoom meeting on Wednesday, presented last week's progress and some results from different databases, and learned how to write a literature review. We also met in person on Thursday at 11 a.m. and joined a Teams meeting on Sunday at 3:30 p.m. to settle on search terms.
Reflection and Planning
Your progress this week? Read a new systematic literature review and selected search terms.
Main focus for next week?
1.After finalizing the keywords, divide the task of searching for papers based on content, quantity, and importance, and import them into Covidence. 2.In Covidence, establish the "Eligibility Criteria" to determine the factors for "exclude" and "include." 3.Conduct the screening of the searched papers through "title and abstract screening" -> "Full text review" -> "Extraction."
Tutorial for EEG and VR
Quick Overview List your top 3 tasks or objectives for this week:
Weekly Accomplishments
2. I learned about the visualization principles of text mining and the NLP methods presented in the article "Data Science for Building Energy Efficiency: A Comprehensive Text-Mining Driven Review of Scientific Literature". Phase 1, data collection and preprocessing: call the Elsevier API for 30,000 articles, including abstracts, titles, full texts, and keywords; clean the data with "NLTK" (conversion to lowercase, removal of unimportant stop words, stemming and lemmatization, and classification of keywords). Phase 2: train a Word2Vec model, analyze semantic similarity, and generate an .emb file. Phase 3: using the .emb file, predict the most similar keywords in each category (e.g., the data category, the data science category) based on the existing keywords in that category (a minimal sketch of this step follows after this list). Phase 4: generate a histogram of similarities.
1. The search terms were rechecked by Dr. Haas and Dr. Kastner. 2. By adapting the methods in the paper, I generated a semantic-similarity heatmap between the mental health/well-being terms and the urban terms.
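To make phase 3 concrete, here is a minimal sketch that loads a trained embedding file with gensim and suggests candidate keywords for one category from its existing seed words; the file name and seed list are placeholders, not the exact ones used in the paper.

```python
from gensim.models import KeyedVectors

# Load the embedding file produced in phase 2 (file name is a placeholder)
kv = KeyedVectors.load_word2vec_format("embedding_vec.emb", binary=False)

# Existing keywords for one category (hypothetical examples)
data_science_seeds = ["regression", "algorithm", "simulation"]
seeds_in_vocab = [w for w in data_science_seeds if w in kv]

# Phase 3: predict the words most similar to the category's existing keywords
if seeds_in_vocab:
    for word, similarity in kv.most_similar(positive=seeds_in_vocab, topn=10):
        print(f"{word}\t{similarity:.3f}")
```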
Challenges and Learning
What was the biggest challenge you faced this week? How did you address it? 1. Narrowing down the search terms as soon as possible. 2. Learning how to call the PubMed API and use NLTK to clean data.
What's one new thing you learned or skill you improved?
I attended the Neuroarchitecture Zoom meeting on Wednesday and presented last week's progress, including the individual and group search terms. I also arranged a Teams meeting on Saturday at 11:00 p.m. to reorganize the search terms.
Reflection and Planning
Your progress this week? Reorganized search terms and learned data visualization using NLP methods.
Main focus for next week?
1. The reorganized terms will be confirmed by Dr. Haas and Dr. Kastner. 2. Search the confirmed keywords in different databases using Boolean formulas. 3. Try calling the PubMed API and use NLTK to clean data in a Jupyter notebook.
How to call PubMed API.
Quick Overview List your top 3 tasks or objectives for this week:
Weekly Accomplishments
1. Scopus Boolean formula:
TITLE ( "mental health" OR "mental-health" OR "well-being" OR "well being" OR wellbeing ) AND TITLE ( "built environment" OR "building architecture" OR "architectural design" OR "building design" OR "environmental design" OR "urban architecture" OR "urban environment" OR "sustainable architecture" ) AND NOT DOCTYPE ( re ) AND NOT DOCTYPE ( "ma" ) AND PUBYEAR > 2013 AND PUBYEAR < 2025 AND ( LIMIT-TO ( SRCTYPE , "j" ) OR LIMIT-TO ( SRCTYPE , "p" ) ) AND ( LIMIT-TO ( PUBSTAGE , "final" ) OR LIMIT-TO ( PUBSTAGE , "aip" ) ) AND ( LIMIT-TO ( SUBJAREA , "SOCI" ) OR LIMIT-TO ( SUBJAREA , "ENGI" ) OR LIMIT-TO ( SUBJAREA , "ENVI" ) OR LIMIT-TO ( SUBJAREA , "COMP" ) OR LIMIT-TO ( SUBJAREA , "ENER" ) OR LIMIT-TO ( SUBJAREA , "ARTS" ) OR LIMIT-TO ( SUBJAREA , "PSYC" ) OR LIMIT-TO ( SUBJAREA , "HEAL" ) OR LIMIT-TO ( SUBJAREA , "MULT" ) OR LIMIT-TO ( SUBJAREA , "MATE" ) OR LIMIT-TO ( SUBJAREA , "NEUR" ) OR LIMIT-TO ( SUBJAREA , "MEDI" ) ) AND ( LIMIT-TO ( DOCTYPE , "ar" ) OR LIMIT-TO ( DOCTYPE , "cp" ) ) AND ( LIMIT-TO ( LANGUAGE , "English" ) )
Results: 145 results
2.Fix the search criteria of Google Scholar
Boolean Formula:
allintitle: ("mental health" OR "well being" OR "wellbeing") AND ( "built environment" OR "building architecture" OR "architectural design" OR "building design" OR "environmental design" OR "urban architecture" OR "urban environment" OR "sustainable architecture") -review -"meta analysis"
Results: 301 results
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from Bio import Entrez
# NCBI recommends providing a contact email for Entrez requests (placeholder address)
Entrez.email = "your.email@example.com"

def fetch_pubmed(query, retmax=100):
    handle = Entrez.esearch(db="pubmed", term=query, retmax=retmax)
    record = Entrez.read(handle)
    ids = record["IdList"]
    handle.close()
    return ids

def fetch_abstracts(id_list):
    abstracts = []
    keywords = ['health', 'urban', '']  # note: the empty string matches every abstract
    for pubmed_id in id_list:
        handle = Entrez.efetch(db="pubmed", id=pubmed_id, rettype="abstract", retmode="text")
        abstract = handle.read()
        handle.close()
        # keep the abstract only if it mentions at least one keyword
        if any(keyword in abstract.lower() for keyword in keywords):
            abstracts.append(abstract)
    return abstracts

pubmed_ids = fetch_pubmed("well being", retmax=100)
abstracts = fetch_abstracts(pubmed_ids)

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
def clean_text(text):
text = re.sub(r'[^\w\s]', '', text)
# Convert to lowercase
text = text.lower()
# Word Tokenization
words = word_tokenize(text)
# Remove stop words
words = [word for word in words if word not in stop_words]
return words
cleaned_abstracts = [clean_text(abstract) for abstract in abstracts]

from gensim.models import Word2Vec
model = Word2Vec(sentences=cleaned_abstracts, vector_size=100, window=5, min_count=1, workers=4)
vocabulary = model.wv.index_to_key
with open("vocabulary.txt", "w") as f:
    for word in vocabulary:
        f.write(f"{word}\n")
- Current status?
After importing the search results into Covidence, I began discussing[ eligibility criteria](https://app.covidence.org/reviews/408264/criteria) with my team members and scheduled an in-person discussion for 11:30 a.m. on October 10, 2024.
Plan to create a table that shares eligibility criteria.
**Challenges and Learning**
- What was the biggest challenge you faced this week? How did you address it?
1. How to generate the bookmark.json file based on the GitHub repository that Dr. Kastner provided for data visualization? The paper doesn't mention that.
![image](https://github.com/user-attachments/assets/0bf440c5-28ce-441b-95ba-08741b88ae82)
- What's one new thing you learned or skill you improved?
1. Use different boolean formulas with keywords to search different databases for literature – the basis of a literature review.
2. Learned to write code for data cleaning with natural language processing (NLTK) and for training the Word2Vec model.
- Did you attend any team meetings? Key takeaways?
I attended the Neuroarchitecture Zoom meeting on Wednesday and presented last week's progress. Everyone reported their own search results, and I learned that the main task for next week would be to develop eligibility criteria.
**Reflection and Planning**
- Your progress this week?
Searched Scopus using its Boolean formula. For data visualization, wrote the data-cleaning code using NLTK and the Word2Vec training model.
- Main focus for next week?
1. Make the Excel sheet for eligibility criteria.
2. Finish the API task and try to generate vocabulary.txt.
- Any resources you're looking for?
How to generate the bookmark.json file based on the GitHub repository that Dr. Kastner provided for data visualization.
Quick Overview List your top 3 tasks or objectives for this week:
Task 1: Complete the eligibility criteria on Google Sheets. Task 2: Search for some systematic reviews as references in order to finish the sheet. Task 3: Learn the writing standards for eligibility criteria in Covidence.
Weekly Accomplishments
What tasks did you complete this week? (Include links)
2. Looked for some examples of how to write criteria and arranged a meeting to walk the others through them. 1-s2.0-S0277953621005748-main.pdf 1-s2.0-S1353829217308869-main.pdf 1-s2.0-S2405844024137073-main.pdf fpsyt-12-758039.pdf Journal of Environmental and Public Health - 2020 - Núñez-González - Overview of Systematic Reviews of the Built.pdf
3. Learned the writing standards for eligibility criteria in Covidence. Learning link: https://support.covidence.org/help/how-to-create-and-manage-eligibility-criteria#population
Tasks are still ongoing? Continue the data cleaning to produce vocabulary.txt.
The PubMed API proved too difficult to use, so I switched to the Scopus API.
Challenges and Learning
What was the biggest challenge you faced this week? How did you address it? When calling the API, an error occurs: pybliometrics has not been initialized with a configuration file. Even though I have installed and configured "pybliometrics", I still cannot find the configuration file. This may be a permissions problem in the Docker virtual environment, so the API has to be initialized manually before the following steps (a possible fix is sketched after the traceback below).
`FileNotFoundError Traceback (most recent call last)
Cell In[9], line 29
26 return abstracts
28 # Fetch article EIDs based on search term
---> 29 scopus_eids = fetch_scopus("well being", count=100)
30 abstracts = fetch_abstracts(scopus_eids)
32 # Define stop words
Cell In[9], line 14, in fetch_scopus(query, count)
13 def fetch_scopus(query, count=100):
---> 14 s = ScopusSearch(query, subscriber=True, api_key=scopus_api_key)
15 return s.get_eids()[:count]
File /usr/local/lib/python3.9/dist-packages/pybliometrics/scopus/scopus_search.py:214, in ScopusSearch.__init__(self, query, refresh, view, verbose, download, integrity_fields, integrity_action, subscriber, unescape, **kwds)
212 self._query = query
213 self._view = view
--> 214 Search.__init__(self, query=query, api='ScopusSearch', size=size,
215 cursor=subscriber, download=download,
216 verbose=verbose, **kwds)
217 self.unescape = unescape
File /usr/local/lib/python3.9/dist-packages/pybliometrics/scopus/superclasses/search.py:61, in Search.__init__(self, query, api, size, cursor, download, verbose, **kwds)
59 stem = md5(name.encode('utf8')).hexdigest()
60 # Get cache file path
---> 61 config = get_config()
62 parent = Path(config.get('Directories', api))
63 self._cache_file_path = parent/self._view/stem
File /usr/local/lib/python3.9/dist-packages/pybliometrics/scopus/utils/startup.py:75, in get_config()
73 """Function to get the config parser."""
74 if not CONFIG:
---> 75 raise FileNotFoundError('No configuration file found.'
76 'Please initialize Pybliometrics with init().\n'
77 'For more information visit: '
78 'https://pybliometrics.readthedocs.io/en/stable/configuration.html')
79 return CONFIG
FileNotFoundError: No configuration file found.Please initialize Pybliometrics with init().
For more information visit: `https://pybliometrics.readthedocs.io/en/stable/configuration.html`
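A possible fix, assuming the pybliometrics 3.6+ layout that the traceback suggests (where `init()` is exposed from `pybliometrics.scopus`): run the initialization once before constructing `ScopusSearch`, so the configuration file and API key are created. If the Docker home directory is not writable, the configuration location may need to be redirected as described in the configuration docs linked in the error.

```python
from pybliometrics.scopus import init, ScopusSearch

# Create or load the pybliometrics configuration file.
# On the first run this prompts for the Scopus API key (assumed 3.6+ behavior).
init()

# Afterwards the search used in the notebook should work as before.
s = ScopusSearch('TITLE("well being")', subscriber=True)
print(len(s.get_eids()))
```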
What's one new thing you learned or skill you improved? Learned how to call the Scopus API and the rules for writing eligibility criteria in Covidence. Did you attend any team meetings? Key takeaways? I attended the Neuroarchitecture Zoom meeting on Wednesday and presented last week's progress. I arranged a meeting to show everyone how to fill in the criteria sheet.
Reflection and Planning
Main focus for next week?
1. Finish data cleaning. 2. Finalize the eligibility criteria in Covidence with the team, Dr. Kastner, and Dr. Haas.
Any resources you're looking for? How to generate the bookmark.json file based on the GitHub repository that Dr. Kastner provided for data visualization.
Quick Overview List your top 3 tasks or objectives for this week:
Task 1: Start title and abstract screening; screen at least 50 papers per person. Task 2: Finish using the Scopus API to clean the data and get vocabulary.txt. Task 3: Import abstracts for papers with missing abstracts.
Weekly Accomplishments
What tasks did you complete this week? (Include links) 1. Solved last week's problem with calling the Scopus API. Using the NLTK library, I search Scopus for articles; fetch abstracts, titles, keywords, and other data; define a complex query with a Boolean formula; clean stop words; and finally produce vocabulary.txt, which records the frequency of keywords across all articles matched by the Boolean formula.
import requests
import nltk
from collections import Counter
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re
# Scopus API key
scopus_api_key = "3a2d947799515fbd27b82a851d8bab0e"
# Function to search Scopus for articles
def fetch_scopus(query, count=200):
url = "https://api.elsevier.com/content/search/scopus"
headers = {
"X-ELS-APIKey": scopus_api_key,
"Accept": "application/json"
}
params = {
"query": query,
"count": count
}
response = requests.get(url, headers=headers, params=params)
if response.status_code == 200:
results = response.json()
eids = [entry['eid'] for entry in results.get("search-results", {}).get("entry", [])]
return eids
else:
print(f"Error: {response.status_code}, {response.text}")
return []
# Fetch abstracts, titles, keywords, full-text and other data from Scopus
def fetch_article_data(eids):
articles_data = []
keywords = ['mental health', 'mental-health', 'well-being', 'well being', 'wellbeing',
'built environment', 'building architecture', 'architectural design',
'building design', 'environmental design', 'urban architecture',
'urban environment', 'sustainable architecture']
# For each EID, request article details from the Scopus API
for eid in eids:
url = f"https://api.elsevier.com/content/abstract/eid/{eid}"
headers = {
"X-ELS-APIKey": scopus_api_key,
"Accept": "application/json"
}
response = requests.get(url, headers=headers)
if response.status_code == 200:
article = response.json()
coredata = article.get("abstracts-retrieval-response", {}).get("coredata", {})
# Retrieve title, abstract, and keywords
title = coredata.get("dc:title", "")
abstract = coredata.get("dc:description", "")
keywords_list = article.get("abstracts-retrieval-response", {}).get("authkeywords", None)
# Join keywords if they are not None
keywords_text = " ".join([kw for kw in keywords_list if kw is not None]) if keywords_list else ""
# Combine title, abstract, and keywords for processing
full_text = f"{title} {abstract} {keywords_text}"
# Filter based on the presence of keywords in the full text
if any(keyword in full_text.lower() for keyword in keywords):
articles_data.append(full_text)
else:
print(f"Error fetching data for {eid}: {response.status_code}, {response.text}")
return articles_data
# Define complex query with boolean formula
query = '("mental health" OR "mental-health" OR "well-being" OR "well being" OR wellbeing) AND ("built environment" OR "building architecture" OR "architectural design" OR "building design" OR "environmental design" OR "urban architecture" OR "urban environment" OR "sustainable architecture")'
# Fetch article EIDs based on the query
scopus_eids = fetch_scopus(query, count=200)
articles_data = fetch_article_data(scopus_eids)
# Define stop words
stop_words = set(stopwords.words('english'))
# Function to clean text
def clean_text(text):
# Remove special characters and punctuation
text = re.sub(r'[^\w\s]', '', text)
# Convert to lowercase
text = text.lower()
# Word Tokenization
words = word_tokenize(text)
# Remove stop words
words = [word for word in words if word not in stop_words]
return words
# Clean all articles data and flatten the list of words
all_words = [word for article in articles_data for word in clean_text(article) if article]
# Count word frequencies
word_counts = Counter(all_words)
# Save vocabulary with counts to file
with open("vocabulary.txt", "w") as f:
for word, count in sorted(word_counts.items(), key=lambda item: item[1], reverse=True):
f.write(f"{word} {count}\n")
2. Screened 100 titles and abstracts.
3. Solved the problem of titles and abstracts that had been imported without an abstract. Reason:
Tasks are still ongoing?
Challenges and Learning
What was the biggest challenge you faced this week? How did you address it? The first challenge was how to use the Scopus API and fix last week's problem. I searched GitHub for a solution for handling API responses and constructing and sending API requests: https://github.com/alistairwalsh/scopus
The second challenge was handling titles and abstracts that were imported without an abstract. Two strategies: search for the title or abstract on other platforms, export the record as a *.ris file, and re-import it into Covidence (a minimal example of such a record is sketched below); the original record without an abstract is then manually marked as a duplicate.
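To make the re-import concrete, a minimal RIS record carrying the recovered abstract could look like the sketch below; the title, abstract, year, and file name are placeholders. Covidence can import the resulting .ris file, after which the abstract-less original is marked as a duplicate.

```python
# Minimal RIS record with an abstract (placeholder metadata)
record = {
    "TY": "JOUR",                                        # reference type: journal article
    "TI": "Example article title",                       # title copied from the abstract-less record
    "AB": "Abstract text found on another platform.",    # the recovered abstract
    "PY": "2024",
    "ER": ""                                             # end of record
}

with open("missing_abstract.ris", "w", encoding="utf-8") as f:
    for tag, value in record.items():
        f.write(f"{tag}  - {value}\n")
```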
What's one new thing you learned or skill you improved? If I encounter a code problem I can't solve, I try searching for the answer on GitHub. I also learned some skills such as using "Ctrl+F" with the exclusion criteria to quickly filter titles and abstracts that should be excluded.
Reflection and Planning
Main focus for next week?
1. Finish training the Word2Vec model and generate bookmark.json. 2. Complete all title and abstract reviews.
Any resources you're looking for? How to use the Word2Vec model and generate bookmark.json.
Quick Overview List your top 3 tasks or objectives for this week:
Task 1: Complete all title and abstract reviews. Task 2: Use the Word2Vec model to map words to a vector space and capture the contextual relationships between words, so that semantically similar words are closer together in the vector space. Task 3: Extract the pre-trained embedding vector file and use a dimensionality reduction algorithm (PCA, t-SNE, UMAP) to reduce the high-dimensional data to a two-dimensional representation, and finally generate a JSON file (bookmark.json) containing the dimensionality reduction results for visualization or further analysis.
Weekly Accomplishments
What tasks did you complete this week? (Include links)
# Import the gensim library
import gensim
from gensim.models import Word2Vec
import logging
with open('vocabulary.txt', 'r', encoding='utf-8') as f:
    sentences = [line.strip().split() for line in f]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format('embedding_vec.emb', binary=False)
model = gensim.models.KeyedVectors.load_word2vec_format(fname='embedding_vec.emb', unicode_errors='strict')
2. Extract the embedding vector file and reduce high-dimensional data to a two-dimensional representation to get "_bookmark.json_". Reference: https://github.com/turbomaze/word2vecjson
import gensim
import json
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
import re

embedding_file = "embedding_vec.emb"
model = gensim.models.KeyedVectors.load_word2vec_format(embedding_file, binary=False)

with open("vocabulary.txt", "r", encoding="utf-8") as f:
    vocabulary = [re.sub(r'\s+\d+$', '', line.strip().lower()) for line in f if line.strip()]

embeddings = []
not_found_words = []
for word in vocabulary:
    if word in model:
        embeddings.append(model[word])
    else:
        not_found_words.append(word)

embeddings = np.array(embeddings)
print(f"Embeddings shape: {embeddings.shape}")

pca = PCA(n_components=2).fit_transform(embeddings)
tsne = TSNE(n_components=2, perplexity=5, learning_rate=1, n_iter=5000).fit_transform(embeddings)  # use a higher number of iterations
umap_result = umap.UMAP(n_neighbors=15, n_components=2).fit_transform(embeddings)

# Iterate over the words that were actually found so indices line up with the reduced arrays
found_words = [w for w in vocabulary if w in model]
projections = []
for i, word in enumerate(found_words):
    projections.append({
        "word": word,
        "pca-0": float(pca[i][0]),
        "pca-1": float(pca[i][1]),
        "tsne-0": float(tsne[i][0]),
        "tsne-1": float(tsne[i][1]),
        "umap-0": float(umap_result[i][0]),
        "umap-1": float(umap_result[i][1])
    })

bookmark_config = {
    "label": "State 0",
    "isSelected": True,
    "tSNEIteration": 5000,  # set a higher number of iterations
    "tSNEPerplexity": 5,
    "tSNELearningRate": 1,
    "tSNEis3d": False,  # 2D
    "umapIs3d": False,
    "umapNeighbors": 15,
    "projections": projections,
    "selectedProjection": "umap",
    "dataSetDimensions": [len(vocabulary), embeddings.shape[1]],
    "cameraDef": {
        "orthographic": True,
        "position": [0, 0, 10],  # an initial position more suitable for a 2D view
        "target": [0, 0, 0],
        "zoom": 1.2  # a zoom level more suitable for a 2D view
    },
    "selectedColorOptionName": "category",
    "selectedLabelOption": "word"
}

with open("bookmark.json", "w") as json_file:
    json.dump(bookmark_config, json_file, indent=4)
3. Screened all titles and abstracts.
![image](https://github.com/user-attachments/assets/f45a83f3-1786-4e95-929c-1c49076ee303)
Tasks are still ongoing?
- 1. After getting "voc.txt", "embedding_vec.emb", and "bookmark.json", use them for data analytics and visualization: N-gram graph networks and heatmaps, hierarchical clustering dendrograms and heatmaps, and a heatmap of cosine-similarity relationships across categories.
- 2. Make a workflow diagram of training the Word2Vec model and generating bookmark.json.
**Challenges and Learning**
What was the biggest challenge you faced this week? How did you address it?
First, training Word2Vec and, based on it, generating bookmark.json was a big challenge for me. I learned from some excellent examples on GitHub to write my own code, and finally succeeded.
Second, some questions came up while I was screening titles and abstracts:
- Are articles that mention both mental health and physical health included? Although we are conducting a literature review with mental health as the outcome, these types of articles also have mental health sections that can be referenced?
- For mental health/wellbeing outcomes, could terms like “job satisfaction”, “satisfaction”, “stress” be classified under wellbeing without specifically mentioning the term “wellbeing”?
- There are very narrow built environments discussed in some of the articles screened like mining camps, would those be included?
- Transportation (Reducing stress from commuting or causing traffic-related stress). Is it excluded?
What's one new thing you learned or skill you improved?
Learning from excellent code to fix my own code problems, which is a very important part of debugging and learning.
**Reflection and Planning**
Main focus for next week?
1. Perform data analytics and produce visualizations, such as N-gram graph networks and heatmaps.
2. Upload full-text PDFs for the titles and abstracts marked "yes".
Any resources you're looking for?
Not yet.
Quick Overview List your top 3 tasks or objectives for this week:
Task 1: Upload full-text PDFs for the titles and abstracts marked "yes". Task 2: Review the irrelevant studies one more time (especially those on older adults and children that don't give an age range; select "maybe"). Task 3: Finish the visualization of N-gram graph networks and heatmaps.
Weekly Accomplishments
What tasks did you complete this week? (Include links)
2. Reviewed the irrelevant studies one more time, mainly focusing on:
3. Confirmed the final number of full-text reviews we have to screen.
4. Finished the visualization of N-gram graph networks and heatmaps.
#import modules
from gensim.models import KeyedVectors
import gensim
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
import copy
import random
from matplotlib import cm
import matplotlib as mpl
import json
#data preprocessing
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import normalize, scale, MinMaxScaler
#clustering modules
from sklearn.datasets import make_blobs
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
%matplotlib inline
#Graph embeddings visualization
from adjustText import adjust_text
#https://stackoverflow.com/questions/19073683/matplotlib-overlapping-annotations-text
#utilities
from IPython.display import display, HTML
import warnings
warnings.filterwarnings('ignore')
#data directories
embedding_vector_file = "/home/changda/musi6204/vip/embedding_vec.emb"
vocabulary_file = "/home/changda/musi6204/vip/vocabulary.txt"
bookmark_file = "/home/changda/musi6204/vip/bookmark.json"
mental_health = [
'health','healthy','healthcare','healthartifact','hvachealth','healththe',
'wellbeing', 'selfwellbeing','mental', 'stress', 'sleep', 'depression',
'psychological', 'symptoms', 'anxiety', 'care', 'disorders',
'stroke', 'disease', 'cardiovascular','exercise','behavior','feelings',
'physical', 'activity', 'life', 'perceived', 'risk', 'needs',
'healthy', 'attention', 'experience', 'conditions', 'cognitive',
'comfort', 'positive', 'subjective', 'perception', 'satisfaction','fatigue','work','restoration',
'beneficial','vulnerable','deprivation','stressors','medical','engagement','support', 'issues'
]
population = ['adolescents', 'adults','participants', 'children', 'older', 'adulthood', 'caregivers',
'students', 'patient', 'individual', 'poor',
]
urban = [
'urban', 'city', 'cities', 'environment', 'environments', 'built',
'areas', 'spatial', 'design', 'planning', 'development', 'climate',
'community', 'communities', 'spaces', 'natural', 'residents',
'building', 'buildings', 'public', 'transport', 'streets',
'mobility', 'sustainable', 'residential', 'greenspace', 'parks',
'greenery', 'green', 'neighborhood', 'neighbourhood',
'infrastructure', 'urbanization', 'density', 'land', 'regions',
'noise', 'pollution', 'ventilation', 'local', 'housing', 'thermal',
'cycling', 'road', 'soundscapes', 'air', 'greenness', 'water',
'rural', 'wildlife', 'sound', 'equity', 'crime', 'justice',
'accessibility', 'flood', 'nature', 'trees', 'temperature',
'occupants', 'space', 'campus', 'layout', 'private', 'access',
'metro', 'citizens', 'walking', 'travel', 'safety'
]
data_science = [
'data', 'analysis', 'model', 'models', 'machine', 'learning',
'variables', 'algorithm', 'algorithms', 'simulation', 'mapping',
'regression', 'spatiotemporal', 'quantitative', 'database',
'geospatial', 'techniques', 'methods', 'approach', 'framework',
'significant', 'effects', 'results', 'sample', 'correlation',
'significantly', 'modeling', 'survey', 'questionnaire', 'p', 'ci',
'statistics', 'parameters', 'baseline', 'distribution', 'mean',
'system', 'function', 'evaluation', 'coefficients', 'probability',
'classification', 'processing', 'metrics',
'collected', 'assessment', 'interviews', 'characteristics',
'tools', 'measurement', 'experimental','intervention'
]
data = [
'data', 'results', 'study', 'survey', 'information', 'measurements',
'analysis', 'findings', 'statistics', 'number', 'sample',
'database', 'datasets', 'processing', 'time',
'years', 'participants', 'respondents', 'reports','outcomes',
'phase', 'phases', 'transition', 'transitions', 'stages', 'stage'
]
model = gensim.models.KeyedVectors.load_word2vec_format(fname=embedding_vector_file, unicode_errors='strict')
def create_color_bar(min_v = 0, max_v = 5.0, color_map = cm.Reds, bounds = range(6)):
fig, ax = plt.subplots(figsize=(6, 1))
fig.subplots_adjust(bottom=0.5)
norm = mpl.colors.Normalize(min_v, max_v)
if bounds!= None:
cb1 = mpl.colorbar.ColorbarBase(ax, cmap=color_map,
norm=norm,
boundaries = bounds,
orientation='horizontal')
else:
cb1 = mpl.colorbar.ColorbarBase(ax, cmap=color_map,
norm=norm,
orientation='horizontal')
cb1.set_label('relation_strength')
fig.show()
plt.show()
display(HTML("<hr>"))
Create a dictionary of pairs of relations
relations = {
"data_science-population":(data_science,population),
"data_science-data":(data_science, data),
"mental_health-data":(mental_health ,data ),
"mental_health-data_science":(mental_health , data_science),
"population-mental_health ":(population, mental_health ),
"urban-data":(urban ,data ),
"urban-data_science":(urban ,data_science ),
"urban-population":(urban ,population),
"urban-mental_health":(urban ,mental_health ),
}
N-gram Heatmap Visualization
G = nx.Graph()
display(HTML('''
<p>Please refer to Figures 8, 9, and 13 in the article</p>
<b>Colorbar indicating the range of relations strength:</b><br>
0 = very weak relation, 1.0 = very strong relation
'''))
create_color_bar()
for k2, rel in relations.items():
    table = {}
for ds in rel[1]:
graph = [G.add_edge(ds, node_, weight=1.0) for node_ in ds]
#Computing Word Pair Similarities
d = model.most_similar(ds, topn=1000000)
this_one = {}
for j in rel[0]:
ef_sim = []
for i in d:
if i[0] in j:
ef_sim.append(i[1])
# Check if ef_sim is not empty before taking the max
if ef_sim:
this_one[j] = np.max(ef_sim)
else:
this_one[j] = 0 # Assign a default value if ef_sim is empty
table[ds] = this_one
table2 = {}
for k, v in table.items():
ut = []
for d in rel[0]:
graph2 = [G.add_edge(d, node_2, weight=1.0) for node_2 in d]
try:
#Filtering and Graph Construction
if v[d] > 0.05:
ut.append(v[d])
G.add_edge(d, k, weight=v[d])
else:
ut.append(np.nan)
except Exception as e:
print(str(e))
ut.append(np.nan)
table2[k] = ut
table2["index"] = rel[0]
df = pd.DataFrame(table2)
df = df.set_index("index")
#Data Output
df.to_csv(f"./cc/{k2}.csv")
viridis = cm.get_cmap('Reds', 5)
#Sorting Data and Plotting Heatmaps
df_sorted = df.reindex(df.sum().sort_values(ascending=False).index, axis=1)
df_sorted['sum'] = df_sorted.sum(axis=1)
# Adjust the figure size to accommodate labels
plt.figure(figsize=(max(12, float(len(rel[0]) * 0.7)), max(12, float(len(rel[1]) * 0.7))))
df_sorted = df_sorted.sort_values(by="sum", ascending=False)[df_sorted.columns[:-1]]
# Create the heatmap with label adjustments
ax = sns.heatmap(
df_sorted,
cmap="Reds",
vmin=0.0,
vmax=0.5,
square=True,
annot=False,
linewidths=0.1,
linecolor="#fff",
cbar=False,
xticklabels=True, # Ensure x-axis labels are shown
yticklabels=True # Ensure y-axis labels are shown
)
# Adjust x and y labels for readability
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right', fontsize=8)
ax.set_yticklabels(ax.get_yticklabels(), fontsize=7)
# Set plot title
plt.title(k2)
plt.savefig(f"./hcc/{k2}_clusterd.svg")
# Display the heatmap
plt.show()
![data_science-data_clusterd](https://github.com/user-attachments/assets/6d1782dc-63f6-4978-8190-e6df629d62d2)
![data_science-population_clusterd](https://github.com/user-attachments/assets/fcb28f13-d727-4afb-90da-7c6895b38c7c)
![mental_health-data_clusterd](https://github.com/user-attachments/assets/a6e52788-ac30-4f25-9e41-69760f511928)
![mental_health-data_science_clusterd](https://github.com/user-attachments/assets/d7f1c20d-c163-4dc9-a859-c1732473b33c)
![population-mental_health _clusterd](https://github.com/user-attachments/assets/e8216b4f-77e3-4433-bdf2-92e7058606b9)
![urban-data_clusterd](https://github.com/user-attachments/assets/27336dff-60bd-4988-aae4-d893df0ed535)
![urban-data_science_clusterd](https://github.com/user-attachments/assets/82a6d19b-5a23-45a7-ab0c-8b41ed0d4975)
![urban-mental_health_clusterd](https://github.com/user-attachments/assets/f4fea6a9-92d3-4f8f-9497-7eb041ad6e33)
![urban-population_clusterd](https://github.com/user-attachments/assets/008f3234-83df-41d9-987e-85663b2b4e1f)
Tasks are still ongoing?
- Designing the workflow diagram for this part (the relationships between different categories and how to organize the different heatmap relationships).
**Challenges and Learning**
What was the biggest challenge you faced this week? How did you address it?
The biggest challenge was categorizing the keywords in "vocabulary.txt". Due to the huge amount of data, I ended up using the filtering function in Word to do the categorization (a programmatic alternative is sketched below). For the visualization part, I spent a long time debugging the sorting and the filtering-and-graph-construction parts, but with the help of the reference I was able to debug them all.
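As a possible programmatic alternative to filtering in Word, the sketch below assigns each vocabulary word to the category whose seed words it is most similar to on average, using the already-trained embeddings (`model` being the KeyedVectors object loaded earlier). The seed lists and the similarity threshold are illustrative, not the ones actually used.

```python
import numpy as np

# Illustrative seed lists; in practice these would be the category lists defined above
categories = {
    "mental_health": ["stress", "anxiety", "wellbeing"],
    "urban": ["city", "neighborhood", "greenspace"],
    "data_science": ["regression", "algorithm", "model"],
}

def assign_category(word, model, categories, threshold=0.3):
    """Return the best-matching category for `word`, or None if nothing is similar enough."""
    if word not in model:
        return None
    scores = {}
    for name, seeds in categories.items():
        seeds_in_vocab = [s for s in seeds if s in model]
        if not seeds_in_vocab:
            continue
        # Mean cosine similarity between the word and the category's seed words
        scores[name] = np.mean([model.similarity(word, s) for s in seeds_in_vocab])
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# Example usage with the KeyedVectors object loaded earlier as `model`:
# print(assign_category("depression", model, categories))
```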
What's one new thing you learned or skill you improved?
Learned how to construct a data analysis based on N-gram similarity and visualize it.
**Reflection and Planning**
Main focus for next week?
1. Finish the hierarchical agglomerative clustering (HAC) and correlation matrix visualizations.
2. Review at least 100 full texts.
Any resources you're looking for?
Not yet.
Quick Overview List your top 3 tasks or objectives for this week:
Task 1: Review at least 100 full texts. Task 2: Finish the hierarchical agglomerative clustering (HAC) and correlation matrix visualizations.
Weekly Accomplishments
What tasks did you complete this week? (Include links)
Question: How to deal with some articles that do not give detailed age information for the population or give a wide range of age information?
[ ] If the proportion of all participants aged 16–60 is greater than 50%, i.e., greater than the combined proportion of those under 16 and over 60, the study is included.
2. Finished the hierarchical agglomerative clustering (HAC) and correlation matrix visualizations. First, Word2Vec data preprocessing: extraction, classification, and storage of the word embedding vectors.
# Display table header
display(HTML('''<b> The following table shows each word, its corresponding 300-dimension vector, and its category</b>'''))
# Initialize dictionary for DataFrame
dfdict = {"word": []}
for i in range(1, 301):
dfdict[i] = []
# Process each word, checking if it exists in the model's vocabulary
for i in list(G.nodes):
dfdict["word"].append(i)
if i in model: # Check if the word exists in the model's vocabulary
thisvec = list(model.get_vector(i))
else:
thisvec = [0] * 300 # Use a zero vector if the word is not in the vocabulary
# Ensure vector is 300 dimensions
if len(thisvec) < 300:
thisvec += [0] * (300 - len(thisvec))
elif len(thisvec) > 300:
thisvec = thisvec[:300]
for ix in range(300):
dfdict[ix + 1].append(thisvec[ix])
# Convert dictionary to DataFrame and save as TSV
embd = pd.DataFrame(dfdict).set_index("word")
embd.to_csv("../embedding_matrix.tsv", sep="\t", index=False)
# Define function to assign category
def return_type(x):
categories = {
"mental_health": mental_health,
"urban": urban,
"data": data,
"population": population,
"data_science": data_science
}
for k, v in categories.items():
if x in v:
return k
return None # Return None if no category matches
# Apply category assignment
embd = embd.reset_index()
embd["category"] = embd["word"].apply(lambda x: return_type(x))
embd = embd.set_index("word")
Second, load data, perform clustering, visualize
#load all the relation dataframes
files = os.listdir("./cc/")
all_dfs = {}
for f in files:
if ".csv" in f:
all_dfs[f.replace(".csv", "")]=(pd.read_csv("./cc/"+f))
import matplotlib.patches as mpatches
# Display color bar for correlation values
display(HTML('''<b>Colorbar indicating the correlation value</b><br>
0.0 = weak correlation, 1.0 = strong correlation'''))
create_color_bar(min_v = 0.0, max_v =1.0)
# initialize the first graph figure
figN = 1
for key, df in all_dfs.items():
display(HTML(f"{figN}- <b>{key.split('-')[1]} category</b> hvc clustering."))
# Set index and prepare data for clustering
df = df.set_index("index")
wordsHere = list(df.columns)
words_vectors = [model.get_vector(i) for i in wordsHere]
words_vec_df = pd.DataFrame({"X": words_vectors, "y": wordsHere})
X = [list(x) for x in words_vec_df['X']]
labels_2 = list(words_vec_df['y'])
# Compute linkage and set color threshold for clusters
Z = sch.linkage(X, method="ward")
color_threshold = 0.8 * max(Z[:, 2]) # Adjust threshold for colors
# Plot dendrogram with colors
fig, ax = plt.subplots(figsize=(len(X)/5.0, 1))
dendrogram = sch.dendrogram(
Z,
color_threshold=color_threshold,
labels=labels_2,
orientation='top',
ax=ax,
leaf_rotation=90
)
ax.tick_params(axis='x', which='major', labelsize=10)
ax.tick_params(axis='y', which='major', labelsize=8)
# Generate legend based on unique colors in dendrogram
color_list = dendrogram['color_list']
unique_colors = list(set(color_list))
patches = [mpatches.Patch(color=color, label=f'Cluster {i+1}') for i, color in enumerate(unique_colors)]
plt.legend(handles=patches, bbox_to_anchor=(1.05, 1), loc='upper left', title="Clusters")
# Save dendrogram
plt.savefig(f"./hvc/{key}_clusterd_with_legend.svg", bbox_inches='tight')
plt.show()
# Calculate correlation matrix and plot heatmap
df = df.drop_duplicates()
dd = df[dendrogram["ivl"]] # Reorder columns according to the dendrogram order
corr = dd.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(len(X)/1.0, len(X)/1.5), dpi=300)
plt.title(f"{key.split('-')[1]} < -- > {key.split('-')[1]}")
sns.heatmap(
corr,
cmap="Reds",
vmin=0.0,
vmax=1.0,
cbar=False,
square=True,
linewidth=0.5
)
# Save reordered dataframe and correlation matrix
dd.to_csv(f"./cd/{key}.csv")
plt.savefig(f"./Correlation_matrix/{key}.svg")
plt.show()
figN += 1
Tasks are still ongoing? Design and organize the relationship between the hierarchical agglomerative clustering (HAC) and correlation matrix visualizations. Continue focusing on the full-text review.
Challenges and Learning
What was the biggest challenge you faced this week? How did you address it?
Z = sch.linkage(X, method="ward")
dd = df[dendrogram["ivl"]]
corr = dd.corr()
The leaf order is not fixed in the code, and without a consistent random seed each run produces a different correlation heatmap (a possible fix is sketched below).
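One way to make the result reproducible, assuming the variation comes from the Word2Vec training and from the dendrogram leaf order: fix the training seed (with a single worker thread) and request an optimally ordered linkage so the leaf order is canonical. `sentences`, `X`, and `labels_2` refer to the variables from the snippets above; the parameter values are illustrative.

```python
from gensim.models import Word2Vec
import scipy.cluster.hierarchy as sch

# Deterministic Word2Vec training: fixed seed, single worker
# (full determinism may also require a fixed PYTHONHASHSEED)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1,
                 workers=1, seed=42)

# Optimal leaf ordering gives a stable dendrogram order for a given distance matrix
Z = sch.linkage(X, method="ward", optimal_ordering=True)
dendrogram = sch.dendrogram(Z, labels=labels_2, no_plot=True)
leaf_order = dendrogram["ivl"]  # reuse this fixed order when reindexing the dataframe
```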
What's one new thing you learned or skill you improved?
Learned the methods of hierarchical agglomerative clustering (HAC) and correlation metrics.
**Reflection and Planning**
Main focus for next week?
1. Finish the full-text review.
2. Finish the cross-relation analysis between different categories.
Any resources you're looking for?
Not yet
Quick Overview List your top 3 tasks or objectives for this week:
Task 1: Solve the problem of inconsistent or unclear descriptions of the population age in the literature during the screening process. Task 2: Finish the full-text review. Task 3: Complete the visualization of cross-relations between different categories.
Weekly Accomplishments
What tasks did you complete this week? (Include links)
Problem Type:
[ ] - 1.Some of the literature does not specifically describe the age distribution of the study subjects.
[ ] - The age and demographic characteristics of the study subjects were not described in sufficient detail.
[ ] - The study subjects included very few children or elderly people, meaning that the majority of subjects were aged 16-60.
Solution:
[ ] - For documents in which a very small number of children or elderly people are included in the research subjects, the current solution is to retain these documents, but the criteria for handling such cases need to be further clarified.
[ ] - When writing the systematic literature review, point out that research in the field of architecture is relatively loose in describing the study population and standardizing data compared to fields such as psychology or psychiatry, and call for stricter standards in the field.
2. Completed the full-text review screening.
3. Completed the visualization of cross-relations between different categories.
display(HTML('''<b>Colorbar indicating the cross-relation values between each two categories</b><br>
<p>These values are extracted from the cosine similarity metric</p>
0.0 = weak correlation, 0.5 = strong correlation'''))
all_orders={}
create_color_bar(min_v = 0.0, max_v = 0.5, bounds=list(np.array(range(6))/10.))
figN = 1
for key, dd in all_dfs.items():
display(HTML(str(figN) + "- "+ key.replace("-", "< -- >")))
dd2 = pd.read_csv("./cd/"+key+".csv").set_index("index")
key_part = key.split("-")[0]
if key_part in all_orders:
dd2 = dd2.reindex(all_orders[key_part])
else:
print(f"Warning: '{key_part}' not found in all_orders, skipping reindexing.")
plt.figure(figsize=(len(dd2.columns)/1.0,len(dd2.index)/1.0), dpi=300)
plt.title(key.replace("-", "< -- >"))
sns.heatmap(dd2,
cmap="Reds",
square=True,
vmin=0.0,
vmax=0.5,
linewidth=0.5,
cbar=False
)
plt.savefig(f"./Cross_realtion_matrix/{key}_cross_rel.svg")
plt.show()
figN +=1
Tasks are still ongoing? Reorganize the relationship between the cross-correlation heatmaps and the HAC heatmaps as a final result.
Challenges and Learning
What was the biggest challenge you faced this week? How did you address it?
Explaining the difference between N-gram similarity and cross-correlation. In other words, if the N-gram analysis has already produced a similarity analysis between the different relations, why also analyze cross-correlation? The difference is that the N-gram measure is based on the word vectors themselves, i.e., frequency statistics and probability calculations, to derive similarity, while the cross-correlation is computed with a statistical method (cosine similarity); a small sketch of that computation follows.
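For reference, here is a small sketch of the cosine-similarity computation that underlies the cross-relation values, compared against gensim's built-in `similarity()`; the word pair is just an example and `model` is the KeyedVectors object loaded earlier.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Example word pair (assuming both words are in the vocabulary)
w1, w2 = "stress", "urban"
if w1 in model and w2 in model:
    manual = cosine_similarity(model.get_vector(w1), model.get_vector(w2))
    builtin = model.similarity(w1, w2)  # gensim computes the same quantity
    print(f"{w1} vs {w2}: manual={manual:.3f}, gensim={builtin:.3f}")
```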
What's one new thing you learned or skill you improved? Used the cosine similarity method to visualize the relationships between different category pairs, and understood the difference between cross-correlation and N-gram similarity.
Reflection and Planning
Main focus for next week?
1. Make a draft of the data extraction template and draft the structure of the "Introduction" and "Method" parts. 2. Complete the word embeddings 2D projection visualization.
Any resources you're looking for? Not yet
Quick Overview List your top 3 tasks or objectives for this week:
Task 1: Complete the draft of the data extraction part. Task 2: Complete the word embeddings 2D projection visualization. Task 3: Complete the structure of the "Introduction" and "Method" parts.
Weekly Accomplishments
What tasks did you complete this week? (Include links)
Before starting this, I read the Covidence documentation and watched the video on how to extract data from the selected full texts. Website link: https://support.covidence.org/help/data-extraction-1-overview
Template link: https://docs.google.com/document/d/1nDGSpnAZHablzTy4REJLtIr2p6__iqVWld0YLeiokEA/edit?usp=sharing
2. Drafted the structure of the "Introduction" and "Method" parts: https://gtvault-my.sharepoint.com/:w:/g/personal/cma326_gatech_edu/EQWfDwRri6hNlsvWU0vcVVcB9ARsGi1TS0aJ9umymZSE4A?e=UiZzhy
3. Completed the word embeddings 2D projection visualization.
# Load the bookmark JSON
with open("/home/changda/musi6204/vip/bookmark.json", 'r') as bm:
bookmark = json.loads(bm.read())
# Load the TSV file, assuming the first column is "word" and the second column is "category", followed by embedding values
embedding_columns = ["word", "category"] + [f"embedding_{i}" for i in range(1, 101)]
embd = pd.read_csv("labels.tsv", sep='\t', header=None, names=embedding_columns)
# Access projections directly under 'root'
word_pos = pd.DataFrame(bookmark['projections'])
word_pos["word"] = embd.reset_index()["word"] # Ensure index reset and assign words
word_pos = word_pos.set_index("word")
# Check for duplicate index values in both DataFrames
if embd["word"].duplicated().any():
print(f"Duplicate words found in embd: {embd['word'].duplicated().sum()} duplicates")
embd = embd.drop_duplicates(subset="word") # Drop duplicates in embd
if word_pos.index.duplicated().any():
print(f"Duplicate words found in word_pos: {word_pos.index.duplicated().sum()} duplicates")
word_pos = word_pos[~word_pos.index.duplicated(keep="first")] # Drop duplicates in word_pos
# Align indexes between embd and word_pos
common_index = pd.Index(embd["word"].astype(str)).intersection(word_pos.index)
# Update embd and word_pos to only keep common words
embd = embd.set_index("word").loc[common_index] # Keep only common words in embd
word_pos = word_pos.loc[common_index] # Keep only common words in word_pos
# Combine embeddings and projections
embd_with_word_pos = pd.concat([embd, word_pos], axis=1)
# Process data for visualization
# Generate default x and y for visualization if 'umap-0' and 'umap-1' don't exist
if "umap-0" not in embd_with_word_pos.columns or "umap-1" not in embd_with_word_pos.columns:
embd_with_word_pos["umap-0"] = embd_with_word_pos.iloc[:, 2] # Example: Use the third column as x
embd_with_word_pos["umap-1"] = embd_with_word_pos.iloc[:, 3] # Example: Use the fourth column as y
# Prepare data for visualization
xy = embd_with_word_pos[["umap-0", "umap-1", "category"]].rename({"umap-0": "x", "umap-1": "y"}, axis=1)
# Map categories to colors
category_to_color = {"mental_health": "#F15A22", "population": "#6DC8BF", "urban": "#B72467","data science": "#CBDB2A","data": "#FFA07A",}
xy["color_p"] = xy["category"].map(category_to_color).fillna("#000000") # Default to black if category is not in palette
# Resulting DataFrame `xy` is ready
print(xy.head())
G = nx.Graph()
for node in xy.index:
G.add_node(node)
for i, node1 in enumerate(xy.index):
for node2 in xy.index[i+1:]:
G.add_edge(node1, node2)
node_degrees = nx.degree(G)
nx.set_node_attributes(G, "degree", node_degrees)
graph_colors = xy[["color_p"]].to_dict()["color_p"]
xy["pos"] = xy.apply(lambda x : (x["x"], x["y"]), axis =1)
graph_pos = xy["pos"].to_dict()
- Extract information about the nodes in the data and label the nodes in the main vocabulary.
# Get the list of words in each category from the xy DataFrame
population = xy[xy["category"] == "population"].index.tolist()
data = xy[xy["category"] == "data"].index.tolist()
data_science = xy[xy["category"] == "data science"].index.tolist()
mental_health = xy[xy["category"] == "mental_health"].index.tolist()
urban = xy[xy["category"] == "urban"].index.tolist()
# List of words from all categories
all_main_words = set(population + data + data_science + mental_health + urban)
# Create the main_words list and check whether the nodes belong to the main vocabulary.
main_words = []
for k, v in G.degree():
if k in all_main_words:
main_words.append(k)
else:
main_words.append("")
import matplotlib.pyplot as plt
from adjustText import adjust_text
distance_threshold = 0.25
from scipy.spatial import distance_matrix  # needed for the pairwise distance computation
positions = xy[["x", "y"]].values
dist_matrix = distance_matrix(positions, positions)
edges = np.argwhere((dist_matrix < distance_threshold) & (dist_matrix > 0))
degree = dict(G.degree())
plt.figure(figsize=(20, 15), dpi=300)
nx.draw_networkx_nodes(
    G,
    pos=graph_pos,
    node_color=[v for k, v in graph_colors.items()],
    node_size=[v * 1.3 for k, v in degree.items()],
    alpha=0.6
)
for i, j in edges:
plt.plot(
[positions[i][0], positions[j][0]],
[positions[i][1], positions[j][1]],
"k-", alpha=0.2, linewidth=0.8
)
texts = []
for indx, i in enumerate(main_words[:]):
    if i != "":
        texts.append(plt.text(xy.reset_index().loc[indx]["x"], xy.reset_index().loc[indx]["y"], i))
adjust_text(texts, only_move={'texts': 'x'}, arrowprops=dict(arrowstyle="-", color='k', lw=0.7))
plt.savefig("./graph_embeddings_projection.svg")
plt.show()
![graph_embeddings_projection_with_edges](https://github.com/user-attachments/assets/f3dc6a29-d8b6-464e-876f-e5487b958176)
Visualization option 2: based on distance_threshold and Louvain partition
import community as community_louvain
distance_threshold = 0.25
positions = xy[["x", "y"]].values
dist_matrix = distance_matrix(positions, positions)
graph = nx.Graph()
graph.add_nodes_from(range(len(positions)))
edges = np.argwhere((dist_matrix < distance_threshold) & (dist_matrix > 0))
graph.add_edges_from(edges)
partition = community_louvain.best_partition(graph)
num_communities = max(partition.values()) + 1
color_map = cm.get_cmap('tab20', num_communities)
node_colors = [color_map(partition[node]) for node in graph.nodes()]
pos = nx.spring_layout(graph, k=0.1, seed=42)
plt.figure(figsize=(20, 15), dpi=300)
nx.draw_networkx_nodes(
    graph,
    pos=pos,
    node_color=node_colors,
    node_size=100,
    alpha=0.8
)
nx.draw_networkx_edges(
    graph,
    pos=pos,
    edgelist=edges,
    edge_color='grey',
    alpha=0.3,
    width=0.5
)
texts = []
for indx, word in enumerate(main_words):
    if word != "":
        x, y = pos[indx]
        texts.append(plt.text(x, y, word, fontsize=8))
adjust_text(texts, only_move={'texts': 'xy'}, arrowprops=dict(arrowstyle="-", color='k', lw=0.5))
plt.savefig("./graph_embeddings_projection_with_communities.svg")
plt.show()
![graph_embeddings_projection_with_communities](https://github.com/user-attachments/assets/0bb7fb71-24f4-4e01-926a-498267905c24)
Tasks are still ongoing?
Fill in more details of the data extraction template draft.
Improve more details of the structure of the "Introduction" and "Method" parts.
**Challenges and Learning**
What was the biggest challenge you faced this week? How did you address it?
As for the word-vector projection, I did not know how to do it, so I learned from the reference: https://github.com/ideas-lab-nus/data-science-bldg-energy-efficiency
What's one new thing you learned or skill you improved?
Understood how to use Word2Vec to visualize the projection.
Week1
Quick Overview List your top 3 tasks or objectives for this week:
Task 1: Read "Designing for human wellbeing: The integration of neuroarchitecture in design – A systematic review"
Task 2: Read "Neuroarchitecture: How the Perception of Our Surroundings Impacts the Brain"
Task 3: Summarize the contents of these two articles and draw two mind maps
Weekly Accomplishments
Challenges and Learning
What was the biggest challenge you faced this week? How did you address it? 1. Neuroscience terminology and gaps in my basic knowledge. 2. The relationships between vision, hearing, smell, taste, and the nerve groups in the brain take time to understand. 3. Quantitative emotion models or neural activity models.
What's one new thing you learned or skill you improved? 1. New things: mirror neuron responses are closely related to the actual interactive potential of a space; in architectural design, spatial design can enhance the experience by directing perceptual and motor responses. 2. The ability to quickly skim through an article and summarize its main points.
Did you attend any team meetings? Key takeaways? I attended the first Neuroarchitecture Zoom meeting on Wednesday and learned how to use GitHub, Zotero, and Dropbox for VIP purposes. I also learned more about related topics for future research by the "neuroarchitecture" sub-team.
Reflection and Planning