VIP-SMUR / 24Fa-Neuroarchitecture


Changda Weekly Notebook Entry #2

changdama opened this issue 2 months ago

changdama commented 2 months ago

Week1

Quick Overview List your top 3 tasks or objectives for this week:

Weekly Accomplishments

Challenges and Learning

Reflection and Planning

changdama commented 2 months ago

Week2

Quick Overview List your top 3 tasks or objectives for this week:

Weekly Accomplishments

Reflection and Planning

changdama commented 2 months ago

Week3

Quick Overview List your top 3 tasks or objectives for this week:

Weekly Accomplishments

  1. Read papers on neuroarchitecture related to urban design, such as neuro-urbanism (Neuroscience and the Cities_ Neurourbanism[#1140128]-2523630.pdf) and neuro-adaptive architecture, i.e. buildings and city design that respond to human emotions and cognitive states (1-s2.0-S2590051X24000315-main.pdf).

After this week's reading, I gradually came to understand the structure of a literature review and realized that my focus within "neuroarchitecture" is the neuro-urban system. I also learned more about the application of data-driven methods in this field, such as EEG, fMRI, VR, and other physiological signals.

Challenges and Learning

  1. Found that my focus within "neuroarchitecture" is the neuro-urban system, using evidence-based design related to the "circular economy".
  2. Learned more about the application of data-driven methods in this field, such as EEG, fMRI, VR, and other physiological signals, and how these technologies have been applied in this field over the past 10 years.

I attended the first Neuroarchitecture Zoom meeting on Wednesday and presented last week's progress, including results from different databases, and learned how to write a literature review. We also met in person on Thursday at 11am and joined a Teams meeting on Sunday at 3:30pm to finalize the search terms.

Reflection and Planning

1. After finalizing the keywords, divide the task of searching for papers based on content, quantity, and importance, and import them into Covidence.
2. In Covidence, establish the eligibility criteria to determine the factors for "exclude" and "include."
3. Conduct the screening of the searched papers through "title and abstract screening" -> "full text review" -> "extraction."

Tutorial for EEG and VR

changdama commented 1 month ago

Week4

Quick Overview List your top 3 tasks or objectives for this week:

Weekly Accomplishments

I learned about the text-mining and NLP visualization methods presented in the article "Data Science for Building Energy Efficiency: A Comprehensive Text-Mining Driven Review of Scientific Literature":

Phase 1 - data collection and preprocessing: call the Elsevier API for about 30,000 articles, including abstracts, titles, full texts, and keywords; clean the data with NLTK (conversion to lowercase, removal of stop words, stemming/lemmatization, and classification of keywords).
Phase 2 - train a Word2Vec model, analyze semantic similarity, and generate an .emb file.
Phase 3 - using the .emb file, predict the most similar keywords for each category (e.g., the data category, the data science category) based on the keywords already in that category.
Phase 4 - generate a histogram of similarities.
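A minimal sketch of the Phase 3 idea (the model filename and seed keywords are illustrative, not the paper's code):

```python
# Hedged sketch of Phase 3: suggest candidate keywords for a category.
# Assumes a gensim Word2Vec model trained as in Phase 2 (illustrative filename).
from gensim.models import Word2Vec

model = Word2Vec.load("phase2_model.model")
data_science_seed = ["regression", "algorithm", "clustering"]   # illustrative seed keywords

# Words whose embeddings are closest to the seed keywords of the category
candidates = model.wv.most_similar(positive=data_science_seed, topn=10)
print(candidates)
```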

1. The search terms were rechecked by Dr. Haas and Dr. Kastner.
2. By generalizing the methods in the paper, I generated a semantic similarity heat map between mental health/well-being terms and urban terms.
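A minimal sketch of how such a heat map can be produced with gensim (the model file and the two term lists are illustrative, not the actual lists used):

```python
# Hedged sketch: pairwise Word2Vec similarity heat map between two term lists.
import gensim
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

model = gensim.models.Word2Vec.load("corpus_word2vec.model")   # assumed model file
mental_terms = ["mental", "wellbeing", "health"]               # illustrative terms
urban_terms = ["urban", "city", "environment"]

# Cosine similarity between every (mental, urban) pair of words
sim = pd.DataFrame(
    [[model.wv.similarity(m, u) for u in urban_terms] for m in mental_terms],
    index=mental_terms, columns=urban_terms,
)
sns.heatmap(sim, cmap="Reds", annot=True)
plt.show()
```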

Challenges and Learning

  1. The ability to organize search terms and to analyze the structure of data visualizations.
  2. Arranged meetings for the team, sought classmates' opinions, and coordinated the distribution of the weekly reporting slides.

I attended the Neuroarchitecture Zoom meeting on Wednesday and presented last week's progress, including individual and group search terms. I also arranged a Teams meeting on Saturday at 11:00pm to reorganize the search terms.

Reflection and Planning

1. The reorganized terms will be confirmed by Dr. Haas and Dr. Kastner.
2. Search different databases with Boolean formulas built from the confirmed keywords.
3. Try to call the PubMed API and use NLTK to clean the data in a Jupyter Notebook.

How to call PubMed API.

changdama commented 1 month ago

Week5

Quick Overview List your top 3 tasks or objectives for this week:

Weekly Accomplishments

TITLE ( "mental health" OR "mental-health" OR "well-being" OR "well being" OR wellbeing ) AND TITLE ( "built environment" OR "building architecture" OR "architectural design" OR "building design" OR "environmental design" OR "urban architecture" OR "urban environment" OR "sustainable architecture" ) AND NOT DOCTYPE ( re ) AND NOT DOCTYPE ( "ma" ) AND PUBYEAR > 2013 AND PUBYEAR < 2025 AND PUBYEAR > 2013 AND PUBYEAR < 2025 AND PUBYEAR > 2013 AND PUBYEAR < 2025 AND ( LIMIT-TO ( SRCTYPE , "j" ) OR LIMIT-TO ( SRCTYPE , "p" ) ) AND ( LIMIT-TO ( PUBSTAGE , "final" ) OR LIMIT-TO ( PUBSTAGE , "aip" ) ) AND ( LIMIT-TO ( SUBJAREA , "SOCI" ) OR LIMIT-TO ( SUBJAREA , "ENGI" ) OR LIMIT-TO ( SUBJAREA , "ENVI" ) OR LIMIT-TO ( SUBJAREA , "COMP" ) OR LIMIT-TO ( SUBJAREA , "ENER" ) OR LIMIT-TO ( SUBJAREA , "ARTS" ) OR LIMIT-TO ( SUBJAREA , "PSYC" ) OR LIMIT-TO ( SUBJAREA , "HEAL" ) OR LIMIT-TO ( SUBJAREA , "MULT" ) OR LIMIT-TO ( SUBJAREA , "MATE" ) OR LIMIT-TO ( SUBJAREA , "NEUR" ) OR LIMIT-TO ( SUBJAREA , "MEDI" ) ) AND ( LIMIT-TO ( DOCTYPE , "ar" ) OR LIMIT-TO ( DOCTYPE , "cp" ) ) AND ( LIMIT-TO ( LANGUAGE , "English" ) )

Deep Link: https://www.scopus.com/results/results.uri?sort=plf-f&src=s&sid=7c305d16f6100afc7988aa507ba41f90&sot=a&sdt=cl&cluster=scosrctype%2C%22j%22%2Ct%2C%22p%22%2Ct%2Bscopubstage%2C%22final%22%2Ct%2C%22aip%22%2Ct%2Bscosubjabbr%2C%22SOCI%22%2Ct%2C%22ENGI%22%2Ct%2C%22ENVI%22%2Ct%2C%22COMP%22%2Ct%2C%22ENER%22%2Ct%2C%22ARTS%22%2Ct%2C%22PSYC%22%2Ct%2C%22HEAL%22%2Ct%2C%22MULT%22%2Ct%2C%22MATE%22%2Ct%2C%22NEUR%22%2Ct%2C%22MEDI%22%2Ct%2Bscosubtype%2C%22ar%22%2Ct%2C%22cp%22%2Ct%2Bscolang%2C%22English%22%2Ct&sessionSearchId=7c305d16f6100afc7988aa507ba41f90&origin=resultslist&editSaveSearch=&txGid=ddd06a9e6ade9181089d9eb99d95e953&limit=10&s=TITLE%28%22mental+health%22+OR+%22mental-health%22+OR+%22well-being%22+OR+%22well+being%22+OR+wellbeing%29+AND+TITLE%28%22built+environment%22+OR+%22building+architecture%22+OR+%22architectural+design%22+OR+%22building+design%22+OR+%22environmental+design%22+OR+%22urban+architecture%22+OR+%22urban+environment%22+OR+%22sustainable+architecture%22%29+AND+NOT+DOCTYPE%28re%29+AND+NOT+DOCTYPE%28%22ma%22%29+AND+PUBYEAR+%26gt%3B+2013+AND+PUBYEAR+%26lt%3B+2025+AND+PUBYEAR+%26gt%3B+2013+AND+PUBYEAR+%26lt%3B+2025&yearFrom=2014&yearTo=2024

Results: 145 results

2. Fixed the search criteria of the Google Scholar Boolean formula: allintitle: ("mental health" OR "well being" OR "wellbeing") AND ( "built environment" OR "building architecture" OR "architectural design" OR "building design" OR "environmental design" OR "urban architecture" OR "urban environment" OR "sustainable architecture") -review -"meta analysis"

Deep Link: https://scholar.google.com/scholar?as_vis=1&q=allintitle:+(%22mental+health%22+OR+%22well+being%22+OR+%22wellbeing%22)+AND+(+%22built+environment%22+OR+%22building+architecture%22+OR+%22architectural+design%22+OR+%22building+design%22+OR+%22environmental+design%22+OR+%22urban+architecture%22+OR+%22urban+environment%22+OR+%22sustainable+architecture%22)+-review+-%22meta+analysis%22&hl=en&as_sdt=0,11

Results: 301 results

  1. Continue to learn how to use E-utilities (the PubMed API): https://www.ncbi.nlm.nih.gov/books/NBK25500/
  2. Continue writing the skeleton code for the .txt and .emb training pipeline.
    
```python
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from Bio import Entrez
from gensim.models import Word2Vec

# PubMed API (E-utilities via Biopython's Entrez module)

# Search keywords from PubMed
def fetch_pubmed(query, retmax=100):
    handle = Entrez.esearch(db="pubmed", term=query, retmax=retmax)
    record = Entrez.read(handle)
    ids = record["IdList"]
    handle.close()
    return ids

# Get article abstracts, keeping only abstracts that mention one of the keywords
def fetch_abstracts(id_list):
    abstracts = []
    keywords = ['health', 'urban']
    for pubmed_id in id_list:
        handle = Entrez.efetch(db="pubmed", id=pubmed_id, rettype="abstract", retmode="text")
        abstract = handle.read()
        handle.close()
        if any(keyword in abstract.lower() for keyword in keywords):
            abstracts.append(abstract)
    return abstracts

pubmed_ids = fetch_pubmed("well being", retmax=100)
abstracts = fetch_abstracts(pubmed_ids)

stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Remove special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Word tokenization
    words = word_tokenize(text)
    # Remove stop words
    words = [word for word in words if word not in stop_words]
    return words

# Clean all abstracts
cleaned_abstracts = [clean_text(abstract) for abstract in abstracts]

# Word2Vec model training
model = Word2Vec(sentences=cleaned_abstracts, vector_size=100, window=5, min_count=1, workers=4)

# Get vocabulary.txt
vocabulary = model.wv.index_to_key

# Save
with open("vocabulary.txt", "w") as f:
    for word in vocabulary:
        f.write(f"{word}\n")
```



- Current status?
After importing the search results into Covidence, I began discussing the [eligibility criteria](https://app.covidence.org/reviews/408264/criteria) with my team members and scheduled an in-person discussion for 11:30 a.m. on October 10, 2024.
I plan to create a shared table of eligibility criteria.
**Challenges and Learning**

- What was the biggest challenge you faced this week? How did you address it?
1. How to get the bookmark.json file based on the GitHub repository Dr. Kastner provided for data visualization? The paper doesn't mention that.
![image](https://github.com/user-attachments/assets/0bf440c5-28ce-441b-95ba-08741b88ae82)

- What's one new thing you learned or skill you improved?

1. Used different Boolean formulas with keywords to search different databases for literature – the basis of a literature review.
2. Learned to write code for data cleaning with NLTK and for training a Word2Vec model.

- Did you attend any team meetings? Key takeaways?

I attended the Neuroarchitecture Zoom meeting on Wednesday and presented last week's progress. Everyone reported their own search results, and I learned that the main task for next week would be to develop the eligibility criteria.

**Reflection and Planning**

- Your progress this week?
Searched Scopus using the Scopus Boolean formula. For the data visualization, wrote the data-cleaning code using NLTK and trained the Word2Vec model.

- Main focus for next week?

1. Make the spreadsheet for the eligibility criteria.
2. Finish the API task and try to generate vocabulary.txt.

- Any resources you're looking for?

How to get the bookmark.json file based on the GitHub repository Dr. Kastner provided for data visualization.
changdama commented 3 weeks ago

Week6

Quick Overview List your top 3 tasks or objectives for this week:

Task 1: Complete the eligibility criteria on Google Sheets. Task 2: Find some systematic reviews to use as references for finishing the sheet. Task 3: Learn the writing standards for eligibility criteria in Covidence.

Weekly Accomplishments

What tasks did you complete this week? (Include links)

  1. Edited the original draft of the eligibility criteria and had the other students contribute to it; after they finished writing, I organized the sheet. Eligibility Criteria Sheet link: https://docs.google.com/spreadsheets/d/1R-Cdp6ns-DDpA-fVtdIXbcMVaq5Yah8qXCIaXxxj9yc/edit?usp=sharing

2. Looked for examples of how to write the criteria and arranged a meeting to explain them to the others. Reference examples: 1-s2.0-S0277953621005748-main.pdf, 1-s2.0-S1353829217308869-main.pdf, 1-s2.0-S2405844024137073-main.pdf, fpsyt-12-758039.pdf, Journal of Environmental and Public Health - 2020 - Núñez-González - Overview of Systematic Reviews of the Built.pdf

3. Learned the writing standards for eligibility criteria in Covidence. Learning link: https://support.covidence.org/help/how-to-create-and-manage-eligibility-criteria#population

Tasks are still ongoing? Continue the data cleaning to get vocabulary.txt.

Using the PubMed API proved too difficult, so I switched to the Scopus API to continue.

Challenges and Learning

What was the biggest challenge you faced this week? How did you address it? When calling the API, an error occurs: pybliometrics has not been initialized with a configuration file. Even though I installed and configured pybliometrics, it still cannot find the configuration file. This may be a permissions problem in the Docker virtual environment, so the API may need to be called manually for the following steps.

```
FileNotFoundError                         Traceback (most recent call last)
Cell In[9], line 29
     26     return abstracts
     28 # Fetch article EIDs based on search term
---> 29 scopus_eids = fetch_scopus("well being", count=100)
     30 abstracts = fetch_abstracts(scopus_eids)
     32 # Define stop words

Cell In[9], line 14, in fetch_scopus(query, count)
     13 def fetch_scopus(query, count=100):
---> 14     s = ScopusSearch(query, subscriber=True, api_key=scopus_api_key)
     15     return s.get_eids()[:count]

File /usr/local/lib/python3.9/dist-packages/pybliometrics/scopus/scopus_search.py:214, in ScopusSearch.__init__(self, query, refresh, view, verbose, download, integrity_fields, integrity_action, subscriber, unescape, **kwds)
    212 self._query = query
    213 self._view = view
--> 214 Search.__init__(self, query=query, api='ScopusSearch', size=size,
    215                 cursor=subscriber, download=download,
    216                 verbose=verbose, **kwds)
    217 self.unescape = unescape

File /usr/local/lib/python3.9/dist-packages/pybliometrics/scopus/superclasses/search.py:61, in Search.__init__(self, query, api, size, cursor, download, verbose, **kwds)
     59 stem = md5(name.encode('utf8')).hexdigest()
     60 # Get cache file path
---> 61 config = get_config()
     62 parent = Path(config.get('Directories', api))
     63 self._cache_file_path = parent/self._view/stem

File /usr/local/lib/python3.9/dist-packages/pybliometrics/scopus/utils/startup.py:75, in get_config()
     73 """Function to get the config parser."""
     74 if not CONFIG:
---> 75     raise FileNotFoundError('No configuration file found.'
     76                             'Please initialize Pybliometrics with init().\n'
     77                             'For more information visit: '
     78                             'https://pybliometrics.readthedocs.io/en/stable/configuration.html')
     79 return CONFIG

FileNotFoundError: No configuration file found. Please initialize Pybliometrics with init().
For more information visit: https://pybliometrics.readthedocs.io/en/stable/configuration.html
```
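A possible fix suggested by the error message itself (untested in the Docker container, and the exact entry point depends on the installed pybliometrics version):

```python
# Hedged: run once so pybliometrics can create its configuration file (API key, directories),
# as the traceback above suggests. Recent pybliometrics releases expose init() for this.
import pybliometrics
pybliometrics.scopus.init()
```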

What's one new thing you learned or skill you improved? Learned how to call the Scopus API and the rules for writing eligibility criteria in Covidence.

Did you attend any team meetings? Key takeaways? I attended the Neuroarchitecture Zoom meeting on Wednesday and presented last week's progress. I also arranged a meeting to show everyone how to fill in the criteria sheet.

Reflection and Planning

Main focus for next week?

1. Finish the data cleaning. 2. Finalize the eligibility criteria in Covidence with the team, Dr. Kastner, and Dr. Haas.

Any resources you're looking for? How to get the bookmark.json file based on the GitHub repository Dr. Kastner provided for data visualization.

changdama commented 2 days ago

Week7

Quick Overview List your top 3 tasks or objectives for this week:

Task 1: Start title and abstract screening; screen at least 50 papers per person. Task 2: Finish using the Scopus API to clean the data and get vocabulary.txt. Task 3: Import abstracts for papers with missing abstracts.

Weekly Accomplishments

What tasks did you complete this week? (Include links)

1. Solved last week's problem with the Scopus API: using the NLTK library, I search Scopus for articles, fetch abstracts, titles, keywords, and other data, define a complex query with a Boolean formula, clean out stop words, and finally produce vocabulary.txt, which records the frequency of keywords across all articles returned by the Boolean query and the number of articles.

import requests
import nltk
from collections import Counter
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

# Scopus API key
scopus_api_key = "3a2d947799515fbd27b82a851d8bab0e"

# Function to search Scopus for articles
def fetch_scopus(query, count=200):
    url = "https://api.elsevier.com/content/search/scopus"
    headers = {
        "X-ELS-APIKey": scopus_api_key,
        "Accept": "application/json"
    }
    params = {
        "query": query,
        "count": count
    }
    response = requests.get(url, headers=headers, params=params)
    if response.status_code == 200:
        results = response.json()
        eids = [entry['eid'] for entry in results.get("search-results", {}).get("entry", [])]
        return eids
    else:
        print(f"Error: {response.status_code}, {response.text}")
        return []

# Fetch abstracts, titles, keywords, and other data from Scopus
def fetch_article_data(eids):
    articles_data = []
    keywords = ['mental health', 'mental-health', 'well-being', 'well being', 'wellbeing', 
                'built environment', 'building architecture', 'architectural design', 
                'building design', 'environmental design', 'urban architecture', 
                'urban environment', 'sustainable architecture']
    # For each EID, request article details from the Scopus API
    for eid in eids:
        url = f"https://api.elsevier.com/content/abstract/eid/{eid}"
        headers = {
            "X-ELS-APIKey": scopus_api_key,
            "Accept": "application/json"
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            article = response.json()
            coredata = article.get("abstracts-retrieval-response", {}).get("coredata", {})

            # Retrieve title, abstract, and keywords
            title = coredata.get("dc:title", "")
            abstract = coredata.get("dc:description", "")
            keywords_list = article.get("abstracts-retrieval-response", {}).get("authkeywords", None)

            # Join keywords if they are not None
            keywords_text = " ".join([kw for kw in keywords_list if kw is not None]) if keywords_list else ""

            # Combine title, abstract, and keywords for processing
            full_text = f"{title} {abstract} {keywords_text}"

            # Filter based on the presence of keywords in the full text
            if any(keyword in full_text.lower() for keyword in keywords):
                articles_data.append(full_text)
        else:
            print(f"Error fetching data for {eid}: {response.status_code}, {response.text}")
    return articles_data

# Define complex query with boolean formula
query = '("mental health" OR "mental-health" OR "well-being" OR "well being" OR wellbeing) AND ("built environment" OR "building architecture" OR "architectural design" OR "building design" OR "environmental design" OR "urban architecture" OR "urban environment" OR "sustainable architecture")'

# Fetch article EIDs based on the query
scopus_eids = fetch_scopus(query, count=200)
articles_data = fetch_article_data(scopus_eids)

# Define stop words
stop_words = set(stopwords.words('english'))

# Function to clean text
def clean_text(text):
    # Remove special characters and punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    # Word Tokenization
    words = word_tokenize(text)
    # Remove stop words
    words = [word for word in words if word not in stop_words]
    return words

# Clean all articles data and flatten the list of words
all_words = [word for article in articles_data for word in clean_text(article) if article]

# Count word frequencies
word_counts = Counter(all_words)

# Save vocabulary with counts to file
with open("vocabulary.txt", "w") as f:
    for word, count in sorted(word_counts.items(), key=lambda item: item[1], reverse=True):
        f.write(f"{word} {count}\n")

vocabulary.txt

2. Screened 100 titles and abstracts.

3. Solved the problem of titles and abstracts that were imported into Covidence without an abstract (see Challenges below).

Tasks are still ongoing?

Challenges and Learning

What was the biggest challenge you faced this week? How did you address it? The first challenge was how to use the Scopus API and fix last week's problem. I searched GitHub for a solution for handling API responses and constructing and sending API requests: https://github.com/alistairwalsh/scopus

The second challenge was handling titles and abstracts that were imported without an abstract. Two strategies:

1. Search for the title on other platforms, export the record as a *.ris file, and re-import it into Covidence.
2. Manually mark the original title without an abstract as a duplicate.

What's one new thing you learned or skill you improved? If I encounter a code problem I can't solve, searching GitHub for the answer often helps. I also picked up small skills such as using "Ctrl+F" with the exclusion criteria to quickly filter titles and abstracts that have to be excluded.

Reflection and Planning

Main focus for next week?

1. Finish training the Word2Vec model and generate bookmark.json. 2. Complete all title and abstract reviews.

Any resources you're looking for? How to use the Word2Vec model and generate bookmark.json.

changdama commented 2 days ago

Week8

Quick Overview List your top 3 tasks or objectives for this week:

Task 1: Complete all title and abstract reviews. Task 2: Use the Word2Vec model to map words into a vector space and capture the contextual relationships between words, so that semantically similar words are closer together in the vector space. Task 3: Extract the pre-trained embedding vector file and use dimensionality reduction algorithms (PCA, t-SNE, UMAP) to reduce the high-dimensional data to a two-dimensional representation, and finally generate a JSON file (bookmark.json) containing the results for visualization or further analysis.

Weekly Accomplishments

What tasks did you complete this week? (Include links)

  1. Trained the Word2Vec model to obtain embedding_vec.emb. Reference: https://github.com/danielfrg/word2vec
    
```python
# Import the gensim library
import gensim
from gensim.models import Word2Vec
import logging

# Input vocabulary.txt file
# Assume each line contains a word or phrase; words within a line are separated by spaces.
with open('vocabulary.txt', 'r', encoding='utf-8') as f:
    sentences = [line.strip().split() for line in f]

# Training the Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Output embedding_vec.emb
model.wv.save_word2vec_format('embedding_vec.emb', binary=False)

# Load the Word2Vec model back
model = gensim.models.KeyedVectors.load_word2vec_format(fname='embedding_vec.emb', unicode_errors='strict')
```


2. Extracted the embedding vector file and reduced the high-dimensional data to a two-dimensional representation to get bookmark.json. Reference: https://github.com/turbomaze/word2vecjson

```python
import gensim
import json
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
import re

# 1. Input embedding_vec.emb
embedding_file = "embedding_vec.emb"
model = gensim.models.KeyedVectors.load_word2vec_format(embedding_file, binary=False)

# 2. Load the vocabulary from vocabulary.txt, extract only the word parts,
#    and remove additional information such as the counts
with open("vocabulary.txt", "r", encoding="utf-8") as f:
    vocabulary = [re.sub(r'\s+\d+$', '', line.strip().lower()) for line in f if line.strip()]

# 3. Get the embedding vector for each word
embeddings = []
found_words = []
not_found_words = []
for word in vocabulary:
    if word in model:
        embeddings.append(model[word])
        found_words.append(word)
    else:
        not_found_words.append(word)

# Convert to a numpy array and check the shape
embeddings = np.array(embeddings)
print(f"Embeddings shape: {embeddings.shape}")

# 4. Dimensionality reduction to 2D using PCA, t-SNE, and UMAP
pca = PCA(n_components=2).fit_transform(embeddings)
tsne = TSNE(n_components=2, perplexity=5, learning_rate=1, n_iter=5000).fit_transform(embeddings)  # higher number of iterations
umap_result = umap.UMAP(n_neighbors=15, n_components=2).fit_transform(embeddings)

# 5. Generate a projections list and convert all values to standard float type
#    (iterate over found_words so the indices line up with the reduced embeddings)
projections = []
for i, word in enumerate(found_words):
    projections.append({
        "word": word,
        "pca-0": float(pca[i][0]),   "pca-1": float(pca[i][1]),
        "tsne-0": float(tsne[i][0]), "tsne-1": float(tsne[i][1]),
        "umap-0": float(umap_result[i][0]), "umap-1": float(umap_result[i][1]),
    })

# 6. Define the bookmark.json configuration
bookmark_config = {
    "label": "State 0", "isSelected": True,
    "tSNEIteration": 5000,   # higher number of iterations
    "tSNEPerplexity": 5, "tSNELearningRate": 1,
    "tSNEis3d": False,       # 2D
    "umapIs3d": False, "umapNeighbors": 15,
    "projections": projections,
    "selectedProjection": "umap",
    "dataSetDimensions": [len(vocabulary), embeddings.shape[1]],
    "cameraDef": {"orthographic": True,
                  "position": [0, 0, 10],   # initial position suited to a 2D view
                  "target": [0, 0, 0],
                  "zoom": 1.2},             # zoom suited to a 2D view
    "selectedColorOptionName": "category",
    "selectedLabelOption": "word",
}

# 7. Output bookmark.json
with open("bookmark.json", "w") as json_file:
    json.dump(bookmark_config, json_file, indent=4)
```



3. Screened all titles and abstracts.
![image](https://github.com/user-attachments/assets/f45a83f3-1786-4e95-929c-1c49076ee303)

Tasks are still ongoing?
- 1. After getting vocabulary.txt, embedding_vec.emb, and bookmark.json, use them for data analytics and visualization: N-gram graph networks and heatmaps, hierarchical clustering dendrograms and heatmaps, and a heatmap of cosine similarity relationships across categories.

- 2. Make a workflow diagram of training the Word2Vec model and generating bookmark.json.

**Challenges and Learning**

What was the biggest challenge you faced this week? How did you address it?
First, training Word2Vec and, based on it, generating bookmark.json was a big challenge for me. I learned from some excellent examples on GitHub to write my own code, and finally succeeded.

Second, some questions came up while I was screening the titles and abstracts:
- Are articles that mention both mental health and physical health included? Although we are conducting a literature review with mental health as the outcome, these articles also have mental health sections that could be referenced.
- For mental health/wellbeing outcomes, could terms like "job satisfaction", "satisfaction", or "stress" be classified under wellbeing without the term "wellbeing" being mentioned explicitly?
- Some of the screened articles discuss very narrow built environments, such as mining camps; would those be included?
- Transportation (reducing commuting stress or causing traffic-related stress): is it excluded?

What's one new thing you learned or skill you improved?
Learning from excellent existing code to fix my own code problems, which is an important part of debugging and learning.

**Reflection and Planning**

Main focus for next week?

1. Run the data analytics and produce the visualizations, such as N-gram graph networks and heatmaps.
2. Upload full-text PDFs for the titles and abstracts marked "yes".

Any resources you're looking for?
Not yet.
changdama commented 9 hours ago

Week9

Quick Overview List your top 3 tasks or objectives for this week:

Task 1: Upload full-text PDFs for the titles and abstracts marked "yes". Task 2: Review the studies marked irrelevant one more time (especially those on older adults and children that don't give an age range; select "maybe"). Task 3: Finish the visualization of N-gram graph networks and heatmaps.

Weekly Accomplishments

What tasks did you complete this week? (Include links)

  1. Uploaded full-text PDFs for the titles and abstracts marked "yes", although 3 of the papers were difficult to find.

2. Reviewed the studies marked irrelevant one more time, mainly focusing on studies of older adults and children that do not give an age range.

3. Confirmed the final number of full-text reviews we have to screen.

4. Finished the visualization of the N-gram graph networks and heatmaps.

#import modules

from gensim.models import KeyedVectors
import gensim
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
import copy
import random
from matplotlib import cm
import matplotlib as mpl
import json

#data preprocessing
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import normalize, scale, MinMaxScaler

#clustering modules
from sklearn.datasets import make_blobs
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN

%matplotlib inline

#Graph embeddings visualization 
from adjustText import adjust_text
#https://stackoverflow.com/questions/19073683/matplotlib-overlapping-annotations-text

#utilities
from IPython.display import display, HTML
import warnings
warnings.filterwarnings('ignore')
#data directories
embedding_vector_file = "/home/changda/musi6204/vip/embedding_vec.emb"
vocabulary_file = "/home/changda/musi6204/vip/vocabulary.txt"
bookmark_file = "/home/changda/musi6204/vip/bookmark.json"
mental_health = [
    'health','healthy','healthcare','healthartifact','hvachealth','healththe',
    'wellbeing', 'selfwellbeing','mental', 'stress', 'sleep', 'depression',
    'psychological', 'symptoms', 'anxiety', 'care', 'disorders',
    'stroke', 'disease', 'cardiovascular','exercise','behavior','feelings',
    'physical', 'activity', 'life', 'perceived', 'risk', 'needs',
    'healthy', 'attention', 'experience', 'conditions', 'cognitive',
    'comfort', 'positive', 'subjective', 'perception', 'satisfaction','fatigue','work','restoration',
    'beneficial','vulnerable','deprivation','stressors','medical','engagement','support', 'issues'
] 
population = ['adolescents', 'adults','participants', 'children', 'older', 'adulthood', 'caregivers',
              'students', 'patient', 'individual',  'poor',
       ]

urban = [
    'urban', 'city', 'cities', 'environment', 'environments', 'built',
    'areas', 'spatial', 'design', 'planning', 'development', 'climate',
    'community', 'communities', 'spaces', 'natural', 'residents',
    'building', 'buildings', 'public', 'transport', 'streets',
    'mobility', 'sustainable', 'residential', 'greenspace', 'parks',
    'greenery', 'green', 'neighborhood', 'neighbourhood',
    'infrastructure', 'urbanization', 'density', 'land', 'regions',
    'noise', 'pollution', 'ventilation', 'local', 'housing', 'thermal',
    'cycling', 'road', 'soundscapes', 'air', 'greenness', 'water',
    'rural', 'wildlife', 'sound', 'equity', 'crime', 'justice',
    'accessibility', 'flood', 'nature', 'trees', 'temperature',
    'occupants', 'space', 'campus', 'layout', 'private', 'access',
    'metro', 'citizens', 'walking', 'travel', 'safety'
]

data_science = [
    'data', 'analysis', 'model', 'models', 'machine', 'learning',
    'variables', 'algorithm', 'algorithms', 'simulation', 'mapping',
    'regression', 'spatiotemporal', 'quantitative', 'database',
    'geospatial', 'techniques', 'methods', 'approach', 'framework',
    'significant', 'effects', 'results', 'sample', 'correlation',
    'significantly', 'modeling', 'survey', 'questionnaire', 'p', 'ci',
    'statistics', 'parameters', 'baseline', 'distribution', 'mean',
    'system', 'function', 'evaluation', 'coefficients', 'probability',
    'classification', 'processing', 'metrics',
    'collected', 'assessment', 'interviews', 'characteristics',
    'tools', 'measurement', 'experimental','intervention'
]

data = [
    'data', 'results', 'study', 'survey', 'information', 'measurements',
    'analysis', 'findings', 'statistics', 'number', 'sample',
    'database', 'datasets', 'processing',  'time',
    'years', 'participants', 'respondents', 'reports','outcomes',
     'phase', 'phases', 'transition', 'transitions', 'stages', 'stage'
]
model = gensim.models.KeyedVectors.load_word2vec_format(fname=embedding_vector_file, unicode_errors='strict')
def create_color_bar(min_v = 0, max_v = 5.0, color_map = cm.Reds, bounds = range(6)):
    fig, ax = plt.subplots(figsize=(6, 1))
    fig.subplots_adjust(bottom=0.5)

    norm = mpl.colors.Normalize(min_v, max_v)
    if bounds!= None:
        cb1 = mpl.colorbar.ColorbarBase(ax, cmap=color_map,
                                        norm=norm,
                                        boundaries = bounds,
                                        orientation='horizontal')
    else:
        cb1 = mpl.colorbar.ColorbarBase(ax, cmap=color_map,
                                        norm=norm,
                                        orientation='horizontal')

    cb1.set_label('relation_strength')
    fig.show()
    plt.show()
    display(HTML("<hr>"))


N-gram Heatmap Visualization
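The loop below assumes a relations dictionary that is not included in the snippet above. A rough reconstruction (the key names follow the output figure file names; the row/column orientation is a guess):

```python
# Hedged reconstruction (not in the original snippet): map each heatmap name to a
# (row_words, column_words) pair of the category lists defined above.
relations = {
    "urban-mental_health":        (urban, mental_health),
    "urban-population":           (urban, population),
    "urban-data":                 (urban, data),
    "urban-data_science":         (urban, data_science),
    "mental_health-data":         (mental_health, data),
    "mental_health-data_science": (mental_health, data_science),
    "population-mental_health":   (population, mental_health),
    "data_science-data":          (data_science, data),
    "data_science-population":    (data_science, population),
}
```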

```python
# 1. Initialize the network graph and show the colour bar
G = nx.Graph()
display(HTML('''
Usability ranked heatmap plots using N-gram method
<p>Please refer to Figures 8, 9, and 13 in the article</p>
<b>Colorbar indicating the range of relations strength:</b><br>
0 = very weak relation, 1.0 = very strong relation
'''))
create_color_bar()

# Iterating through the relationship data
for k2, rel in relations.items():
    table = {}

    for ds in rel[1]:
        graph = [G.add_edge(ds, node_, weight=1.0) for node_ in ds]
        # Computing word pair similarities
        d = model.most_similar(ds, topn=1000000)
        this_one = {}

        # Building the word similarity table
        for j in rel[0]:
            ef_sim = []
            for i in d:
                if i[0] in j:
                    ef_sim.append(i[1])
            # Check if ef_sim is not empty before taking the max
            if ef_sim:
                this_one[j] = np.max(ef_sim)
            else:
                this_one[j] = 0  # Assign a default value if ef_sim is empty
        table[ds] = this_one

    table2 = {}
    for k, v in table.items():
        ut = []
        for d in rel[0]:
            graph2 = [G.add_edge(d, node_2, weight=1.0) for node_2 in d]
            try:
                # Filtering and graph construction
                if v[d] > 0.05:
                    ut.append(v[d])
                    G.add_edge(d, k, weight=v[d])
                else:
                    ut.append(np.nan)
            except Exception as e:
                print(str(e))
                ut.append(np.nan)
        table2[k] = ut

    table2["index"] = rel[0]

    df = pd.DataFrame(table2)
    df = df.set_index("index")
    # Data output
    df.to_csv(f"./cc/{k2}.csv")
    viridis = cm.get_cmap('Reds', 5)

    # Sorting data and plotting heatmaps
    df_sorted = df.reindex(df.sum().sort_values(ascending=False).index, axis=1)
    df_sorted['sum'] = df_sorted.sum(axis=1)

    # Adjust the figure size to accommodate labels
    plt.figure(figsize=(max(12, float(len(rel[0]) * 0.7)), max(12, float(len(rel[1]) * 0.7))))
    df_sorted = df_sorted.sort_values(by="sum", ascending=False)[df_sorted.columns[:-1]]

    # Create the heatmap with label adjustments
    ax = sns.heatmap(
        df_sorted,
        cmap="Reds",
        vmin=0.0,
        vmax=0.5,
        square=True,
        annot=False,
        linewidths=0.1,
        linecolor="#fff",
        cbar=False,
        xticklabels=True,  # Ensure x-axis labels are shown
        yticklabels=True   # Ensure y-axis labels are shown
    )

    # Adjust x and y labels for readability
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right', fontsize=8)
    ax.set_yticklabels(ax.get_yticklabels(), fontsize=7)

    # Set the plot title
    plt.title(k2)

    plt.savefig(f"./hcc/{k2}_clusterd.svg")

    # Display the heatmap
    plt.show()
```

![data_science-data_clusterd](https://github.com/user-attachments/assets/6d1782dc-63f6-4978-8190-e6df629d62d2)
![data_science-population_clusterd](https://github.com/user-attachments/assets/fcb28f13-d727-4afb-90da-7c6895b38c7c)
![mental_health-data_clusterd](https://github.com/user-attachments/assets/a6e52788-ac30-4f25-9e41-69760f511928)
![mental_health-data_science_clusterd](https://github.com/user-attachments/assets/d7f1c20d-c163-4dc9-a859-c1732473b33c)
![population-mental_health _clusterd](https://github.com/user-attachments/assets/e8216b4f-77e3-4433-bdf2-92e7058606b9)
![urban-data_clusterd](https://github.com/user-attachments/assets/27336dff-60bd-4988-aae4-d893df0ed535)
![urban-data_science_clusterd](https://github.com/user-attachments/assets/82a6d19b-5a23-45a7-ab0c-8b41ed0d4975)
![urban-mental_health_clusterd](https://github.com/user-attachments/assets/f4fea6a9-92d3-4f8f-9497-7eb041ad6e33)
![urban-population_clusterd](https://github.com/user-attachments/assets/008f3234-83df-41d9-987e-85663b2b4e1f)

Tasks are still ongoing?

- Designing the workflow diagram for this part (the relationships between the different categories and how the different heatmaps are organized).

**Challenges and Learning**

What was the biggest challenge you faced this week? How did you address it?
The biggest challenge was categorizing the keywords in vocabulary.txt. Because of the large amount of data, I ended up using Word's filtering function to do the categorization. For the visualization part, I spent a long time debugging the sorting and the filtering/graph-construction steps, but with the help of the reference I was able to debug them all.

What's one new thing you learned or skill you improved?
Learned how to build a data analysis based on N-gram similarity and how to visualize it.

**Reflection and Planning**

Main focus for next week?

1. Finish the Hierarchical Agglomerative Clustering (HAC) and correlation matrix visualizations.
2. Review at least 100 full texts.

Any resources you're looking for?
Not yet.
changdama commented 8 hours ago

Week10

Quick Overview List your top 3 tasks or objectives for this week:

Task 1: Review at least 100 full texts. Task 2: Finish the Hierarchical Agglomerative Clustering (HAC) and correlation matrix visualizations.

Weekly Accomplishments

What tasks did you complete this week? (Include links)

  1. Reviewed at least 100 full texts and raised some questions.

2. Finished the Hierarchical Agglomerative Clustering (HAC) and correlation matrix visualizations. First, the Word2Vec data preprocessing: extraction, classification, and storage of the word embedding vectors.

# Display table header

display(HTML('''<b> The following table shows each word, its corresponding 300-dimension vector, and its category</b>'''))

# Initialize dictionary for DataFrame
dfdict = {"word": []}
for i in range(1, 301):
    dfdict[i] = []

# Process each word, checking if it exists in the model's vocabulary
for i in list(G.nodes):
    dfdict["word"].append(i)

    if i in model:  # Check if the word exists in the model's vocabulary
        thisvec = list(model.get_vector(i))
    else:
        thisvec = [0] * 300  # Use a zero vector if the word is not in the vocabulary

    # Ensure vector is 300 dimensions
    if len(thisvec) < 300:
        thisvec += [0] * (300 - len(thisvec))
    elif len(thisvec) > 300:
        thisvec = thisvec[:300]

    for ix in range(300):
        dfdict[ix + 1].append(thisvec[ix])

# Convert dictionary to DataFrame and save as TSV
embd = pd.DataFrame(dfdict).set_index("word")
embd.to_csv("../embedding_matrix.tsv", sep="\t", index=False)

# Define function to assign category
def return_type(x):
    categories = {
        "mental_health": mental_health,
        "urban": urban,
        "data": data,
        "population": population,
        "data_science": data_science
    }
    for k, v in categories.items():
        if x in v:
            return k
    return None  # Return None if no category matches

# Apply category assignment
embd = embd.reset_index()
embd["category"] = embd["word"].apply(lambda x: return_type(x))
embd = embd.set_index("word")

Second, load data, perform clustering, visualize

#load all the relation dataframes

files = os.listdir("./cc/")
all_dfs = {}
for f in files:
    if ".csv" in f:
        all_dfs[f.replace(".csv", "")]=(pd.read_csv("./cc/"+f))

import matplotlib.patches as mpatches

# Display color bar for correlation values
display(HTML('''<b>Colorbar indicating the correlation value</b><br>
                0.0 = weak correlation, 1.0 = strong correlation'''))
create_color_bar(min_v = 0.0, max_v =1.0)

# initialize the first graph figure
figN = 1

for key, df in all_dfs.items():
    display(HTML(f"{figN}- <b>{key.split('-')[1]} category</b> hvc clustering."))

    # Set index and prepare data for clustering
    df = df.set_index("index")
    wordsHere = list(df.columns)
    words_vectors = [model.get_vector(i) for i in wordsHere]
    words_vec_df = pd.DataFrame({"X": words_vectors, "y": wordsHere})
    X = [list(x) for x in words_vec_df['X']]
    labels_2 = list(words_vec_df['y'])

    # Compute linkage and set color threshold for clusters
    Z = sch.linkage(X, method="ward")
    color_threshold = 0.8 * max(Z[:, 2])  # Adjust threshold for colors

    # Plot dendrogram with colors
    fig, ax = plt.subplots(figsize=(len(X)/5.0, 1))
    dendrogram = sch.dendrogram(
        Z,
        color_threshold=color_threshold,
        labels=labels_2,
        orientation='top',
        ax=ax,
        leaf_rotation=90
    )
    ax.tick_params(axis='x', which='major', labelsize=10)
    ax.tick_params(axis='y', which='major', labelsize=8)

    # Generate legend based on unique colors in dendrogram
    color_list = dendrogram['color_list']
    unique_colors = list(set(color_list))
    patches = [mpatches.Patch(color=color, label=f'Cluster {i+1}') for i, color in enumerate(unique_colors)]
    plt.legend(handles=patches, bbox_to_anchor=(1.05, 1), loc='upper left', title="Clusters")

    # Save dendrogram
    plt.savefig(f"./hvc/{key}_clusterd_with_legend.svg", bbox_inches='tight')
    plt.show()

    # Calculate correlation matrix and plot heatmap
    df = df.drop_duplicates()
    dd = df[dendrogram["ivl"]]  # Reorder columns according to the dendrogram order
    corr = dd.corr()
    mask = np.triu(np.ones_like(corr, dtype=bool))

    plt.figure(figsize=(len(X)/1.0, len(X)/1.5), dpi=300)
    plt.title(f"{key.split('-')[1]} < -- > {key.split('-')[1]}")
    sns.heatmap(
        corr,
        cmap="Reds",
        vmin=0.0,
        vmax=1.0,
        cbar=False,
        square=True,
        linewidth=0.5
    )

    # Save reordered dataframe and correlation matrix
    dd.to_csv(f"./cd/{key}.csv")
    plt.savefig(f"./Correlation_matrix/{key}.svg")
    plt.show()

    figN += 1

(Figures: dendrograms and correlation heatmaps for the data, data_science, mental_health, and data_science-population categories)

Tasks are still ongoing? Design and organize the relationship between the Hierarchical Agglomerative Clustering (HAC) and correlation matrix visualizations. Continue focusing on the full-text review.

Challenges and Learning

What was the biggest challenge you faced this week? How did you address it?

  1. When I re-run the code, differences in data arrangement, numerical calculation details, or plotting parameters can still make the resulting images look different.
    
    Z = sch.linkage(X, method="ward")
    dd = df[dendrogram["ivl"]]
    corr = dd.corr()

The leaf order is not fixed in the code, and without a consistent random seed the correlation heatmaps come out differently between runs.
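One possible way to stabilize this (a hedged suggestion, not part of the reference code): fix the random seed and request an optimally ordered linkage so the leaf order, and therefore the column order of the correlation heatmap, stays the same between runs. If the remaining variation comes from retraining Word2Vec itself, its seed and worker count would also have to be fixed.

```python
# Hedged sketch: make the dendrogram leaf order reproducible between runs.
# X and labels_2 are the word vectors and word labels built in the clustering loop above.
import numpy as np
import scipy.cluster.hierarchy as sch

np.random.seed(42)                                         # fix NumPy's global seed

Z = sch.linkage(X, method="ward", optimal_ordering=True)   # deterministic, optimally ordered leaves
dendro = sch.dendrogram(Z, labels=labels_2, no_plot=True)
leaf_order = dendro["ivl"]                                 # stable column order for the heatmap
```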

What's one new thing you learned or skill you improved?
Learned the methods of Hierarchical Agglomerative Clustering (HAC) and correlation matrices.

**Reflection and Planning**

Main focus for next week?

1. Finish the full-text review.
2. Finish the cross-relation analysis between the different categories.

Any resources you're looking for?
Not yet
changdama commented 8 hours ago

Week10

Quick Overview List your top 3 tasks or objectives for this week:

Task 1: Solve the problem of inconsistent or unclear descriptions of the population age in the literature during screening. Task 2: Finish the full-text review. Task 3: Complete the visualization of the cross-relations between the different categories.

Weekly Accomplishments

What tasks did you complete this week? (Include links)

  1. Solved the problem of inconsistent or unclear descriptions of the population age in the literature during screening. I summarized the different types of participant-age problems and worked out how to handle each of them.

2. Completed the full-text review screening.

3. Completed the visualization of the cross-relations between the different categories.

display(HTML('''<b>Colorbar indicating the cross relation values between each pair of categories</b><br>
                <p>These values are extracted from the cosine similarity metric </p>
                0.0 = weak correlation, 0.5 = strong correlation'''))

all_orders={}
create_color_bar(min_v = 0.0, max_v = 0.5, bounds=list(np.array(range(6))/10.))
figN = 1
for key, dd in all_dfs.items():
    display(HTML(str(figN) + "- "+ key.replace("-", "< -- >")))
    dd2 = pd.read_csv("./cd/"+key+".csv").set_index("index")
    key_part = key.split("-")[0]

    if key_part in all_orders:
        dd2 = dd2.reindex(all_orders[key_part])
    else:
        print(f"Warning: '{key_part}' not found in all_orders, skipping reindexing.")

    plt.figure(figsize=(len(dd2.columns)/1.0,len(dd2.index)/1.0), dpi=300)
    plt.title(key.replace("-", "< -- >"))
    sns.heatmap(dd2,
                cmap="Reds",
                square=True,
                vmin=0.0,
                vmax=0.5,
                linewidth=0.5,
                cbar=False
               )
    plt.savefig(f"./Cross_realtion_matrix/{key}_cross_rel.svg")
    plt.show()
    figN +=1

(Figures: data_science-data, data_science-population, mental_health-data, mental_health-data_science, population-mental_health, urban-data, urban-data_science, urban-mental_health, and urban-population cross-relation heatmaps)

Tasks are still ongoing? Reorganize the relationship between the cross-correlation heatmaps and the HAC heatmaps as a final result.

Challenges and Learning

What was the biggest challenge you faced this week? How did you address it?

Explaining the difference between N-gram similarity and cross-correlation. In other words, if the N-gram analysis has already produced a similarity analysis between the different relations, why analyze cross-correlation as well? The difference is that the N-gram similarity is derived from the word vectors themselves, i.e. frequency statistics and probability calculations, while the cross-correlation is computed with a statistical measure (cosine similarity).
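For reference, a minimal sketch of the cosine-similarity measure behind the cross-relation values (my own illustration; the example words assume the trained KeyedVectors model from Week 8 is loaded as model):

```python
# Hedged sketch: cosine similarity between two word vectors.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. with the trained embeddings (illustrative words):
# sim = cosine_similarity(model["urban"], model["health"])
```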

What's one new thing you learned or skill you improved? Using the cosine similarity method to visualize the relationships between the different category pairs, and understanding the difference between cross-correlation and N-gram similarity.

Reflection and Planning

Main focus for next week?

1. Make a draft of the data extraction template and draft the structure of the "Introduction" and "Method" parts. 2. Complete the word embeddings 2D projection visualization.

Any resources you're looking for? Not yet

changdama commented 7 hours ago

Week11

Quick Overview List your top 3 tasks or objectives for this week:

Task 1: Complete the draft of the data extraction part. Task 2: Complete the word embeddings 2D projection visualization. Task 3: Complete the structure of the "Introduction" and "Method" parts.

Weekly Accomplishments

What tasks did you complete this week? (Include links)

  1. Completed the draft of the data extraction part.

2. Drafted the structure of the "Introduction" and "Method" parts: https://gtvault-my.sharepoint.com/:w:/g/personal/cma326_gatech_edu/EQWfDwRri6hNlsvWU0vcVVcB9ARsGi1TS0aJ9umymZSE4A?e=UiZzhy

3. Completed the word embeddings 2D projection visualization.

# Load the bookmark JSON
with open("/home/changda/musi6204/vip/bookmark.json", 'r') as bm:
    bookmark = json.loads(bm.read())

# Load the TSV file, assuming the first column is "word" and the second column is "category", followed by embedding values
embedding_columns = ["word", "category"] + [f"embedding_{i}" for i in range(1, 101)]
embd = pd.read_csv("labels.tsv", sep='\t', header=None, names=embedding_columns)

# Access projections directly under 'root'
word_pos = pd.DataFrame(bookmark['projections'])
word_pos["word"] = embd.reset_index()["word"]  # Ensure index reset and assign words
word_pos = word_pos.set_index("word")

# Check for duplicate index values in both DataFrames
if embd["word"].duplicated().any():
    print(f"Duplicate words found in embd: {embd['word'].duplicated().sum()} duplicates")
    embd = embd.drop_duplicates(subset="word")  # Drop duplicates in embd

if word_pos.index.duplicated().any():
    print(f"Duplicate words found in word_pos: {word_pos.index.duplicated().sum()} duplicates")
    word_pos = word_pos[~word_pos.index.duplicated(keep="first")]  # Drop duplicates in word_pos

# Align indexes between embd and word_pos
common_index = pd.Index(embd["word"].astype(str)).intersection(word_pos.index)

# Update embd and word_pos to only keep common words
embd = embd.set_index("word").loc[common_index]  # Keep only common words in embd
word_pos = word_pos.loc[common_index]  # Keep only common words in word_pos

# Combine embeddings and projections
embd_with_word_pos = pd.concat([embd, word_pos], axis=1)

# Process data for visualization
# Generate default x and y for visualization if 'umap-0' and 'umap-1' don't exist
if "umap-0" not in embd_with_word_pos.columns or "umap-1" not in embd_with_word_pos.columns:
    embd_with_word_pos["umap-0"] = embd_with_word_pos.iloc[:, 2]  # Example: Use the third column as x
    embd_with_word_pos["umap-1"] = embd_with_word_pos.iloc[:, 3]  # Example: Use the fourth column as y

# Prepare data for visualization
xy = embd_with_word_pos[["umap-0", "umap-1", "category"]].rename({"umap-0": "x", "umap-1": "y"}, axis=1)

# Map categories to colors
category_to_color = {"mental_health": "#F15A22", "population": "#6DC8BF", "urban": "#B72467","data science": "#CBDB2A","data": "#FFA07A",}
xy["color_p"] = xy["category"].map(category_to_color).fillna("#000000")  # Default to black if category is not in palette

# Resulting DataFrame `xy` is ready
print(xy.head())
G = nx.Graph()
for node in xy.index:
    G.add_node(node)
for i, node1 in enumerate(xy.index):
    for node2 in xy.index[i+1:]:
        G.add_edge(node1, node2)    
node_degrees = nx.degree(G)
nx.set_node_attributes(G, "degree", node_degrees)

graph_colors = xy[["color_p"]].to_dict()["color_p"]
xy["pos"] = xy.apply(lambda x : (x["x"], x["y"]), axis =1)
graph_pos = xy["pos"].to_dict()

- Extract information about the nodes in the data and label the nodes in the main vocabulary.
```python
# Get a list of words for each category from the xy DataFrame
population = xy[xy["category"] == "population"].index.tolist()
data = xy[xy["category"] == "data"].index.tolist()
data_science = xy[xy["category"] == "data science"].index.tolist()
mental_health = xy[xy["category"] == "mental_health"].index.tolist()
urban = xy[xy["category"] == "urban"].index.tolist()

# List of words from all categories
all_main_words = set(population + data + data_science + mental_health + urban)

# Create the main_words list and check whether each node belongs to the main vocabulary
main_words = []
for k, v in G.degree():
    if k in all_main_words:
        main_words.append(k)
    else:
        main_words.append("")

# Set the distance threshold
distance_threshold = 0.25

# Calculate the distance between all nodes
from scipy.spatial import distance_matrix  # needed for distance_matrix below
positions = xy[["x", "y"]].values
dist_matrix = distance_matrix(positions, positions)

# Get the index pairs of nodes that meet the distance threshold; these will be connected
edges = np.argwhere((dist_matrix < distance_threshold) & (dist_matrix > 0))

# Obtain information about the node degree
degree = dict(G.degree())

# Draw the graph
plt.figure(figsize=(20, 15), dpi=300)
nx.draw_networkx_nodes(
    G,
    pos=graph_pos,
    node_color=[v for k, v in graph_colors.items()],
    node_size=[v * 1.3 for k, v in degree.items()],
    alpha=0.6
)

# Draw edges that meet the distance threshold
for i, j in edges:
    plt.plot(
        [positions[i][0], positions[j][0]],
        [positions[i][1], positions[j][1]],
        "k-", alpha=0.2, linewidth=0.8
    )

# Add label text
texts = []
for indx, i in enumerate(main_words[:]):
    if i != "":
        texts.append(plt.text(xy.reset_index().loc[indx]["x"], xy.reset_index().loc[indx]["y"], i))
adjust_text(texts, only_move={'texts': 'x'}, arrowprops=dict(arrowstyle="-", color='k', lw=0.7))

# Save the image
plt.savefig("./graph_embeddings_projection.svg")
plt.show()
```

![graph_embeddings_projection_with_edges](https://github.com/user-attachments/assets/f3dc6a29-d8b6-464e-876f-e5487b958176)

Visualization option 2: based on the distance threshold and Louvain community partition.

```python
import community as community_louvain  # python-louvain package

# Set the distance threshold
distance_threshold = 0.25

# Get the node locations
positions = xy[["x", "y"]].values
dist_matrix = distance_matrix(positions, positions)

# Create the graph and add the nodes
graph = nx.Graph()
graph.add_nodes_from(range(len(positions)))

# Add edges that meet the distance requirement
edges = np.argwhere((dist_matrix < distance_threshold) & (dist_matrix > 0))
graph.add_edges_from(edges)

# Louvain community detection
partition = community_louvain.best_partition(graph)

# Get the number of Louvain communities
num_communities = max(partition.values()) + 1
color_map = cm.get_cmap('tab20', num_communities)

# Give a different colour to every partition
node_colors = [color_map(partition[node]) for node in graph.nodes()]

# Use spring_layout for the node positions
pos = nx.spring_layout(graph, k=0.1, seed=42)

# Draw the graph
plt.figure(figsize=(20, 15), dpi=300)

# Nodes
nx.draw_networkx_nodes(
    graph,
    pos=pos,
    node_color=node_colors,
    node_size=100,
    alpha=0.8
)

# Edges
nx.draw_networkx_edges(
    graph,
    pos=pos,
    edgelist=edges,
    edge_color='grey',
    alpha=0.3,
    width=0.5
)

# Labels
texts = []
for indx, word in enumerate(main_words):
    if word != "":
        x, y = pos[indx]
        texts.append(plt.text(x, y, word, fontsize=8))
adjust_text(texts, only_move={'texts': 'xy'}, arrowprops=dict(arrowstyle="-", color='k', lw=0.5))

# Save the image
plt.savefig("./graph_embeddings_projection_with_communities.svg")
plt.show()
```


![graph_embeddings_projection_with_communities](https://github.com/user-attachments/assets/0bb7fb71-24f4-4e01-926a-498267905c24)

Tasks are still ongoing?
Fill in more details of the data extraction template draft.
Improve more details of the structure of the "Introduction" and "Method" parts.

**Challenges and Learning**

What was the biggest challenge you faced this week? How did you address it?
I was not familiar with word vector projection, so I learned from the reference: https://github.com/ideas-lab-nus/data-science-bldg-energy-efficiency

What's one new thing you learned or skill you improved?
Understood how to use Word2Vec to visualize the projection.