NFDI4BIOIMAGE / search_engine

search engine for the NFDI4BIOIMAGE materials
BSD 3-Clause "New" or "Revised" License

Implement data source for search bar #18

Closed SeverusYixin closed 1 month ago

SeverusYixin commented 1 month ago

This version contains a first implementation of the connection between the search engine and the YAML "database".

SeverusYixin commented 1 month ago

Hi @haesleinhuepf, would you mind helping me review this code?

haesleinhuepf commented 1 month ago

Hi @SeverusYixin,

I don't feel qualified to review the .js files.

Just two general suggestions:

  • Write a comment here and there, e.g. at the very beginning of index_data.py explaining what the file does, or what a longer code block does.
  • Consider splitting code into functions in case it does multiple things. index_data.py looks a bit like spaghetti code.

Out of curiosity, I asked Claude to optimize the code and make it less spaghetti-like, and this is what it came up with:

import json
import yaml
import logging
from elasticsearch import Elasticsearch
import os

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Configuration
ES_HOST = 'localhost'
ES_PORT = 9200
ES_SCHEME = 'http'
ES_AUTH = ('admin', 'admin123')
ES_INDEX = 'bioimage-training'
BASE_PATH = os.path.join(os.path.dirname(__file__), '..', '..', 'resources')
YAML_FILES = [
    'blog_posts.yml', 'events.yml', 'materials.yml', 'nfdi4bioimage.yml',
    'papers.yml', 'workflow-tools.yml', 'youtube_channels.yml'
]

def connect_to_elasticsearch():
    """Establish connection to Elasticsearch."""
    return Elasticsearch([{'host': ES_HOST, 'port': ES_PORT, 'scheme': ES_SCHEME}],
                         basic_auth=ES_AUTH)

def read_yaml_file(file_path):
    """Read and parse YAML file."""
    try:
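        # Explicit encoding so non-ASCII titles and names read the same on every platform
        with open(file_path, 'r', encoding='utf-8') as file: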
            return yaml.safe_load(file)
    except FileNotFoundError:
        logger.error(f"File not found: {file_path}")
    except yaml.YAMLError:
        logger.error(f"Error reading YAML file: {file_path}")
    return None

def index_data(es, data):
    """Index data into Elasticsearch."""
    if not isinstance(data, list):
        logger.error(f"Data is not a list: {data}")
        return

    for item in data:
        if not isinstance(item, dict):
            logger.error(f"Item is not a dictionary: {item}")
            continue

        try:
            # 'document' replaces the deprecated 'body' keyword in elasticsearch-py 8.x
            es.index(index=ES_INDEX, document=item)
            logger.info(f"Indexed item: {item}")
        except Exception as e:
            logger.error(f"Error indexing item: {item} - {e}")

def main():
    es = connect_to_elasticsearch()

    for file_name in YAML_FILES:
        file_path = os.path.join(BASE_PATH, file_name)
        logger.info(f"Processing file: {file_path}")

        content = read_yaml_file(file_path)
        if content is None:
            continue

        data = content.get('resources', [])
        logger.info(f"Data read from file: {data}")

        index_data(es, data)

    logger.info("Data indexing complete.")

if __name__ == "__main__":
    main()

I'm not proposing to use this code, and I haven't tested it. I just presume that, in the mid to long term, such code is easier to maintain if it is written in small, well-documented, reusable functions.
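As a rough illustration of why small functions help, the sketch below checks the indexing logic without a running Elasticsearch server. Everything in it is an assumption made for the example rather than code from the repository: `FakeElasticsearch` is a hypothetical stand-in that only records calls, and `index_data` is a condensed copy of the function above that takes the index name as a parameter and returns a count, which the original does not do.

```python
import logging

logger = logging.getLogger(__name__)


class FakeElasticsearch:
    """Hypothetical stand-in for the real client; it only records calls."""

    def __init__(self):
        self.calls = []

    def index(self, index, **kwargs):
        # Store the target index and whatever keyword arguments were passed
        self.calls.append((index, kwargs))


def index_data(es, data, index_name="bioimage-training"):
    """Condensed copy of index_data; returns the number of items sent (an addition)."""
    if not isinstance(data, list):
        logger.error("Data is not a list")
        return 0

    indexed = 0
    for item in data:
        if not isinstance(item, dict):
            logger.error("Skipping non-dictionary item: %r", item)
            continue
        # Keyword name assumed (elasticsearch-py 8.x style)
        es.index(index=index_name, document=item)
        indexed += 1
    return indexed


def test_index_data_skips_invalid_items():
    fake = FakeElasticsearch()
    count = index_data(fake, [{"name": "a"}, "not a dict", {"name": "b"}])
    assert count == 2
    assert [kwargs["document"]["name"] for _, kwargs in fake.calls] == ["a", "b"]


def test_index_data_rejects_non_list():
    fake = FakeElasticsearch()
    assert index_data(fake, {"name": "a"}) == 0
    assert fake.calls == []
```

Because the client is passed in as an argument, the same function works with the real `Elasticsearch` object in production and with the fake one in tests; a test runner such as pytest would pick up the two `test_` functions automatically.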

Best, Robert

That's enough; it will help me standardize my code formatting a bit. Thank you very much :)