RichardHightower / notion_extractor

Extracts a Notion zip export of markdown into normal markdown files.

Extract using original path hierarchy #3

Open RichardHightower opened 3 weeks ago

RichardHightower commented 3 weeks ago

Currently, it extracts files to a flat directory structure.

Use the original file/path structure instead. The tricky part is updating the links: we have to keep track of each file's depth so that relative links still resolve. This is a super low priority task.

We probably want to create a new program that does this instead of trying to combine all of the diverse features into one.


Hierarchical Export Specification

Overview

This utility enhances the current export_watcher.py functionality by preserving Notion's hierarchical file/directory structure while cleaning filenames and maintaining proper internal markdown link relationships.

Core File Processing Rules

Filename Cleaning

All files must follow these standardized cleaning rules from export_watcher.py:

  1. Remove date patterns:

    # Remove patterns like "10 24 2024 - " from start of filename
    filename = re.sub(r'\d{2}[\s_]+\d{2}[\s_]+\d{4}[\s_]*-[\s_]*', '', filename)
  2. Remove GUID patterns:

    # Remove 32-character hex strings from end of filename
    # (run this on the name without its .md extension, otherwise `$` will not match)
    filename = re.sub(r'\s+[a-f0-9]{32}$', '', filename)
  3. Standardize naming:

    • Replace spaces with underscores
    • Remove double underscores
    • Remove trailing underscores
    • Preserve original file extension (.md)
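
The three rules above can be combined into one helper. This is a minimal sketch, not the actual export_watcher.py code: the name `clean_filename` and the module-level constants are hypothetical, and it assumes the extension is split off before the GUID regex runs.

```python
import os
import re

# Patterns copied from the rules above.
DATE_PREFIX = re.compile(r'\d{2}[\s_]+\d{2}[\s_]+\d{4}[\s_]*-[\s_]*')
GUID_SUFFIX = re.compile(r'\s+[a-f0-9]{32}$')


def clean_filename(filename):
    """Apply the cleaning rules to one file name (hypothetical helper)."""
    # Set the extension aside so the GUID pattern can anchor on end-of-string.
    stem, ext = os.path.splitext(filename)
    stem = DATE_PREFIX.sub('', stem)           # rule 1: drop the date prefix
    stem = GUID_SUFFIX.sub('', stem)           # rule 2: drop the trailing GUID
    stem = stem.strip().replace(' ', '_')      # rule 3: spaces to underscores
    stem = re.sub(r'_{2,}', '_', stem)         #         collapse doubled underscores
    stem = stem.rstrip('_')                    #         drop trailing underscores
    return stem + ext                          #         keep the original extension


print(clean_filename("10 24 2024 - Doc One 0123456789abcdef0123456789abcdef.md"))
```

Run against a name with a full 32-character ID, this prints `Doc_One.md`.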

Directory Structure

Input Structure (IDs abbreviated here; the cleaning rules expect 32-character hex IDs):

data/
├── 10 24 2024 - Root Doc abc123.md
├── 10 24 2024 - Folder One def456/
│   ├── 10 24 2024 - Doc One ghi789.md
│   └── 10 24 2024 - Subfolder jkl012/
│       └── 10 24 2024 - Doc Two mno345.md
└── 10 24 2024 - Folder Two pqr678/
    └── 10 24 2024 - Doc Three stu901.md

Output Structure:

output/hierarchical/
├── Root_Doc.md
├── Folder_One/
│   ├── Doc_One.md
│   └── Subfolder/
│       └── Doc_Two.md
└── Folder_Two/
    └── Doc_Three.md
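
One way to produce the output tree above is a single top-down walk that records where each input directory lands. This is only a sketch under assumptions: `build_directory_mapping` and `clean_name` are hypothetical names, and collisions (two directories that clean to the same name) are not handled.

```python
import os
import re


def clean_name(name):
    """Directory-name cleaning, following the rules in this spec."""
    name = re.sub(r'\d{2}[\s_]+\d{2}[\s_]+\d{4}[\s_]*-[\s_]*', '', name)
    name = re.sub(r'\s+[a-f0-9]{32}$', '', name)
    name = name.strip().replace(' ', '_')
    return re.sub(r'_{2,}', '_', name).rstrip('_')


def build_directory_mapping(input_dir, output_dir):
    """Create the cleaned directory tree and return {original_dir: new_dir}."""
    os.makedirs(output_dir, exist_ok=True)
    mapping = {input_dir: output_dir}
    # os.walk is top-down by default, so a parent is always mapped
    # before any of its children are visited.
    for root, dirs, _files in os.walk(input_dir):
        for d in dirs:
            src = os.path.join(root, d)
            dst = os.path.join(mapping[root], clean_name(d))
            os.makedirs(dst, exist_ok=True)
            mapping[src] = dst
    return mapping
```

The returned dictionary plays the role of `directory_mapping` in the components below; a real implementation would probably also want to detect name collisions before creating directories.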

Core Components

1. DirectoryProcessor

import re

class DirectoryProcessor:
    def __init__(self, input_dir, output_dir):
        self.input_dir = input_dir
        self.output_dir = output_dir
        self.filename_mapping = {}  # Maps original paths to new paths
        self.directory_mapping = {} # Maps original dirs to new dirs

    def clean_directory_name(self, dirname):
        # Remove date pattern
        dirname = re.sub(r'\d{2}[\s_]+\d{2}[\s_]+\d{4}[\s_]*-[\s_]*', '', dirname)
        # Remove GUID pattern
        dirname = re.sub(r'\s+[a-f0-9]{32}$', '', dirname)
        # Standardize format: spaces to underscores, no doubled or trailing underscores
        dirname = dirname.strip().replace(' ', '_')
        return re.sub(r'_{2,}', '_', dirname).rstrip('_')

    def process_directory_structure(self):
        """Creates cleaned directory structure and builds mapping"""

2. PathResolver

class PathResolver:
    def __init__(self, filename_mapping, directory_mapping):
        self.filename_mapping = filename_mapping
        self.directory_mapping = directory_mapping

    def get_relative_path(self, source_file, target_file):
        """Calculate relative path between two files in hierarchy"""

    def update_markdown_links(self, content, current_file_path):
        """Update all markdown links in content based on new paths"""

3. FileProcessor

class FileProcessor:
    def __init__(self, directory_processor, path_resolver):
        self.directory_processor = directory_processor
        self.path_resolver = path_resolver

    def process_file(self, input_path, output_path):
        """Process single file - clean name and update links"""

    def process_all_files(self):
        """Process all files while maintaining hierarchy"""

Processing Flow

  1. Directory Structure Creation

    • Scan input directory recursively
    • Clean directory names using rules above
    • Create output directory structure
    • Build directory mapping
  2. Initial File Processing

    • Clean filenames using export_watcher.py rules
    • Copy files to new locations
    • Build comprehensive path mapping
    • Preserve file metadata
  3. Link Resolution

    • Parse markdown files for links
    • Calculate new relative paths
    • Update links using new paths
    • Validate updated links
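
For the "Calculate new relative paths" step, the standard library already does the depth bookkeeping. The sketch below is one possible body for `get_relative_path`; it assumes both arguments are cleaned output paths relative to the output root and written with `/` separators, which the spec does not state.

```python
import posixpath


def get_relative_path(source_file, target_file):
    """Return the link to write inside source_file so that it points at target_file.

    Both paths are assumed to be relative to the output root, using '/'.
    """
    # A link is resolved from the directory that contains the linking file.
    start = posixpath.dirname(source_file) or '.'
    return posixpath.relpath(target_file, start=start)


print(get_relative_path("Folder_One/Doc_One.md", "Folder_One/Subfolder/Doc_Two.md"))
print(get_relative_path("Folder_One/Subfolder/Doc_Two.md", "Root_Doc.md"))
```

The two calls print `Subfolder/Doc_Two.md` and `../../Root_Doc.md`. Using `posixpath` rather than `os.path` keeps forward slashes in the generated links on every platform.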

Link Processing Examples

  1. Same Directory:

    Original: [Link](10 24 2024 - Doc Two abc123.md)
    Updated:  [Link](Doc_Two.md)
  2. Child Directory:

    Original: [Link](10 24 2024 - Subfolder def456/10 24 2024 - Doc Three ghi789.md)
    Updated:  [Link](Subfolder/Doc_Three.md)
  3. Parent Directory:

    Original: [Link](../10 24 2024 - Root Doc jkl012.md)
    Updated:  [Link](../Root_Doc.md)
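
Because the hierarchy is preserved, the number of `..` segments in a link does not change; only the individual path segments need cleaning. The sketch below rewrites links on that basis. It is an illustration, not the planned implementation: the function names are hypothetical, the link regex is deliberately simple (no nested brackets or parentheses), and the `unquote` call assumes exported links may be percent-encoded.

```python
import posixpath
import re
from urllib.parse import unquote

DATE_PREFIX = re.compile(r'\d{2}[\s_]+\d{2}[\s_]+\d{4}[\s_]*-[\s_]*')
GUID_SUFFIX = re.compile(r'\s+[a-f0-9]{32}$')
MD_LINK = re.compile(r'\[([^\]]*)\]\(([^)]+)\)')  # [text](target)


def clean_segment(segment, is_file=False):
    """Clean one path segment; '.' and '..' pass through untouched."""
    if segment in ('', '.', '..'):
        return segment
    # Only the final segment is treated as a file with an extension.
    stem, ext = posixpath.splitext(segment) if is_file else (segment, '')
    stem = GUID_SUFFIX.sub('', DATE_PREFIX.sub('', stem))
    stem = re.sub(r'_{2,}', '_', stem.strip().replace(' ', '_')).rstrip('_')
    return stem + ext


def update_markdown_links(content):
    """Rewrite relative link targets in content to their cleaned names."""
    def rewrite(match):
        text, target = match.group(1), unquote(match.group(2))
        # Leave external links and in-page anchors alone.
        if '://' in target or target.startswith(('#', 'mailto:')):
            return match.group(0)
        parts = target.split('/')
        cleaned = [clean_segment(p) for p in parts[:-1]]
        cleaned.append(clean_segment(parts[-1], is_file=True))
        return f"[{text}]({'/'.join(cleaned)})"
    return MD_LINK.sub(rewrite, content)
```

This approach never consults the path mappings, so it only works while link targets use exactly the names found on disk; a mapping lookup would be the safer choice if cleaned names can collide.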

Error Handling

  1. File Operation Errors

    try:
        shutil.copy2(src_path, dst_path)
    except PermissionError:
        logging.error(f"Permission denied: {src_path}")
    except FileNotFoundError:
        logging.error(f"Source file not found: {src_path}")
  2. Link Resolution Errors

    def validate_link(self, link_path):
        if not os.path.exists(link_path):
            logging.warning(f"Broken link detected: {link_path}")
        return link_path
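
One detail worth noting for link validation: `os.path.exists` resolves a relative path against the current working directory, while a markdown link is relative to the file that contains it. A sketch that resolves each link from its containing file first, with `find_broken_links` as a hypothetical helper name:

```python
import logging
import os


def find_broken_links(markdown_path, link_targets):
    """Return the targets from link_targets that do not resolve from markdown_path."""
    base_dir = os.path.dirname(markdown_path)
    broken = []
    for target in link_targets:
        if '://' in target:
            continue  # external URL, nothing to check on disk
        path = target.split('#', 1)[0]  # ignore any "#section" fragment
        if not path:
            continue  # pure in-page anchor
        resolved = os.path.normpath(os.path.join(base_dir, path))
        if not os.path.exists(resolved):
            logging.warning(f"Broken link detected in {markdown_path}: {target}")
            broken.append(target)
    return broken
```

Returning the list of broken targets, rather than only logging them, would also make the "Link validation report" idea listed under future enhancements straightforward to build.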

Configuration

class Config:
    def __init__(self):
        self.input_dir = "data"
        self.output_dir = "output/hierarchical"
        self.preserve_timestamps = True
        self.logging_level = logging.INFO
        self.validate_links = True

Logging

Using export_watcher.py logging format:

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('hierarchical_export.log'),
        logging.StreamHandler()
    ]
)

Usage

def main():
    try:
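        # HierarchicalProcessor is not defined in this spec; presumably a facade
        # over DirectoryProcessor, PathResolver, and FileProcessor
        processor = HierarchicalProcessor()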
        processor.process_directory_structure()
        processor.process_files()
        processor.validate_output()
        logging.info("Hierarchical export completed successfully")
    except Exception as e:
        logging.error(f"Export failed: {e}")
        return 1
    return 0

Future Enhancements

  1. Performance Optimizations

    • Parallel directory processing
    • Batch file operations
    • Link cache management
  2. Additional Features

    • Generate structure visualization
    • Export structure report
    • Link validation report
    • Broken link detection
  3. Integration Options

    • ZIP file support
    • Multiple export format support
    • Custom naming rules
    • Structure templates

This specification maintains compatibility with the existing export_watcher.py functionality while adding hierarchical structure preservation and enhanced link management.