Currently, it extracts files to a flat directory structure.

Use the original file/path structure. The tricky part is updating the links: each rewritten link has to account for how deep its source file sits in the hierarchy. This is a very low-priority task.

We probably want to create a new program that does this instead of trying to combine all of the diverse features into one.

Hierarchical Export Specification

Overview

This utility enhances the current export_watcher.py functionality by preserving Notion's hierarchical file/directory structure while cleaning filenames and maintaining proper internal markdown link relationships.

Core File Processing Rules

Filename Cleaning

All files must follow these standardized cleaning rules from export_watcher.py:

Remove date patterns:

# Remove patterns like "10 24 2024 - " from start of filename
filename = re.sub(r'\d{2}[\s_]+\d{2}[\s_]+\d{4}[\s_]*-[\s_]*', '', filename)

Remove GUID patterns:

# Remove 32-character hex strings from end of filename
filename = re.sub(r'\s+[a-f0-9]{32}$', '', filename)

Standardize naming:
- Replace spaces with underscores
- Remove double underscores
- Remove trailing underscores
- Preserve original file extension (.md)
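
The rules above can be combined into one helper. This is a minimal sketch, not the code from export_watcher.py: the name `clean_filename` is hypothetical, and it assumes the two patterns are applied to the name without its extension, since the `$` in the GUID pattern cannot match while `.md` is still attached.

```python
import os
import re

def clean_filename(filename):
    """Apply the cleaning rules to a single file name, keeping its extension."""
    # Assumption: clean the stem only, then re-attach the extension.
    stem, ext = os.path.splitext(filename)
    # Remove patterns like "10 24 2024 - "
    stem = re.sub(r'\d{2}[\s_]+\d{2}[\s_]+\d{4}[\s_]*-[\s_]*', '', stem)
    # Remove a trailing 32-character hex string
    stem = re.sub(r'\s+[a-f0-9]{32}$', '', stem)
    # Replace spaces with underscores, collapse repeats, trim trailing/leading ones
    stem = stem.strip().replace(' ', '_')
    stem = re.sub(r'_{2,}', '_', stem).strip('_')
    return stem + ext
```

For example, `clean_filename("10 24 2024 - Root Doc 0123456789abcdef0123456789abcdef.md")` returns `"Root_Doc.md"`.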

Directory Structure

Input Structure:

data/
├── 10 24 2024 - Root Doc abc123.md
├── 10 24 2024 - Folder One def456/
│   ├── 10 24 2024 - Doc One ghi789.md
│   └── 10 24 2024 - Subfolder jkl012/
│       └── 10 24 2024 - Doc Two mno345.md
└── 10 24 2024 - Folder Two pqr678/
    └── 10 24 2024 - Doc Three stu901.md

Output Structure:

output/hierarchical/
├── Root_Doc.md
├── Folder_One/
│   ├── Doc_One.md
│   └── Subfolder/
│       └── Doc_Two.md
└── Folder_Two/
    └── Doc_Three.md
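
One way to derive the output tree from the input tree is a single top-down walk that records where every directory and file should land. The sketch below is an assumption about how the mappings could be built, not part of the spec: `build_mappings` and `clean_name` are hypothetical names, and the function only computes paths without creating anything on disk.

```python
import os
import re

DATE = re.compile(r'\d{2}[\s_]+\d{2}[\s_]+\d{4}[\s_]*-[\s_]*')
GUID = re.compile(r'\s+[a-f0-9]{32}$')

def clean_name(name, is_file):
    """Clean one path component; only files keep a trailing extension."""
    stem, ext = os.path.splitext(name) if is_file else (name, '')
    stem = GUID.sub('', DATE.sub('', stem))
    stem = re.sub(r'_{2,}', '_', stem.strip().replace(' ', '_'))
    return stem.strip('_') + ext

def build_mappings(input_dir, output_dir):
    """Return (filename_mapping, directory_mapping) from original to cleaned paths."""
    directory_mapping = {input_dir: output_dir}
    filename_mapping = {}
    # os.walk is top-down by default, so a parent is always mapped before its children.
    for root, dirs, files in os.walk(input_dir):
        new_root = directory_mapping[root]
        for name in dirs:
            directory_mapping[os.path.join(root, name)] = os.path.join(
                new_root, clean_name(name, is_file=False))
        for name in files:
            filename_mapping[os.path.join(root, name)] = os.path.join(
                new_root, clean_name(name, is_file=True))
    return filename_mapping, directory_mapping
```

This sketch does not handle collisions: two siblings that differ only by date prefix or GUID would map to the same cleaned path, so a real implementation would need to detect that case.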

Core Components

1. DirectoryProcessor

import re

class DirectoryProcessor:
    def __init__(self, input_dir, output_dir):
        self.input_dir = input_dir
        self.output_dir = output_dir
        self.filename_mapping = {}  # Maps original paths to new paths
        self.directory_mapping = {} # Maps original dirs to new dirs

    def clean_directory_name(self, dirname):
        # Remove date pattern
        dirname = re.sub(r'\d{2}[\s_]+\d{2}[\s_]+\d{4}[\s_]*-[\s_]*', '', dirname)
        # Remove GUID pattern
        dirname = re.sub(r'\s+[a-f0-9]{32}$', '', dirname)
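        # Standardize format: spaces to underscores, collapse repeats, trim edges
        dirname = dirname.strip().replace(' ', '_')
        return re.sub(r'_{2,}', '_', dirname).strip('_')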

    def process_directory_structure(self):
        """Creates cleaned directory structure and builds mapping"""

2. PathResolver

class PathResolver:
    def __init__(self, filename_mapping, directory_mapping):
        self.filename_mapping = filename_mapping
        self.directory_mapping = directory_mapping

    def get_relative_path(self, source_file, target_file):
        """Calculate relative path between two files in hierarchy"""

    def update_markdown_links(self, content, current_file_path):
        """Update all markdown links in content based on new paths"""

3. FileProcessor

class FileProcessor:
    def __init__(self, directory_processor, path_resolver):
        self.directory_processor = directory_processor
        self.path_resolver = path_resolver

    def process_file(self, input_path, output_path):
        """Process single file - clean name and update links"""

    def process_all_files(self):
        """Process all files while maintaining hierarchy"""

Processing Flow

1. Directory Structure Creation
- Scan input directory recursively
- Clean directory names using rules above
- Create output directory structure
- Build directory mapping

2. Initial File Processing
- Clean filenames using export_watcher.py rules
- Copy files to new locations
- Build comprehensive path mapping
- Preserve file metadata

3. Link Resolution
- Parse markdown files for links
- Calculate new relative paths
- Update links using new paths
- Validate updated links
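
The "calculate new relative paths" step does not require tracking depth by hand if both files are expressed relative to the same output root. The following is a sketch under that assumption; `get_relative_link` is a hypothetical name, and `posixpath` is used so the result always contains forward slashes, as markdown links expect.

```python
import posixpath

def get_relative_link(source_file, target_file):
    """Return the link to write inside source_file so that it points at target_file.

    Both arguments are paths relative to the same output root, using '/'.
    """
    # Links are resolved from the directory that contains the source file.
    source_dir = posixpath.dirname(source_file) or "."
    return posixpath.relpath(target_file, start=source_dir)
```

With the output structure shown earlier, a link from `Folder_One/Subfolder/Doc_Two.md` to `Root_Doc.md` comes out as `../../Root_Doc.md`.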

Link Processing Examples

Same Directory:

Original: [Link](10 24 2024 - Doc Two abc123.md)
Updated:  [Link](Doc_Two.md)

Child Directory:

Original: [Link](10 24 2024 - Subfolder def456/10 24 2024 - Doc Three ghi789.md)
Updated:  [Link](Subfolder/Doc_Three.md)

Parent Directory:

Original: [Link](../10 24 2024 - Root Doc jkl012.md)
Updated:  [Link](../Root_Doc.md)
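
All three examples keep the same depth and only change the names along the path. That suggests a possible simplification, which is not stated in the spec: because every directory and file is renamed in place, a relative link can be rewritten by cleaning each path segment. The sketch below takes that approach. The function names are hypothetical, the link regex is deliberately simple (it does not handle targets that contain parentheses), and it assumes link targets may be percent-encoded in the export.

```python
import os
import re
from urllib.parse import unquote

LINK = re.compile(r'\[([^\]]*)\]\(([^)]+)\)')
DATE = re.compile(r'\d{2}[\s_]+\d{2}[\s_]+\d{4}[\s_]*-[\s_]*')
GUID = re.compile(r'\s+[a-f0-9]{32}$')

def clean_segment(segment, is_file):
    """Clean one path segment; '', '.' and '..' pass through unchanged."""
    if segment in ('', '.', '..'):
        return segment
    stem, ext = os.path.splitext(segment) if is_file else (segment, '')
    stem = GUID.sub('', DATE.sub('', stem))
    stem = re.sub(r'_{2,}', '_', stem.strip().replace(' ', '_'))
    return stem.strip('_') + ext

def update_markdown_links(content):
    """Rewrite relative markdown links in content to use cleaned names."""
    def rewrite(match):
        text, target = match.group(1), match.group(2)
        # Leave external URLs, mail links, and in-page anchors untouched.
        if '://' in target or target.startswith(('#', 'mailto:')):
            return match.group(0)
        segments = unquote(target).split('/')
        last = len(segments) - 1
        cleaned = [clean_segment(s, is_file=(i == last))
                   for i, s in enumerate(segments)]
        return f"[{text}]({'/'.join(cleaned)})"
    return LINK.sub(rewrite, content)
```

A mapping-based resolver, as outlined in PathResolver, would still be needed if files are ever moved to a different level rather than renamed in place.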

Error Handling

File Operation Errors

try:
    shutil.copy2(src_path, dst_path)
except PermissionError:
    logging.error(f"Permission denied: {src_path}")
except FileNotFoundError:
    logging.error(f"Source file not found: {src_path}")
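
One detail worth noting: `shutil.copy2` also raises `FileNotFoundError` when the destination's parent directory does not exist, in which case the "Source file not found" message would be misleading. A possible wrapper, sketched here with the hypothetical name `copy_preserving_metadata`, creates the parent directory first and reports success as a boolean.

```python
import logging
import os
import shutil

def copy_preserving_metadata(src_path, dst_path):
    """Copy src_path to dst_path with metadata; return True on success."""
    try:
        # Make sure the destination directory exists before copying.
        os.makedirs(os.path.dirname(dst_path) or ".", exist_ok=True)
        shutil.copy2(src_path, dst_path)
    except FileNotFoundError:
        logging.error("Source file not found: %s", src_path)
        return False
    except PermissionError:
        logging.error("Permission denied: %s -> %s", src_path, dst_path)
        return False
    return True
```

Returning a boolean lets the caller count failures and keep going, rather than aborting the whole export on the first bad file.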

Link Resolution Errors

def validate_link(self, link_path):
    if not os.path.exists(link_path):
        logging.warning(f"Broken link detected: {link_path}")
    return link_path
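
Note that `os.path.exists` on a relative link checks against the current working directory, not against the file that contains the link. A variant that resolves the target from the source file's directory might look like the sketch below; `is_link_resolvable` is a hypothetical name, and ignoring the `#fragment` part is an assumption about how anchors should be treated.

```python
import logging
import os

def is_link_resolvable(source_file, link_target):
    """Return True if link_target, as written in source_file, points at an existing path."""
    # Drop any "#fragment"; a pure in-page anchor has no path to check.
    path_part = link_target.split('#', 1)[0]
    if not path_part:
        return True
    base_dir = os.path.dirname(source_file)
    resolved = os.path.normpath(os.path.join(base_dir, path_part))
    if not os.path.exists(resolved):
        logging.warning("Broken link detected in %s: %s", source_file, link_target)
        return False
    return True
```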

Configuration

class Config:
    def __init__(self):
        self.input_dir = "data"
        self.output_dir = "output/hierarchical"
        self.preserve_timestamps = True
        self.logging_level = logging.INFO
        self.validate_links = True

Logging

Using export_watcher.py logging format:

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('hierarchical_export.log'),
        logging.StreamHandler()
    ]
)

Usage

def main():
    try:
        processor = HierarchicalProcessor()
        processor.process_directory_structure()
        processor.process_files()
        processor.validate_output()
        logging.info("Hierarchical export completed successfully")
    except Exception as e:
        logging.error(f"Export failed: {e}")
        return 1
    return 0

Future Enhancements

Performance Optimizations
- Parallel directory processing
- Batch file operations
- Link cache management

Additional Features
- Generate structure visualization
- Export structure report
- Link validation report
- Broken link detection

Integration Options
- ZIP file support
- Multiple export format support
- Custom naming rules
- Structure templates

This specification maintains compatibility with the existing export_watcher.py functionality while adding hierarchical structure preservation and enhanced link management.

RichardHightower / notion_extractor

Extract using original path hierarchy #3