Links generated outside/above of the current location are broken

pitrson commented 3 months ago

Hi,

I'm not sure whether this is a feature request or bug, but I have noticed that the links are generated properly only if traversing sub-directories/documents.

This is my test scenario:

docs/
    index.md
    test1/index1.md
    test2/test2_2/index2.md

All of the md docs have the index_test tag set. When I try to generate the same set of links in each of them using {pagelist 100 index_test }, this is the result:

index.md: links are functional, generated links are:
- http://localhost:8000/test1/index1/
- http://localhost:8000/test2/test2_2/index2/
- http://localhost:8000/ (although I'm not sure why this link is generated)
These do work, but mkdocs complains and performs redirects: WARNING - [19:22:08] "GET /test2/test2_2/index2 HTTP/1.1" code 302
test1/index1.md:
- http://localhost:8000/test1/ (broken, should be http://localhost:8000/)
- http://localhost:8000/test1/index1/ (works)
- http://localhost:8000/test1/test2/test2_2/index2 (broken, should be http://localhost:8000/test2/test2_2/index2 )
test2/test2_2/index2.md:
- http://localhost:8000/test2/ (broken, should be http://localhost:8000/)
- http://localhost:8000/test2/test1/index1 (broken, should be http://localhost:8000/test1/index1)
- http://localhost:8000/test2/test2_2/index2/ (works)

So I assume pagelist expects the linked docs to be in subdirectories of the current page? IMHO it should be able to generate links to any document regardless of the documentation structure and the document's location. E.g. we often reference docs from different sections/locations, which I think is quite common.
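A minimal sketch of why a link can work on one page and break on another: a browser resolves a relative href against the URL of the page it appears on, so the same href points to different places at different depths. The href below is hypothetical (it is not taken from the plugin's output), and resolve is an illustrative helper name.

```python
from urllib.parse import urljoin

def resolve(page_url, href):
    """Resolve href the way a browser would when it appears on page_url."""
    return urljoin(page_url, href)

# Hypothetical href, written as if every page lived at the site root.
href = "test2/test2_2/index2/"

for page_url in (
    "http://localhost:8000/",
    "http://localhost:8000/test1/index1/",
):
    print(page_url, "->", resolve(page_url, href))

# From the nested page, the href has to climb back up to the site root first.
print(resolve("http://localhost:8000/test1/index1/", "../../" + href))
# -> http://localhost:8000/test2/test2_2/index2/
```

The href only resolves correctly from the root page; from test1/index1/ it needs the "../../" prefix, which is why the link has to be computed per page.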

Would this be possible to fix? Thanks!

alanpt commented 3 months ago

Yeah, this was doing me head in the other day. I would get it right if I was in a folder at x deep and then it would break if I was further up or down from that level.

The version before worked at http://localhost:8000/level1/. This current one works at http://localhost:8000/level1/level2. As you say above.

I'm open to some ideas. I could work on this in a few weeks.

pitrson commented 1 month ago

I was able to get this working with my primitive script (I'm not really experienced with Python ^^) based on the os.path.relpath problem described on Stack Overflow. I'll try to compose a pull request, but if I don't manage to, I'll share the logic so that you can hopefully implement it. Debug output showing the files and their tags:

defaultdict(<class 'list'>, {PosixPath('docs/index.md'): ['index_test'], PosixPath('docs/test2/test2_2/index2.md'): ['index_test', 'group_2'], PosixPath('docs/test1/index1.md'): ['index_test', 'group_1'], PosixPath('docs/test1/index2.md'): ['index_test', 'group_1']})

and the generated links (I've implemented a check to exclude link generation for 'self', so that the page the links are generated on doesn't include a link to itself, as that would be useless):

Generating links on page docs/index.md with pagelist arguments ['index_test', 'group_1']
Creating link for matched file located at docs/test1/index1.md
Link is test1/index1.md
Creating link for matched file located at docs/test1/index2.md
Link is test1/index2.md

Generating links on page docs/test2/test2_2/index2.md with pagelist arguments ['index_test']
Creating link for matched file located at docs/index.md
Link is ../../index.md
Creating link for matched file located at docs/test1/index1.md
Link is ../../test1/index1.md
Creating link for matched file located at docs/test1/index2.md
Link is ../../test1/index2.md

Generating links on page docs/test1/index1.md with pagelist arguments ['index_test']
Creating link for matched file located at docs/index.md
Link is ../index.md
Creating link for matched file located at docs/test2/test2_2/index2.md
Link is ../test2/test2_2/index2.md
Creating link for matched file located at docs/test1/index2.md
Link is ./index2.md
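The relpath problem mentioned above can be sketched without touching the filesystem: relpath treats its start argument as a directory, so passing the origin file itself adds one "../" too many. This is a simplified stand-in for the script's approach, not the script itself; link_between is a hypothetical helper name, and posixpath is used so that the separators are always "/".

```python
import posixpath

def link_between(origin_file, dest_file):
    """Relative link from one file to another, both given as '/'-separated paths."""
    # relpath treats its start argument as a directory, so pass the
    # origin's parent directory rather than the origin file itself.
    return posixpath.relpath(dest_file, posixpath.dirname(origin_file))

print(link_between("docs/test2/test2_2/index2.md", "docs/index.md"))
# -> ../../index.md
print(link_between("docs/test1/index1.md", "docs/test2/test2_2/index2.md"))
# -> ../test2/test2_2/index2.md

# Passing the file itself as the start yields one level too many:
print(posixpath.relpath("docs/index.md", "docs/test1/index1.md"))
# -> ../../index.md (the correct link would be ../index.md)
```

For two files in the same directory this returns "index2.md" rather than "./index2.md"; both resolve to the same target.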

alanpt commented 1 month ago

Cool thanks @pitrson. If you could do a pull request that would be great. Or share the logic you used. Thanks

pitrson commented 1 month ago

Hey @alanpt

I'm sharing my primitive script. Most of it is probably not useful for you, but I hope you can implement the important part, which is the realrelpath function and the for loop at the end. I may submit a PR in the upcoming weeks if I find some time (or if you're not faster ^^). Just adjust docsdir to point to your mkdocs docs location to test.

PS. I'm not a python developer.



from collections import defaultdict
import frontmatter
import re
import os

# get all md files
md_files = [ ]
from pathlib import Path
docsdir = 'docs'
pages_tags = defaultdict(list)
pages_pagelist_args = defaultdict(list)

for p in Path( docsdir ).rglob( '*.md' ):
    md_files.append(p)
    data = frontmatter.load(p)
    # populate dict with page[tags]; pages without 'tags' front matter are skipped
    for tag in data.get('tags') or []:
      pages_tags[p].append(tag)

print(pages_tags)     

# get files with pagelist
pagelist_files = [ ]
for md in md_files:
    # plain read instead of mmap, which raises ValueError on empty files
    if b'pagelist' in md.read_bytes():
        pagelist_files.append(md)

#get pagelist arguments
for pgfile in pagelist_files:
    # use a context manager so the file handle is closed after reading
    with open(pgfile, 'r', encoding='utf-8') as f:
        pgargs = re.findall(r'\{(pagelist.*?)\}', f.read())
    pgargs = ','.join(pgargs)
    pgargs = list(pgargs.split(" "))
    pgargs.remove('pagelist')
    pgargs = list(filter(None, pgargs))
    pglimit = ([x for x in pgargs if str(x).isdigit()])
    pgtags = ([x for x in pgargs if not str(x).isdigit()])

    for tag in pgtags:
      pages_pagelist_args[pgfile].append(tag)

#https://stackoverflow.com/questions/17506552/python-os-path-relpath-behavior
def realrelpath(origin, dest): 
    '''Get the relative path between two paths, accounting for filepaths'''

    # get the absolute paths so that strings can be compared
    origin = os.path.abspath(origin) 
    dest = os.path.abspath(dest) 

    # find out if the origin and destination are filepaths
    origin_isfile = os.path.isfile(origin)
    dest_isfile = os.path.isfile(dest)

    # if dealing with filepaths, 
    if origin_isfile or dest_isfile:
        # get the base filename
        # changed to always use dest (as opposed to the Stack Overflow post)
        filename = os.path.basename(dest)
        # in cases where we're dealing with a file, use only the directory name
        origin = os.path.dirname(origin) if origin_isfile else origin
        dest = os.path.dirname(dest) if dest_isfile else dest 
        # get the relative path between directories, then re-add the filename
        return os.path.join(os.path.relpath(dest, origin), filename)  
    else:
        # if not dealing with any filepaths, just run relpath as usual
        return os.path.relpath(dest, origin)   

# match selected tags only
for page in pages_pagelist_args:
   print('Generating links on page', page, 'with pagelist arguments', pages_pagelist_args[page])
   for mdfile in pages_tags:
      # exclude myself
      if mdfile != page:
        if set(pages_pagelist_args[page]).issubset(pages_tags[mdfile]):
            print('Creating link for matched file located at', mdfile)
#            relative_path = os.path.relpath(mdfile, page)
            relative_path = realrelpath(page, mdfile)
            print('Link is', relative_path)
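The argument parsing in the script joins every match in a file into one string, which looks like it would mix up the arguments when a page contains more than one {pagelist ...} command. A minimal sketch that handles each command separately, assuming the simple "optional numeric limit plus tags" form used in this thread (parse_pagelist and PAGELIST_RE are hypothetical names, and the plugin's g, i and | options are not handled):

```python
import re

# One match per command; the argument string is optional.
PAGELIST_RE = re.compile(r"\{pagelist(?:\s+([^}]*))?\}")

def parse_pagelist(text):
    """Return one (limit, tags) tuple per {pagelist ...} command found in text."""
    commands = []
    for match in PAGELIST_RE.finditer(text):
        args = (match.group(1) or "").split()
        # Numeric arguments are treated as the limit, everything else as a tag.
        limits = [int(arg) for arg in args if arg.isdecimal()]
        tags = [arg for arg in args if not arg.isdecimal()]
        commands.append((limits[0] if limits else None, tags))
    return commands

print(parse_pagelist("intro {pagelist 100 index_test } more {pagelist group_1}"))
# -> [(100, ['index_test']), (None, ['group_1'])]
```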

alanpt commented 1 month ago

Thanks. Here it is integrated, but I don't have time to test it right now.

import re
import os
from mkdocs.plugins import BasePlugin
from urllib.parse import urlsplit
from pathlib import Path

class PageListPlugin(BasePlugin):
    """
    A MkDocs plugin to generate dynamic lists of pages based on `{pagelist}` commands in markdown files.
    It supports grouping by folder, filtering by tags, and limiting the number of links.
    """

    def __init__(self):
        self.page_list_info = []

    def on_nav(self, nav, config, files):
        self.nav = nav
        self.files = files

        for file in files:
            self._gather_page_list_info(file)

    def _gather_page_list_info(self, file):
        try:
            with open(file.abs_src_path, 'r', encoding='utf-8') as f:
                content = f.read()
        except UnicodeDecodeError:
            try:
                with open(file.abs_src_path, 'r', encoding='latin-1') as f:
                    content = f.read()
            except Exception as e:
                print(f"Error reading file {file.abs_src_path}: {e}")
                return

        for match in re.finditer(r'\{pagelist(?:\s+(\d+|g|i)\s*(.*?))?(?:\|\s*(.*))?\}', content):
            page_list_code = match.group(0)
            page_url = file.url
            self.page_list_info.append({'page_url': page_url, 'page_list_code': page_list_code})

    def on_post_page(self, output, page, config):
        matches = re.finditer(r'\{pagelist(?:\s+(\d+|g|i)\s*(.*?))?(?:\|\s*(.*))?\}', output)

        for match in matches:
            if match.group(1) == 'i':
                page_list_output = self.generate_page_list_info_output(self.page_list_info, page)
                output = output.replace(match.group(0), page_list_output, 1)
            else:
                group_folders = match.group(1) == 'g'
                tags_to_filter = match.group(2).strip().split() if match.group(2) else page.meta.get('tags', [])
                limit = int(match.group(1)) if match.group(1) and match.group(1).isdigit() else None
                folders_to_filter = match.group(3).strip().split() if match.group(3) else []

                filtered_list = self._format_links_by_folder_and_tag(tags_to_filter, page, config, group_folders, limit, folders_to_filter)
                output = output.replace(match.group(0), filtered_list, 1)

        return output

    def generate_page_list_info_output(self, page_list_info, current_page):
        output = '<ol class="page-list-info">'
        for info in page_list_info:
            relative_path = self.realrelpath(current_page.url, info['page_url'])
            output += f"<li><a href='{relative_path}'>{info['page_url']}</a> - {info['page_list_code']}</li>"
        output += '</ol>'
        return output

    def _format_links_by_folder_and_tag(self, tags_to_filter, current_page, config, group_folders, limit, folders_to_filter):
        folder_groups = {}

        # Normalize the folders_to_filter list
        normalized_folders_to_filter = [folder.lower() for folder in folders_to_filter]

        for file in self.files:
            if file.page is not None and self._page_has_tags(file.page, tags_to_filter):
                folder_name = self._extract_folder_name(file.page.url).lower()

                # Check if the folder name matches any of the specified folders to filter
                if folders_to_filter and folder_name not in normalized_folders_to_filter:
                    continue  # Skip this page if its folder is not in the folders_to_filter list

                if folder_name not in folder_groups:
                    folder_groups[folder_name] = []
                folder_groups[folder_name].append(file.page)

        result = '<div class="pagelist">'
        item_count = 0  # Initialize item count

        for folder, pages in folder_groups.items():
            if group_folders:
                result += f'<h3 class="pagelistheading">{folder.capitalize()}</h3>\n'
            result += '<ul class="pagelistlist">\n'
            for page in pages:
                if limit is not None and item_count >= limit:
                    break  # Stop adding links once the limit is reached
                relative_path = self.realrelpath(current_page.url, page.url)
                result += f'<li><a href="{relative_path}">{page.title}</a></li>\n'
                item_count += 1
            result += '</ul>\n'
            if limit is not None and item_count >= limit:
                break  # Break the outer loop as well if the limit is reached

        result += '</div>'

        return result

    def _page_has_tags(self, page, tags_to_filter):
        if not tags_to_filter:
            return False  # Return False if no tags to filter

        page_tags = set(page.meta.get('tags', []))
        any_tags = {tag for tag in tags_to_filter if not tag.startswith('+') and not tag.startswith('-')}
        all_tags = {tag.lstrip('+') for tag in tags_to_filter if tag.startswith('+')}
        exclude_tags = {tag.lstrip('-') for tag in tags_to_filter if tag.startswith('-')}

        any_match = any(tag in page_tags for tag in any_tags) if any_tags else True
        all_match = all(tag in page_tags for tag in all_tags)
        exclude_match = not any(tag in page_tags for tag in exclude_tags)

        return any_match and all_match and exclude_match

    def _extract_folder_name(self, url):
        path_parts = Path(urlsplit(url).path).parts
        relevant_parts = path_parts[:-1]
        folder_title = ' '.join(part.capitalize() for part in relevant_parts)
        return folder_title

    # realrelpath, copied from the script above
    def realrelpath(self, origin, dest):
        '''Get the relative path between two paths, accounting for filepaths'''

        # get the absolute paths so that strings can be compared
        origin = os.path.abspath(origin) 
        dest = os.path.abspath(dest) 

        # find out if the origin and destination are filepaths
        origin_isfile = os.path.isfile(origin)
        dest_isfile = os.path.isfile(dest)

        # if dealing with filepaths, 
        if origin_isfile or dest_isfile:
            # get the base filename
            filename = os.path.basename(dest)
            # in cases where we're dealing with a file, use only the directory name
            origin = os.path.dirname(origin) if origin_isfile else origin
            dest = os.path.dirname(dest) if dest_isfile else dest 
            # get the relative path between directories, then re-add the filename
            return os.path.join(os.path.relpath(dest, origin), filename)  
        else:
            # if not dealing with any filepaths, just run relpath as usual
            return os.path.relpath(dest, origin)

    def on_files(self, files, config):
        self.files = files
        return files

pitrson commented 1 month ago

Thanks! I have tested this and all of the originally described testcases now generate proper links!

One question: why is it not generating links directly to the matching .md file? It only generates a link to the parent directory, which is IMHO wrong, since you may have multiple docs in the directory. It works in my simple test env, since I have only a single md file per directory and mkdocs automatically performs redirects:

WARNING -  [19:53:16] "GET /test1/index1 HTTP/1.1" code 302
WARNING -  [19:53:19] "GET /test1/index2 HTTP/1.1" code 302
INFO    -  [19:53:20] Browser connected: http://localhost:8000/test1/index2/
INFO    -  [19:53:24] Browser connected: http://localhost:8000/test2/test2_2/index2/
WARNING -  [19:55:38] "GET /test1/index2 HTTP/1.1" code 302
INFO    -  [19:55:40] Browser connected: http://localhost:8000/test1/index2/

E.g. instead of http://localhost:8000/test1/index2/doc.md it only generates the link http://localhost:8000/test1/index2/, but this was already the case before you implemented the fix from your last post.
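A possible explanation for the directory-style links, offered as an assumption rather than a confirmed diagnosis: with MkDocs' default use_directory_urls setting, docs/test1/index2.md is built as test1/index2/index.html, so http://localhost:8000/test1/index2/ is the page itself rather than a parent directory. The integrated realrelpath is called with page URLs instead of file paths, so os.path.isfile is presumably False for both arguments and the plain os.path.relpath branch runs. That branch drops the trailing slash, which would explain the 302 redirects in the log. A minimal sketch of a URL-based helper that keeps the slash (relative_href is a hypothetical name, not part of the plugin):

```python
import posixpath

def relative_href(current_url, target_url):
    """Relative link between two site-relative directory URLs such as 'test1/index1/'."""
    # Anchoring both URLs at "/" keeps relpath independent of the working directory.
    start = "/" + current_url.strip("/")
    target = "/" + target_url.strip("/")
    rel = posixpath.relpath(target, start)
    # Keep the trailing slash so the server does not have to redirect.
    return rel + "/"

print(relative_href("test1/index1/", ""))
# -> ../../
print(relative_href("test1/index1/", "test2/test2_2/index2/"))
# -> ../../test2/test2_2/index2/
print(relative_href("", "test1/index1/"))
# -> test1/index1/
```

MkDocs also ships mkdocs.utils.get_relative_url, which may be a better fit inside the plugin; it is not used here so that the sketch runs with the standard library only.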

alanpt / mkdocs-pagelist-plugin

Links generated outside/above of the current location are broken #3