edgi-govdata-archiving / wayback

A Python API to the Internet Archive Wayback Machine
https://wayback.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Add info about media type to `Memento` #56

Open Mr0grog opened 3 years ago

Mr0grog commented 3 years ago

In EDGI’s web monitoring tools, we often look at the media type (also often referred to as MIME type or content type) of a Memento. One example is that we need to know how to parse the body in order to extract a title (you’d do very different things for HTML vs. PDF, for example). It might be nice to expose some sort of media type information on the Memento class.
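
As a rough illustration of the title-extraction case, here is a minimal sketch that picks a strategy based on the media type of a `Content-Type` header value. It uses only the standard library and is not part of `wayback`; the names `media_type_of`, `extract_title`, and `_TitleParser` are hypothetical, and PDF handling is deliberately left out because it needs a format-specific library.

```python
from email.message import Message
from html.parser import HTMLParser


def media_type_of(content_type_header):
    """Return the bare, lower-cased media type from a Content-Type value."""
    message = Message()
    message['Content-Type'] = content_type_header
    # get_content_type() drops parameters such as charset.
    return message.get_content_type()


class _TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(content_type_header, body_text):
    """Hypothetical helper: choose a title strategy by media type."""
    media = media_type_of(content_type_header)
    if media in ('text/html', 'application/xhtml+xml'):
        parser = _TitleParser()
        parser.feed(body_text)
        return parser.title.strip() or None
    # PDF and other formats need their own tooling; not handled in this sketch.
    return None
```

Even this small case shows why just the `type/subtype` pair would already be useful on a `Memento`, separate from any parameters.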

We originally planned to do this in #2, but it wasn’t critical and there were enough open questions and options that it seemed worth waiting on coming up with a better design for:

Mr0grog commented 3 years ago

First-pass implementation of the complex type I had rigged up when first working on #2, before realizing all the other questions here. It may be useful in the future:

import re

ESCAPE_OR_QUOTE = re.compile(r'\\(.)|"')

class MediaType:
    """
    Represents a media type, like ``text/html``.

    For more information about media types, see `RFC 2045`_ and `RFC 2046`_.

    .. _RFC 2045: https://tools.ietf.org/html/rfc2045
    .. _RFC 2046: https://tools.ietf.org/html/rfc2046

    Attributes
    ----------
    type : str
        The top-level type, e.g. ``'text'`` in ``'text/html; charset=utf8'``.
    subtype : str
        The subtype, e.g. ``'html'`` in ``'text/html; charset=utf8'``.
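    parameters : dict
        All the parameters that were specified, e.g. ``{'charset': 'utf8'}``
        in ``'text/html; charset=utf8'``.
    media : str
        The type and subtype joined together, e.g. ``'text/html'``.
    parameter_string : str
        The parameters serialized as ``key=value`` pairs joined by ``'; '``.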
    """
    type = ''
    subtype = ''
    parameters = None

    def __init__(self, type, subtype, parameters=None):
        if not type or not subtype:
            raise ValueError('Type and subtype must be non-empty strings')

        self.type = type
        self.subtype = subtype
        self.parameters = parameters or {}

    @property
    def media(self):
        return f'{self.type}/{self.subtype}'

    @property
    def parameter_string(self):
        # FIXME: parameter values need to be quoted if they contain special
        # characters. See https://tools.ietf.org/html/rfc2045#section-5.1
        # Iterate over items(); iterating the dict itself yields only keys.
        return '; '.join(f'{key}={value}' for key, value in self.parameters.items())

    def __str__(self):
        if self.parameters:
            return f'{self.media}; {self.parameter_string}'
        else:
            return self.media

    @classmethod
    def parse(cls, text, strict=True):
        """
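        Build a :class:`wayback.MediaType` instance from a media type string,
        like ``'text/html; charset=utf8'``.

        Parameters
        ----------
        text : str
            The media type string to parse.
        strict : bool, default: True
            If true, raise ``ValueError`` when the type has no ``/`` separator
            or a quoted parameter value is never closed. If false, fall back
            to ``application/octet-stream`` when there is no ``/`` and read an
            unclosed quoted value through to the end of the string.

        Returns
        -------
        wayback.MediaType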
        """
        main, _, parameter_text = text.partition(';')

        types = [item.strip().lower() for item in main.split('/', 1)]
        if len(types) != 2:
            if strict:
                raise ValueError(f'Malformed media type: "{text}"')
            else:
                types = ['application', 'octet-stream']

        parameters = {}
        to_parse = parameter_text
        while to_parse:
            name, _, to_parse = to_parse.partition('=')
            name = name.strip().lower()
            to_parse = to_parse.lstrip()
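            # Use startswith() so an empty remainder (e.g. a parameter name
            # with no ``=value``) doesn't raise IndexError.
            if to_parse.startswith('"'):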
                value = ''
                position = 1
                while True:
                    match = ESCAPE_OR_QUOTE.search(to_parse, position)
                    if match:
                        if match.group(0) == '"':
                            value += to_parse[position:match.start()]
                            position = match.end()
                            break
                        else:
                            value += to_parse[position:match.start()] + match.group(1)
                            position = match.end()
                    elif strict:
                        raise ValueError(f'Media parameter "{name}" has no end')
                    else:
                        value += to_parse[position:]
                        position = len(to_parse)
                        break

                _, _, to_parse = to_parse[position:].partition(';')
            else:
                value, _, to_parse = to_parse.partition(';')
                value = value.strip()

            parameters[name] = value

        return cls(types[0], types[1], parameters)

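# Example: a quoted parameter value containing escaped quotes, an embedded
# semicolon, and an escaped backslash, followed by an unquoted parameter.
media = MediaType.parse('text/html; mad=" oh yeah \\"unbalanced embedded string; crazy end\\\\"; another-thing = yeah')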
print(f'Type:    "{media.type}"')
print(f'Subtype: "{media.subtype}"')
print('Parameters:')
for name, value in media.parameters.items():
    print(f'  |{name}| = |{value}|')
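
For the FIXME in `parameter_string` above, one possible approach is sketched below. It is only a sketch: `quote_parameter_value` and `format_parameters` are hypothetical helpers, and the token pattern uses the HTTP `tchar` set, which is slightly stricter than the RFC 2045 token rule, so a few values may be quoted when they strictly don't need to be.

```python
import re

# Characters allowed in an unquoted token (HTTP "tchar" set). Anything else,
# including an empty value, gets wrapped in a quoted string.
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def quote_parameter_value(value):
    """Return `value` as-is if it is a plain token, else as a quoted string."""
    if TOKEN.fullmatch(value):
        return value
    # Escape backslashes first so the escapes added for quotes aren't doubled.
    escaped = value.replace('\\', '\\\\').replace('"', '\\"')
    return f'"{escaped}"'


def format_parameters(parameters):
    """Serialize a dict of parameters as ``key=value`` pairs joined by '; '."""
    return '; '.join(
        f'{name}={quote_parameter_value(str(value))}'
        for name, value in parameters.items()
    )
```

With something like this, a value that `parse()` unescaped from a quoted string should be quoted and escaped again when the media type is turned back into a string.
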
Mr0grog commented 3 years ago

(Also: I now know more than I knew there was to know about the syntax rules for HTTP headers and for Media Types.)