gcerretani / antenati

Tools to download data from Portale Antenati
MIT License

Add archive number to directory name #14

Closed jbellanca closed 1 year ago

jbellanca commented 2 years ago

Proposed changes to add the archive number to the end of the directory name. At least for all the Antenati archives I've used, it's useful to know the archive number for later reference, and it's unique to the archive book itself, whereas the ID isn't.

gcerretani commented 2 years ago

Thanks a lot jbellanca, this fix has been on my todo list for weeks.

Actually, archive ID detection has been broken since the URL format changed in March. It uses the first number found in the URL. That used to be a unique ID for each archive, but now it returns 12657 for every archive in the Portale Antenati. The only thing that needs to be done is to rewrite __get_archive_id; it is not necessary to create a new ID.

Consider replacing the content of the current __get_archive_id with that of your new __get_archive_number. You should also add a check on the length of the result, which must be at least 2. It should become something like:

@staticmethod
def __get_archive_id(url):
    """Get numeric archive ID from the URL"""
    archive_id_pattern = findall(r'(\d+)', url)
    if not archive_id_pattern or len(archive_id_pattern) < 2:
        raise RuntimeError(f'Cannot get archive ID from {url}')
    return archive_id_pattern[1]

Also, prefer a single-line import for names from the same module:

from re import findall, search
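
To illustrate the check suggested above, here is a minimal standalone sketch. The function name get_archive_id (a module-level stand-in for the private static method) and the example URLs are hypothetical; they are not the real Portale Antenati URL format, only strings that contain the shared first number followed by a second one.

```python
from re import findall


def get_archive_id(url):
    """Return the second number found in the URL as the archive ID."""
    numbers = findall(r'(\d+)', url)
    # The first number is shared by all archives, so at least two are needed
    if len(numbers) < 2:
        raise RuntimeError(f'Cannot get archive ID from {url}')
    return numbers[1]


# Hypothetical URLs, for illustration only
good_url = 'https://example.org/12657/345678/manifest'
bad_url = 'https://example.org/12657/manifest'

print(get_archive_id(good_url))

try:
    get_archive_id(bad_url)
except RuntimeError as err:
    print(err)
```

With this shape, a URL that contains only the shared first number raises an error instead of silently returning the same ID for every archive.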