freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License
378 stars 111 forks source link

feat(coloctapp): implement cleanup_content #1170

Closed grossir closed 2 months ago

grossir commented 2 months ago

Helps solve:

Implement cleanup_content, only for coloctapp:

grossir commented 2 months ago

To test this you will need to copy paste

import hashlib

def sha1(s):
    """Return the sha1sum of a string.

    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    ! This algorithm is obsolete for most purposes. Its !
    ! usage is discouraged. Please use SHA256 instead.  !
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    :param s: The data to hash. Ideally bytes, but if unicode is passed in, it
    will convert it to bytes first.
    :return: a hexadecimal SHA1 hash of the data
    """
    if isinstance(s, str):
        s = s.encode()
    sha1sum = hashlib.sha1()
    sha1sum.update(s)
    return sha1sum.hexdigest()

And then call it where appropiate in the sample_caller


        # cleanup_content is called before the extraction task in CL
        # so it is only useful for cleaning HTML files
        data = site.cleanup_content(data)
        logger.info(sha1(data))

Perhaps we should add this to the sample_caller...