centre-for-humanities-computing / danish-foundation-models

A project for training foundational Danish language models
https://foundationmodels.dk
MIT License
68 stars 4 forks

Infomedia cleaning edge cases #255

Closed · jankounchained closed this issue 3 months ago

jankounchained commented 3 months ago

If you want to be extra careful, you can control for two edge cases:

so something like this:

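import re

def remove_html_tags(text: str) -> str:
    """Remove HTML tags from a string."""
    # Non-greedy match of anything between "<" and ">"; re.MULTILINE only
    # affects ^ and $, so it is a no-op for this pattern.
    html_tag_pattern = re.compile(r"<.*?>", flags=re.MULTILINE)
    # Replace each tag with a space rather than deleting it.
    clean_text = re.sub(html_tag_pattern, " ", text)
    return clean_text

def remove_whitespace(text: str) -> str:
    """Remove excess whitespace from text fields."""
    # Collapse any run of two or more whitespace characters into one space.
    pat_ws = re.compile(pattern=r"\s\s+", flags=re.MULTILINE)
    clean_text = re.sub(pat_ws, " ", text)
    return clean_text
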
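
A minimal sketch of how the two steps could be chained, tags first and whitespace second. The name `clean_text`, the sample string, and the final `.strip()` are assumptions for illustration and not part of the repository; the patterns mirror the ones above, with `re.MULTILINE` left out because neither pattern uses `^` or `$`.

```python
import re

# Hypothetical module-level patterns, mirroring the two functions above.
HTML_TAG = re.compile(r"<.*?>")
EXCESS_WS = re.compile(r"\s\s+")

def clean_text(text: str) -> str:
    """Strip HTML tags, then collapse the whitespace left behind."""
    # Tags become spaces first, so the whitespace pass can tidy up afterwards.
    no_tags = HTML_TAG.sub(" ", text)
    # The trailing .strip() is an addition for this sketch, not in the original.
    return EXCESS_WS.sub(" ", no_tags).strip()

sample = "<p>Nyheder  fra</p>\n\n<br/>Danmark"
print(clean_text(sample))  # prints "Nyheder fra Danmark"
```

Running the tag removal before the whitespace pass matters here: each removed tag leaves a space behind, and the second pass is what collapses those.
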
KennethEnevoldsen commented 3 months ago

@jankounchained do you plan to fix this, or should @TTTTao725?

TTTTao725 commented 3 months ago

Sure, no problem! I'll work on it tomorrow. I'm volunteering for a user test today, and tomorrow I might be wearing a brace 😹

KennethEnevoldsen commented 3 months ago

@TTTTao725 is this issue resolved?

TTTTao725 commented 3 months ago

yes, please check out this PR: https://github.com/centre-for-humanities-computing/danish-foundation-models/pull/257