code-kern-ai / bricks

Open-source natural language enrichments at your fingertips.
Apache License 2.0

[MODULE] - Price extractor #37

Open jhoetter opened 1 year ago

jhoetter commented 1 year ago

Description

I want to find prices in texts, e.g. "This notebook costs 2$". This module will output "2$". We use spaCy's named entity recognition to find entities that are labelled as MONEY.

Implementation

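from typing import Any, Dict, Iterator, Tuple
import spacy

# Load the model once at import time; loading it on every call is slow.
nlp = spacy.load("en_core_web_lg")

def price_extractor(record: Dict[str, Any]) -> Iterator[Tuple[str, int, int]]:
    """Yield ("price", start, end) for every MONEY entity in record["text"]."""
    doc = nlp(record["text"])

    for entity in doc.ents:
        if entity.label_ == "MONEY":
            # start and end are spaCy token indices, not character offsets
            yield "price", entity.start, entity.end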

Additional information

jhoetter commented 1 year ago

I was building an extractor recently, and this brick can definitely be extended.

E.g., see this code:

import re
import json

def extract_currency_values(text):
    # Regex patterns for currency symbols and ISO codes
    currency_symbols = {
        # The lookbehind keeps "C$" from also being counted as USD
        "usd": r"(?<!C)\$[ ]*([\d,]+\.?\d*)",
        "eur": r"€[ ]*([\d,]+\.?\d*)",
        "cad": r"C\$[ ]*([\d,]+\.?\d*)",
    }

    currency_codes = {
        "usd": r"USD[ ]*([\d,]+\.?\d*)",
        "eur": r"EUR[ ]*([\d,]+\.?\d*)",
        "cad": r"CAD[ ]*([\d,]+\.?\d*)",
    }

    results = []

    # Check symbols first, then ISO codes; both use the same matching logic
    for patterns in (currency_symbols, currency_codes):
        for currency, pattern in patterns.items():
            for match in re.findall(pattern, text, re.IGNORECASE):
                results.append({
                    "currency": currency.upper(),
                    # Drop thousands separators, e.g. "100,000" -> "100000"
                    "value": match.replace(",", "")
                })

    return json.dumps(results, indent=4)

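# Testing the function. Note that "45€" is not matched, because every
# pattern expects the symbol or code before the amount.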
print(extract_currency_values("USD 100,000, CAD100000, EUR100.000, 45€"))

In this case, I also grab the USD, CAD, and EUR values. @LeonardPuettmannKern
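One possible extension, as a rough sketch only: the patterns above expect the symbol or code before the amount, so inputs like "45€" or the "2$" from the issue description are skipped. The version below also accepts a trailing symbol and reports character offsets for each hit. The names `AMOUNT`, `PREFIXES`, `SUFFIXES`, and `extract_currency_spans` are made up for this illustration and are not part of bricks. The offsets are character positions, not the spaCy token indices that `price_extractor` yields.

```python
import re
from typing import Dict, List, Union

# Hypothetical names for illustration; none of this is part of bricks.
# An amount: digits, optional ",ddd" groups, optional decimal part.
AMOUNT = r"\d+(?:,\d{3})*(?:\.\d+)?"

# Markers that may appear before the amount (symbol or ISO code).
PREFIXES = {
    "USD": r"(?<!C)\$|USD",  # lookbehind avoids counting "C$" as USD
    "EUR": r"€|EUR",
    "CAD": r"C\$|CAD",
}

# Symbols that may appear after the amount, e.g. "45€" or "2$".
SUFFIXES = {"USD": r"\$", "EUR": r"€"}


def extract_currency_spans(text: str) -> List[Dict[str, Union[str, int]]]:
    """Return currency, value, and character span for each match."""
    results = []
    rules = [(c, rf"(?:{m})[ ]*({AMOUNT})") for c, m in PREFIXES.items()]
    rules += [(c, rf"({AMOUNT})[ ]*(?:{m})") for c, m in SUFFIXES.items()]

    for currency, pattern in rules:
        for m in re.finditer(pattern, text, re.IGNORECASE):
            results.append({
                "currency": currency,
                # Drop thousands separators; a "." is kept as written
                "value": m.group(1).replace(",", ""),
                "start": m.start(),
                "end": m.end(),
            })

    # Report matches in the order they appear in the text
    return sorted(results, key=lambda r: r["start"])


if __name__ == "__main__":
    sample = "USD 100,000, CAD100000, EUR100.000, 45€"
    for hit in extract_currency_spans(sample):
        print(hit, repr(sample[hit["start"]:hit["end"]]))
```

This sketch does not try to decide whether "100.000" is a European thousands grouping or a decimal number; the value is passed through as written, same as in the code above.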