Regular Expression Flag Placement Error in extract_text Method

vaishakhRaveendran commented 1 month ago

I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug

Error in the extract_text method of regex_based.py: "global flags not at the start". The regular expression pattern built in ragas_experimental/testset/extractors/regex_based.py has its global flags in the wrong position. The flags need to be moved to the start of the pattern or passed as separate arguments.

Ragas version: v0.1.10
Python version: 3.11.4

Code to Reproduce

from langchain_community.document_loaders import DirectoryLoader
from ragas_experimental.testset import SimpleTestGenerator
from langchain_community.document_loaders import TextLoader
import pandas as pd
import nest_asyncio
nest_asyncio.apply()

# Initialize the loader

loader = DirectoryLoader("./experimental_notebook")

try:
    # Load the documents
    docs = loader.load()
    print(f"Loaded {len(docs)} documents")

    # Initialize the generator
    generator = SimpleTestGenerator()
    print(" \n Run generator method \n")

    # Generate the test dataset
    testdataset = generator.generate(docs, test_size=10)
    print("TestDataset generated successfully")

    # Convert to pandas DataFrame using the to_pandas method
    df = testdataset.to_pandas()
    print("DataFrame created successfully")

    # Display the first few rows of the DataFrame
    print(df.head())

    # Optionally, save the DataFrame to a CSV file
    df.to_csv("testdataset.csv", index=False)
    print("CSV file saved successfully")

except AttributeError as e:
    print(f"AttributeError: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

Error trace

Expected behavior

The extractor should extract links, emails, and markdown headings and add them to the metadata.

Additional context

I believe the error is due to an issue with combining different rule-based extractions.

dosubot[bot] commented 1 month ago

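The error you're encountering, "global flags not at the start of expression," indicates that an inline global flag such as (?i) appears somewhere other than the beginning of the pattern. In Python, inline global flags must be placed at the very start of the pattern; otherwise, pass the flags as a separate argument to re.compile. Misplaced flags were deprecated in Python 3.6 and became an error in Python 3.11, which matches the Python 3.11.4 version you reported.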

Here's how you can fix the issue:

1. Place Flags at the Start of the Pattern: Ensure that any flags (like re.IGNORECASE, re.MULTILINE, etc.) are placed at the beginning of the regular expression pattern using the (?i), (?m), etc., syntax.
2. Pass Flags as Separate Arguments: Alternatively, you can pass the flags as separate arguments to the re.compile function.

Here's an example of both approaches:

Example 1: Flags at the Start of the Pattern

import re

pattern = r"(?i)your_pattern_here"
compiled_pattern = re.compile(pattern)

Example 2: Flags as Separate Arguments

import re

pattern = r"your_pattern_here"
compiled_pattern = re.compile(pattern, re.IGNORECASE)

To apply this to your extract_text method in regex_based.py, locate the part of the code where the regular expression is defined and ensure the flags are correctly positioned.

If you provide the specific regular expression pattern from your extract_text method, I can give more precise guidance on how to adjust it.

To continue talking to Dosu, mention @dosu.

vaishakhRaveendran commented 1 month ago

@dosu The global flags are placed in the middle when the merge_extractor() method tries to combine the extractors based on common attributes or on whether they are multiline or not.
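To make the failure mode concrete, here is a minimal sketch of what a naive merge does. The two patterns are hypothetical stand-ins, not the actual patterns in regex_based.py, and try_compile is a helper invented for this illustration. Each pattern is valid on its own, but joining them with "|" leaves the second inline flag in the middle of the expression.

```python
import re
import warnings

# Hypothetical per-extractor patterns (assumptions for illustration only),
# each carrying its own inline global flag at the start.
link_pattern = r"(?i)https?://\S+"
heading_pattern = r"(?m)^#+ .+$"

# Naive merge: the "(?m)" of the second pattern is no longer at the start.
merged = "|".join([link_pattern, heading_pattern])


def try_compile(pattern):
    """Return None if the pattern compiles cleanly, else the failure message."""
    with warnings.catch_warnings():
        # Python 3.6 to 3.10 only warn about misplaced flags; treat the
        # warning as a failure so the sketch behaves the same everywhere.
        warnings.simplefilter("error", DeprecationWarning)
        try:
            re.compile(pattern)
        except (re.error, DeprecationWarning) as exc:
            return str(exc)
    return None


print(try_compile(link_pattern))  # compiles: the flag is at the start
print(try_compile(merged))        # fails: a flag sits mid-expression
```

On Python 3.11 and later the second call reports a re.error; on earlier versions it reports the deprecation warning instead.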

dosubot[bot] commented 1 month ago

The issue with global flags being placed in the middle when the merge_extractor() method combines extractors can be resolved by ensuring that the flags are correctly positioned at the start of the pattern or passed as separate arguments during the combination process.

To address this, you can modify the merge_extractor() method to handle the flags appropriately. Here are the steps you can follow:

1. Extract Flags and Patterns Separately: When combining patterns, extract the flags and patterns separately.
2. Combine Patterns and Apply Flags: Combine the patterns and then apply the flags either at the start of the combined pattern or as separate arguments.

Here's a conceptual example of how you might adjust the merge_extractor() method:

Example Adjustment for `merge_extractor()`

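import re

# Inline letters for the flags this sketch can express as "(?...)"
INLINE_LETTERS = {re.IGNORECASE: "i", re.MULTILINE: "m", re.DOTALL: "s", re.VERBOSE: "x"}

def merge_extractor(patterns, flags, inline=False):
    # Assumes the individual patterns carry no inline flags of their own.
    # Group each alternative so "|" cannot change what a pattern matches.
    combined_pattern = "|".join(f"(?:{p})" for p in patterns)
    combined_flags = 0
    for flag in flags:
        combined_flags |= flag

    if inline:
        # Option 1: Apply flags at the start of the pattern, e.g. "(?im)"
        letters = "".join(l for f, l in INLINE_LETTERS.items() if combined_flags & f)
        prefix = f"(?{letters})" if letters else ""
        return re.compile(prefix + combined_pattern)

    # Option 2: Pass flags as separate arguments
    return re.compile(combined_pattern, combined_flags)

# Example usage
patterns = [r"pattern1", r"pattern2"]
flags = [re.IGNORECASE, re.MULTILINE]
compiled_pattern = merge_extractor(patterns, flags)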

In this example, patterns is a list of regular expression patterns, and flags is a list of flags. The method combines the patterns and applies the flags correctly.
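The "extract flags and patterns separately" step could look roughly like the sketch below. The helper name split_inline_flags is hypothetical, and the sketch assumes each pattern carries at most one leading inline group made of the letters i, m, s, or x; anything else is returned unchanged.

```python
import re

# Map inline flag letters to the corresponding re constants.
_FLAG_BY_LETTER = {
    "i": re.IGNORECASE,
    "m": re.MULTILINE,
    "s": re.DOTALL,
    "x": re.VERBOSE,
}

# A leading global inline-flag group such as "(?i)" or "(?im)".
_LEADING_FLAGS = re.compile(r"^\(\?([imsx]+)\)")


def split_inline_flags(pattern):
    """Return (flags, bare_pattern) for a pattern that may start with '(?im)'."""
    match = _LEADING_FLAGS.match(pattern)
    if match is None:
        return 0, pattern
    flags = 0
    for letter in match.group(1):
        flags |= _FLAG_BY_LETTER[letter]
    return flags, pattern[match.end():]


flags, bare = split_inline_flags(r"(?im)^title: .+$")
print(bare)
```

The resulting flags and bare patterns can then be passed to a merge function that takes them as separate arguments. Note that flags passed this way apply to every alternative in the merged pattern, not only to the pattern they came from.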

Ensure that the merge_extractor() method in your regex_based.py file follows a similar approach to handle the flags properly. This should resolve the issue of global flags being incorrectly positioned.
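If each extractor's flags should stay local to its own pattern after merging, one possible alternative is Python's scoped inline flags, written (?i:...), which are allowed anywhere in an expression. The sketch below is an assumption about how this could be done, not the fix that was merged in ragas; scope_leading_flags and merge_patterns are hypothetical names, and only the letters i, m, s, and x are handled.

```python
import re

# A leading global inline-flag group such as "(?i)" or "(?im)".
_LEADING_FLAGS = re.compile(r"^\(\?([imsx]+)\)")


def scope_leading_flags(pattern):
    """Rewrite '(?i)abc' as '(?i:abc)'; wrap flag-free patterns in '(?:...)'."""
    match = _LEADING_FLAGS.match(pattern)
    if match is None:
        return f"(?:{pattern})"
    letters = match.group(1)
    body = pattern[match.end():]
    return f"(?{letters}:{body})"


def merge_patterns(patterns):
    """Join patterns with '|' while keeping each pattern's flags scoped to it."""
    return re.compile("|".join(scope_leading_flags(p) for p in patterns))


merged = merge_patterns([r"(?i)hello", r"world"])
print(merged.pattern)
```

With this approach the case-insensitive flag of the first pattern does not leak into the second, which a single set of global flags cannot express.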


yasyf commented 1 month ago

@jjmachan this is the bug I hit

jjmachan commented 1 month ago

@yasyf are you using the experimental testset generation also? @vaishakhRaveendran thanks for bringing this up, will get this merged right now 🙂

explodinggradients / ragas