Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License
4.64k stars 2.84k forks source link

What is MLTable able to filter? #30821

Open TomSB1423 opened 1 year ago

TomSB1423 commented 1 year ago

The documentation states that MLTable is able to filter using an "expression". What types of expressions are available to use?

I would like to be able to filter by datetime object e.g col(date) > 30/01/2023, however it seems like if I write this condition in the filter, MLTable does something else like a string → int conversion and then compares int. The filter does not work correctly for dates.

Would you be able to suggest an alternative approach if not?

pvaneck commented 1 year ago

Hey, thanks for the feedback. @azureml-github, can some guidance be provided for filtering?

github-actions[bot] commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github @Azure/azure-ml-sdk.

xiongrenyi commented 1 year ago

Can you try this? mltable.filter('Date >= datetime(2015, 7, 5, 23)') We'll add valid 'expression' in details to the document shortly

github-actions[bot] commented 1 year ago

Hi @TomSB1423. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

TomSB1423 commented 1 year ago

@xiongrenyi Thanks for your reply however this also does not seem to work.

Here is a senario. With an mltable of:

paths:
- file: data.parquet
transformations:
- read_parquet:
    include_path_column: false
    path_column: Path
type: mltable

When:

  >>> tbl = mltable.load(path)
  >>> tbl = tbl.filter('TimeFeature <= datetime(2022, 05, 16)')

The following is the mltable object with the filter applied:

paths:
- file: data.parquet
transformations:
- read_parquet:
    include_path_column: false
    path_column: Path
- filter: TimeFeature <= datetime(2022, 05, 16)
type: mltable

However when the following runs:

  >>> print(tbl.to_pandas_dataframe())

This generates a validation error:

  azureml.dataprep.api.errorhandlers.ValidationError:
  Error Code: ScriptExecution.Validation
  Validation Error Code: Invalid
  Validation Target: Python-Expression
  Native error: Unable to get operations from the given dataflow. Inner error: InvalidPythonExpression(ParseError { unconsumed: "(2022, 05, 16)", expression: "TimeFeature <= datetime(2022, 05, 16)" })
          GetOperationsError(InvalidPythonExpression(ParseError { unconsumed: "(2022, 05, 16)", expression: "TimeFeature <= datetime(2022, 05, 16)" }))
  => Error occurred while parsing Python Expression. ParseError { unconsumed: "(2022, 05, 16)", expression: "TimeFeature <= datetime(2022, 05, 16)" }
          InvalidPythonExpression(ParseError { unconsumed: "(2022, 05, 16)", expression: "TimeFeature <= datetime(2022, 05, 16)" })
  => Unconsumed expression: (2022, 05, 16), whole expression: TimeFeature <= datetime(2022, 05, 16)
          ParseError { unconsumed: "(2022, 05, 16)", expression: "TimeFeature <= datetime(2022, 05, 16)" }
  Error Message: Python expression parse error. Unconsumed part: '"(2022, 05, 16)"'. From source: '"TimeFeature <= datetime(2022, 05, 16)"'| session_id=l_0be6182f-9788-44ee-8144-cb6ac0597f1f
TomSB1423 commented 1 year ago

Any update on this?

samuel100 commented 1 year ago

@TomSB1423 you need to use col() i.e.

tbl = mltable.load(path)
tbl = tbl.filter("col('TimeFeature') <= datetime(2022, 05, 16)")

filter() function follows Python expressions: https://docs.python.org/3/reference/expressions.html#

Here is a full working example:

import mltable

# glob over paths in cloud storage
paths = [
    {
        "pattern": "wasbs://data@azuremlexampledata.blob.core.windows.net/yellow/*.parquet"
    }
]

tbl = mltable.from_parquet_files(paths)
tbl = tbl.filter("col('tpepPickupDateTime') <= datetime(2014,10,1)")
tbl.show()
TomSB1423 commented 1 year ago

Note: For future use - datetime objects must be in the format datetime(year, month, day) with month and day not containing 0 as a prefix. E.g. first of the month should be represented as "1" not "01". Example: datetime(2023, 1, 5) not datetime(2023, 01, 05)

lucasfijen commented 1 year ago

/unresolve I have actually a very similar question, to me it seems like only a very limited set of operators in these expressions is allowed, would it be an idea to update the documentation on this?

I'm trying to compare my column to a list of values I want to filter out of my mltable. In normal python expressions I would use the 'in' operator with a list comparison, so say for example col("name") in ["a", "b", "c"]. However this doesn't seem to be allowed syntax in the filter function, am I right to understand that my only way to achieve this would be to perform a (col("name") == "a") or (col("name") == "b") or (col("name") == "c")?

-sorry if this is hijacking the thread, but I felt like I was experiencing the same issue/question: The desire to have a better understanding of what exact python expressions are and are not allowed in this filter field

TomSB1423 commented 1 year ago

@lucasfijen I also agree that there should be better documentation for MLTable in general.

  1. What is MLTable able to filter - "expression" is just not detailed enough. Link
  2. convert_column_types appears to have missing information - link: image
chanccm commented 3 months ago

I'm trying to filter a datetime column by yyyy-mm-dd: tbl = tbl.filter("col('Sampling_DateTime') <= date_one_week_ago")

where 'date_one_week_ago' is a variable that changes every time I run the script.

What is the proper way to enter the variable 'date_one_week_ago' into the tbl.filter() please?