Open TomSB1423 opened 1 year ago
Hey, thanks for the feedback. @azureml-github, can some guidance be provided for filtering?
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github @Azure/azure-ml-sdk.
Can you try this? mltable.filter('Date >= datetime(2015, 7, 5, 23)') We'll add valid 'expression' in details to the document shortly
Hi @TomSB1423. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
@xiongrenyi Thanks for your reply however this also does not seem to work.
Here is a senario. With an mltable of:
paths:
- file: data.parquet
transformations:
- read_parquet:
include_path_column: false
path_column: Path
type: mltable
When:
>>> tbl = mltable.load(path)
>>> tbl = tbl.filter('TimeFeature <= datetime(2022, 05, 16)')
The following is the mltable object with the filter applied:
paths:
- file: data.parquet
transformations:
- read_parquet:
include_path_column: false
path_column: Path
- filter: TimeFeature <= datetime(2022, 05, 16)
type: mltable
However when the following runs:
>>> print(tbl.to_pandas_dataframe())
This generates a validation error:
azureml.dataprep.api.errorhandlers.ValidationError:
Error Code: ScriptExecution.Validation
Validation Error Code: Invalid
Validation Target: Python-Expression
Native error: Unable to get operations from the given dataflow. Inner error: InvalidPythonExpression(ParseError { unconsumed: "(2022, 05, 16)", expression: "TimeFeature <= datetime(2022, 05, 16)" })
GetOperationsError(InvalidPythonExpression(ParseError { unconsumed: "(2022, 05, 16)", expression: "TimeFeature <= datetime(2022, 05, 16)" }))
=> Error occurred while parsing Python Expression. ParseError { unconsumed: "(2022, 05, 16)", expression: "TimeFeature <= datetime(2022, 05, 16)" }
InvalidPythonExpression(ParseError { unconsumed: "(2022, 05, 16)", expression: "TimeFeature <= datetime(2022, 05, 16)" })
=> Unconsumed expression: (2022, 05, 16), whole expression: TimeFeature <= datetime(2022, 05, 16)
ParseError { unconsumed: "(2022, 05, 16)", expression: "TimeFeature <= datetime(2022, 05, 16)" }
Error Message: Python expression parse error. Unconsumed part: '"(2022, 05, 16)"'. From source: '"TimeFeature <= datetime(2022, 05, 16)"'| session_id=l_0be6182f-9788-44ee-8144-cb6ac0597f1f
Any update on this?
@TomSB1423 you need to use col()
i.e.
tbl = mltable.load(path)
tbl = tbl.filter("col('TimeFeature') <= datetime(2022, 05, 16)")
filter()
function follows Python expressions: https://docs.python.org/3/reference/expressions.html#
Here is a full working example:
import mltable
# glob over paths in cloud storage
paths = [
{
"pattern": "wasbs://data@azuremlexampledata.blob.core.windows.net/yellow/*.parquet"
}
]
tbl = mltable.from_parquet_files(paths)
tbl = tbl.filter("col('tpepPickupDateTime') <= datetime(2014,10,1)")
tbl.show()
Note: For future use - datetime objects must be in the format datetime(year, month, day) with month and day not containing 0 as a prefix. E.g. first of the month should be represented as "1" not "01". Example: datetime(2023, 1, 5) not datetime(2023, 01, 05)
/unresolve
I have actually a very similar question, to me it seems like only a very limited set of operators in these expressions is allowed, would it be an idea to update the documentation on this?
I'm trying to compare my column to a list of values I want to filter out of my mltable. In normal python expressions I would use the 'in' operator with a list comparison, so say for example col("name") in ["a", "b", "c"]
. However this doesn't seem to be allowed syntax in the filter function, am I right to understand that my only way to achieve this would be to perform a (col("name") == "a") or (col("name") == "b") or (col("name") == "c")
?
-sorry if this is hijacking the thread, but I felt like I was experiencing the same issue/question: The desire to have a better understanding of what exact python expressions are and are not allowed in this filter field
I'm trying to filter a datetime column by yyyy-mm-dd: tbl = tbl.filter("col('Sampling_DateTime') <= date_one_week_ago")
where 'date_one_week_ago' is a variable that changes every time I run the script.
What is the proper way to enter the variable 'date_one_week_ago' into the tbl.filter() please?
The documentation states that MLTable is able to filter using an "expression". What types of expressions are available to use?
I would like to be able to filter by datetime object e.g col(date) > 30/01/2023, however it seems like if I write this condition in the filter, MLTable does something else like a string → int conversion and then compares int. The filter does not work correctly for dates.
Would you be able to suggest an alternative approach if not?