HumanSignal / label-studio

Label Studio is a multi-type data labeling and annotation tool with standardized output format
https://labelstud.io
Apache License 2.0

Time series data load from csv stalls if timestamps have timezone information appended #6589

Open pmayostendorp opened 2 weeks ago

pmayostendorp commented 2 weeks ago

Describe the bug

I have two separate csv files used to create "activity recognition tasks". The only difference between the two is the format of the timestamps.

File 1 was generated directly from time series data stored in a pandas.DataFrame object with a pandas.DatetimeIndex as its index, and exported using pandas.DataFrame.to_csv(..., index=True). This generates timezone-aware timestamps in the format YYYY-MM-DD hh:mm:ss.f+00:00, for example 2024-04-11 12:25:59.487929+00:00.

File 2 is the same file, but with the timezone information scrubbed from the datetimes (format YYYY-MM-DD hh:mm:ss.f, for example 2024-04-11 12:25:59.487929).

When file 1 is loaded, the data will not load and a spinner shows indefinitely.

The console logs several errors that are not particularly useful/diagnostic and may not even be related, mostly various "cannot read properties of undefined" errors, which leads me to think this could be masking issues identified in other bug reports like this one. Kudos to this comment which pointed me to this issue in the first place.

File 2 loads normally, so the format is obviously the culprit.

To Reproduce

  1. Create a simple time series csv file with the following format for datetimes in the first column: YYYY-MM-DD hh:mm:ss.f+00:00 which looks like 2024-04-11 12:25:59.487929+00:00
  2. Save a copy of this file, but with the timezone info removed from the datetimes. They should be in the format: YYYY-MM-DD hh:mm:ss.f which looks like 2024-04-11 12:25:59.487929.
  3. Generate tasks for both of these time series.
  4. Set up a time series activity recognition or similar annotation template.
  5. Try to annotate file 1 and observe spinner.
  6. Try to annotate file 2 and see data load. Load data, load.
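Steps 1 and 2 can be sketched with pandas as follows. This is a minimal illustration, not the reporter's actual data: the column name `value`, the index name `time`, and the sample readings are all hypothetical, and the CSVs are written to in-memory buffers rather than to disk.

```python
import io
import pandas as pd

# Hypothetical sample data: three readings on a timezone-aware (UTC) index.
index = pd.date_range(
    "2024-04-11 12:25:59.487929",
    periods=3,
    freq=pd.Timedelta(seconds=1),
    tz="UTC",
)
df = pd.DataFrame({"value": [0.1, 0.2, 0.3]}, index=index)
df.index.name = "time"

# File 1: timezone-aware timestamps, written with a "+00:00" suffix.
with_tz = io.StringIO()
df.to_csv(with_tz, index=True)

# File 2: the same data with the timezone dropped from the index.
without_tz = io.StringIO()
df.tz_localize(None).to_csv(without_tz, index=True)

# First data row of each file (row 0 is the header).
first_with_tz = with_tz.getvalue().splitlines()[1]
first_without_tz = without_tz.getvalue().splitlines()[1]
print(first_with_tz)
print(first_without_tz)
```

The two printed rows differ only in the `+00:00` suffix, which is the difference described in the report.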

Expected behavior

Data loads into template in step 5 as well. OR

Environment (please complete the following information):

heidi-humansignal commented 1 week ago

Hello,

Label Studio's TimeSeries tag requires that the timeFormat parameter in your labeling configuration matches the exact format of your timestamps. The presence of timezone information in the format +00:00 can cause parsing issues when the timeFormat pattern does not account for the offset. Note that Python's strptime only accepts offsets containing a colon (via %z) from Python 3.7 onward.
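How a strftime-style pattern interacts with the +00:00 suffix can be checked directly with Python's standard library. The sketch below assumes Python 3.7 or later, where `%z` accepts offsets written with a colon. It only illustrates pattern matching in `datetime.strptime` and makes no claim about how Label Studio itself parses timestamps.

```python
from datetime import datetime, timedelta

raw = "2024-04-11 12:25:59.487929+00:00"

# With %z in the pattern, the "+00:00" offset is consumed (Python 3.7+).
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f%z")
print(parsed.utcoffset())

# Without %z, the trailing offset is left unconverted and strptime raises.
try:
    datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")
    naive_ok = True
except ValueError:
    naive_ok = False
print(naive_ok)
```

A pattern that omits the offset fails on the timezone-aware string, which is the kind of format mismatch described above.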

Solution: To resolve this issue, you can adjust your timestamp format or modify the timeFormat parameter.

Option 1: Modify the Timestamp Format

Since the timezone information is causing the parsing issue, you can preprocess your CSV file to remove the timezone offset from the timestamps. Here's how you can do it using Pandas:

import pandas as pd

# Read your original CSV with timezone info
df = pd.read_csv('file_with_timezone.csv')

# Parse the 'timestamp' column as UTC, then format it without the offset
# ('timestamp' is a placeholder; use the name of your own time column)
df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True).dt.strftime('%Y-%m-%d %H:%M:%S.%f')

# Save the modified CSV without timezone info
df.to_csv('file_without_timezone.csv', index=False)
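In the original report the timestamps were written from the DataFrame index (to_csv(..., index=True)), so they sit in the first column rather than in a column named 'timestamp'. A minimal sketch of the same cleanup for that layout follows. It assumes the first column holds the timestamps; the inline sample data and the column name `time` are hypothetical.

```python
import io
import pandas as pd

# Stand-in for the exported file: timestamps in the first, unnamed column.
raw_csv = io.StringIO(
    ",value\n"
    "2024-04-11 12:25:59.487929+00:00,0.1\n"
    "2024-04-11 12:26:00.487929+00:00,0.2\n"
)

df = pd.read_csv(raw_csv, index_col=0)

# Parse as UTC, then drop the timezone so no "+00:00" suffix is written.
df.index = pd.to_datetime(df.index, utc=True).tz_localize(None)
df.index.name = "time"  # hypothetical name to reference in the labeling config

cleaned = io.StringIO()
df.to_csv(cleaned, index=True)
lines = cleaned.getvalue().splitlines()
print(lines[1])
```

Converting to UTC before dropping the timezone keeps the instants consistent if the file ever contains more than one offset.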

Thank you, Abu

Comment by Abubakar Saad