benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
http://www.videlibri.de/xidel.html
GNU General Public License v3.0
681 stars 42 forks source link

x:parse-dateTime error on 12:nn PM where nn is > 0 #102

Closed spudmusket closed 1 year ago

spudmusket commented 1 year ago

There appears to be an error in the date time parsing function when using a 12 hour clock after 12 PM.

Example

String to parse: "13/03/2023 12:50:00 PM"

Parse function: $date_time:=x:parse-dateTime(.,"d/mm/yyyy h:nn:ss am/pm")

Result: err:FORG0001: Invalid conversion from 13/03/2023 12:50:00 PM to date format dd/mm/yyyy h:nn:ss am/pm

Expected result: "2023-03-14T00:50:00"

Parses correctly for "13/03/2023 12:00:00 PM" resulting in "2023-03-14T00:00:00" but fails for all times between 12:00:01 PM - 12:59:59 PM.

benibela commented 1 year ago

I thought that would be written as 00:50:00 PM

Perhaps it should report "00:50:00 PM" as invalid?

Here is a patch :datetime.txt

I did not commit it in case it breaks something else.

spudmusket commented 1 year ago

A 12 hour clock does not display 00 as an hour. So no, it is not written as 00:50:00 PM. The string to parse that fails is between 12:00:01 PM and 12:59:59 PM (these are valid 12 hour clock times).

I am not in a position to compile the code so can't test it, sorry.

I have written an xquery function to work around the issue for my application.

Reino17 commented 1 year ago

Hello Benito,

Here is a patch :datetime.txt

I've made an attempt, but failed.

I grabbed the latest from http://hg.code.sf.net/p/videlibri/code, but I believe that's not the same source you're using. I actually had to split your patch, because 'components/pascal/data/bbutils.pas' is CR+LF, while _'components/pascal/data/xquerytypes.inc' is LF. Patching these files worked after that, though lots of "hunks" with different offsets. But then...

-- Building...
Free Pascal Compiler version 3.2.2 [2021/05/15] for i386
Copyright (c) 1993-2021 by Florian Klaempfl and others
Target OS: Win32 for i386
Compiling xidel.pas
Compiling [...]\videlibri-code_hg\components\pascal\internet\internetaccess.pas
Compiling [...]\videlibri-code_hg\components\pascal\data\bbutils.pas
bbutils.pas(1512,15) Warning: Symbol "strDecodeUTF8Character" is deprecated: "Use (pchar,pchar) overload or strIterator."
bbutils.pas(6058,13) Error: Identifier not found "beginsWithCaseInsensitively"
bbutils.pas(6473,13) Warning: Symbol "strEncodeHex" is deprecated
bbutils.pas(6828,16) Error: identifier idents no member "beginsWithCaseInsensitively"
bbutils.pas(6835,16) Error: identifier idents no member "beginsWithCaseInsensitively"
bbutils.pas(6879,22) Error: identifier idents no member "viewFirst"
bbutils.pas(6917,24) Error: identifier idents no member "viewFirst"
bbutils.pas(6921,24) Error: identifier idents no member "viewFirst"
bbutils.pas(7314) Fatal: There were 6 errors compiling module, stopping
Fatal: Compilation aborted
benibela commented 1 year ago

because 'components/pascal/data/bbutils.pas' is CR+LF, while 'components/pascal/data/xquery_types.inc' is LF

how odd

bbutils.pas(6058,13) Error: Identifier not found "beginsWithCaseInsensitively"

i committed those functions: http://hg.code.sf.net/p/videlibri/code/rev/99302f05da0d

Reino17 commented 1 year ago

I'm not sure what happened, but after a fresh hg clone, it worked. No compilation-errors (lots of warnings though) and no "hunks" with different offsets either. Lots of files do still have CRLF line ending though (find . -not -type d -exec file "{}" ";" | grep CRLF tells me it's a long list).

$ xidel -se '
  ("11:59:59","12:00:00","12:00:01","12:59:59") ! x:parse-dateTime(
    x"13/03/2023 {.} PM","d/mm/yyyy h:nn:ss am/pm"
  )
'
2023-03-13T23:59:59
2023-03-14T00:00:00
2023-03-14T00:00:01
2023-03-14T00:59:59

Your actual patch appears to work fine.

I did not commit it in case it breaks something else.

Like what exactly?

spudmusket commented 1 year ago

"13/03/2023 11:59:59 PM" should return "2023-03-13T23:59:59" "13/03/2023 12:00:00 PM" should return "2023-03-13T12:00:00" "13/03/2023 12:00:01 PM" should return "2023-03-13T12:00:01"

i.e. AM is before noon PM is after noon. 12 AM is midnight, 12PM is noon?

At noon the 12 hour clock rollover is: 11:59:59AM 12:00:00PM 12:00:01PM (24HR equiv: 11:59:59 12:00:00 12:00:01) At midnight the 12 hour clock rollover is: 11:59:59PM 12:00:00AM 12:00:01AM (24HR equiv: 23:59:59 00:00:00 00:00:01)

So the above output is NOT correct?

Looks like I pasted the wrong expected result in my original post :(

Reino17 commented 1 year ago

Ah, yes, you're right. The output for 12:00:00-12:59:59AM and 12:00:00-12:59:59PM should be turned around.

benibela commented 1 year ago

Lots of files do still have CRLF line ending though (find . -not -type d -exec file "{}" ";" | grep CRLF tells me it's a long list).

That seem to be mostly testcases and files created before I moved to linux around 2009

I did not commit it in case it breaks something else. Like what exactly?

Any datetime parsing I did not have tests for

It is not just a patch, but a complete reimplementation of the datetime parsing. At first it broke a lot of tests.

At least it is faster, too

Looks like I pasted the wrong expected result in my original post :(

I implemented it from the post...

At noon the 12 hour clock rollover is: 11:59:59AM 12:00:00PM 12:00:01PM (24HR equiv: 11:59:59 12:00:00 12:00:01) At midnight the 12 hour clock rollover is: 11:59:59PM 12:00:00AM 12:00:01AM (24HR equiv: 23:59:59 00:00:00 00:00:01)

And it is always 13/03/2023 not 14/03/2023?

Your post lead me astray

That would have been a simple patch. In the old code am/pm could only affect the hour not the day/date.

Here is a new patch: datetime2.txt

spudmusket commented 1 year ago

And it is always 13/03/2023 not 14/03/2023?

On the midnight rollover it will obviously change from 13/03/2023 to 14/03/2023 as it changes from 11:59:59 PM to 12:00:00 AM, but yes, this was not the original problem with PM as it is at midday and no date change is required.

Thanks for your efforts - if the patch is rolled out, I should be able to test it once available for my OS.

Reino17 commented 1 year ago

That seem to be mostly testcases and files created before I moved to linux around 2009

Well, it's a bit more than that. See 'find_CRLF.txt'. Apparently the local repo you're working with is diferent than http://hg.code.sf.net/p/videlibri/code, because I don't think anyone can apply your patches here as is, because of the different line-endings. Please fix that (convert all files to LF line-ending?).

Any datetime parsing I did not have tests for

I've done some quick testing with some of the dateTime-conversions I'm doing with my Xivid. All works perfectly.

Here is a new patch: datetime2.txt

Applied your new patch and compiled a fresh binary. It appears to produce the correct results this time:

$ xidel -se '
  ("12:00:00","12:00:01","12:59:59","11:59:59") ! x:parse-dateTime(x"13/03/2023 {.} AM","d/mm/yyyy h:nn:ss am/pm"),
  ("12:00:00","12:00:01","12:59:59","11:59:59") ! x:parse-dateTime(x"13/03/2023 {.} PM","d/mm/yyyy h:nn:ss am/pm")
'
2023-03-13T00:00:00
2023-03-13T00:00:01
2023-03-13T00:59:59
2023-03-13T11:59:59
2023-03-13T12:00:00
2023-03-13T12:00:01
2023-03-13T12:59:59
2023-03-13T23:59:59
benibela commented 1 year ago

merged: https://github.com/benibela/internettools/commit/b859016179542d2b651fde7065354c2d0867e8fe

On the midnight rollover it will obviously change from 13/03/2023 to 14/03/2023 as it changes from 11:59:59 PM to 12:00:00 AM, but yes, this was not the original problem with PM as it is at midday and no date change is required.

no, there it becomes 13/03/2023 24:00:00, which is valid for the 24-clock

spudmusket commented 1 year ago

no, there it becomes 13/03/2023 24:00:00, which is valid for the 24-clock

Nope. There is no such time as 24:00:00 on either a 24hr clock or a 12hr clock. My comment was very obviously related to a 12hr clock and is correct!

12hr and 24hr time is standard across the world - I'm not sure why it seems to be so difficult...

benibela commented 1 year ago

But in XPath/XML Schema, 24:00:00 is a valid time. xs:time("24:00:00") is supposed to work and xs:time("24:00:01") is an error

spudmusket commented 1 year ago

I suggest you read the ISO8601 standard and the W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. 24:00 is used as a reference to end of day in natural language only...

benibela commented 1 year ago

24:00:00 is listed as allowed value in the lexical representation: https://www.w3.org/TR/xmlschema/#time-lexical-mapping

spudmusket commented 1 year ago

From that document:

hour: an integer between 0 and 23 inclusive

This is the same as the ISO standard.

I give up, you seem to want to find what you want to believe, even if it is clearly wrong.

benibela commented 1 year ago

hour: an integer between 0 and 23 inclusive

That is how it is stored, not what is accepted as input

spudmusket commented 1 year ago

You can store it however you want - epoch for example - nothing to do with what I outlined above.

Again - I suggest you read the ISO standard.

I'll not comment further, as you seem to have made your mind up incorrectly re a time of 24:00:00...

Reino17 commented 1 year ago

12hr and 24hr time is standard across the world

No, it's not.

As benibela mentioned, xs:time("24:00:00") is simply valid for XQuery/XPath.

https://www.w3.org/TR/xpath-functions-31/#date-time-values:

For xs:time, 00:00:00 and 24:00:00 are alternate lexical forms for the same value, whose canonical representation is 00:00:00. For xs:dateTime, a time component 24:00:00 translates to 00:00:00 of the following day.

https://www.w3.org/TR/xpath-functions-31/#date-time-lexical-mapping:

An xs:dateTime with lexical representation 1999-12-31T24:00:00 is represented by {2000, 1, 1, 0, 0, 0.0, ()}.

Practically the same discussion over here.