GeoscienceAustralia / agdc

Repository for Australian Geoscience Data Cube (AGDC) code
BSD 3-Clause "New" or "Revised" License

Durable/invariant acquisition identifier #69

Closed smr547 closed 9 years ago

smr547 commented 9 years ago

Since its inception, WOfS has encountered problems with the lack of a stable identifier for "tiles/acquisitions". WOfS currently uses the AGDC convention of "start timestamp" which is available via the API and also encoded into the dataset filename.

One problem is that this value is fuzzy. In the past, when a tile/acquisition was re-ingested, the "start timestamp" was not guaranteed to retain the same value. Any application running incremental updates may incorrectly interpret a slightly changed start time as a new tile.

The microsecond resolution of these timestamps also causes problems: the string representation of the timestamp changes depending on the value (or absence) of the microsecond component. A general solution to this problem requires some regex gymnastics (e.g. https://github.com/smr547/ga-neo-nfrip/blob/luigi_api_refactor/wofs/timeparser.py).
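To illustrate the string-representation problem, here is a minimal sketch that parses a start timestamp with or without the fractional part and always writes it back with six fractional digits. It is not the linked `timeparser.py`; the function names are hypothetical, and the hyphenated time format is assumed from the filename example in this issue.

```python
from datetime import datetime

# Formats tried in order. The hyphenated time part is assumed from the
# filename example in this issue (1991-01-09T23-03-58.241019).
_FORMATS = ("%Y-%m-%dT%H-%M-%S.%f", "%Y-%m-%dT%H-%M-%S")


def parse_start_timestamp(text):
    """Parse a start timestamp with or without a fractional-second part."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognised timestamp: %r" % text)


def format_start_timestamp(moment):
    """Always emit six fractional digits so the string form is stable."""
    return moment.strftime("%Y-%m-%dT%H-%M-%S.%f")


with_fraction = parse_start_timestamp("1991-01-09T23-03-58.241019")
without_fraction = parse_start_timestamp("1991-01-09T23-03-58")
print(format_start_timestamp(with_fraction))
print(format_start_timestamp(without_fraction))
```

Formatting explicitly with `%f` avoids relying on `str(datetime)`, which omits the fractional part when the microsecond value is zero.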

Anyway, I'd like to suggest that we consider using a combination of satellite_id, "orbit number" and AGDC "cell_ID" as a durable acquisition identifier within any particular dataset type.

This ID has the highly desirable property of not changing between ingestions. The orbit number is readily available from TLEs and could be computed during pre-ingestion processing (if it is not already).

The identifier's value can also be easily predetermined, i.e. we can reliably state what acquisition IDs will be generated during the next (or any) orbit of LS8. This has significant benefits when developing QA procedures (such as identifying gaps in our data).

Obviously it also eliminates the possibility of data duplication.

As an example

LS5_TM_NBAR150-034_1991-01-09T23-03-58.241019.tif

might become

LS5_NBAR150-034_23456.tif
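A rough sketch of how the proposed identifier could be built and used for gap detection follows. The function names, the cell tuple and the orbit numbers are all hypothetical; the output format simply mirrors the example filename above. It shows the proposal only, and it assumes the orbit number is stable, which the replies in this thread dispute.

```python
def acquisition_id(satellite_id, dataset_type, cell_id, orbit_number):
    """Build an identifier shaped like the example LS5_NBAR150-034_23456."""
    x_index, y_index = cell_id
    # The signed, zero-padded y index reproduces the "-034" in the example.
    return "{0}_{1}{2:03d}{3:+04d}_{4}".format(
        satellite_id, dataset_type, x_index, y_index, orbit_number
    )


def missing_ids(expected, ingested):
    """Return the expected identifiers that have not been ingested."""
    return sorted(set(expected) - set(ingested))


# Hypothetical orbit numbers for one cell; real values would come from
# pre-ingestion processing.
expected = [
    acquisition_id("LS5", "NBAR", (150, -34), orbit)
    for orbit in (23456, 23457, 23458)
]
ingested = [expected[0], expected[2]]

print(expected[0])
print(missing_ids(expected, ingested))
```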

wenjunwu commented 9 years ago

Be aware that “orbit number” is not a reliable value. Depending on how the origin is defined, we have encountered a +/-2 discrepancy for the same pass. We have deliberately avoided using “orbit number” as an identifier. Location, time and sensor information are the correct combination to form a unique identifier.

Regarding the issue caused by microsecond resolution, I would say this is an implementation error. When we are dealing with scientific computing, we should never treat a fractional value like an integer. For example, when we compare two float values f1 and f2, we never check if (f1==f2) but check whether abs(f1-f2) is less than an acceptable range.
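The tolerance-based comparison described above could look roughly like this when applied to acquisitions. The record type, its field names and the one-second tolerance are assumptions made for illustration; the thread does not state an acceptable range, and this is not the AGDC schema.

```python
from collections import namedtuple
from datetime import datetime, timedelta

# Hypothetical record type; field names are illustrative only.
Acquisition = namedtuple("Acquisition", "satellite sensor cell start_time")

# Assumed tolerance; pick a value that suits the sensor and ingestion process.
TOLERANCE = timedelta(seconds=1)


def same_acquisition(first, second, tolerance=TOLERANCE):
    """Match on satellite, sensor and cell, with start times within tolerance."""
    same_source = (first.satellite, first.sensor, first.cell) == (
        second.satellite,
        second.sensor,
        second.cell,
    )
    return same_source and abs(first.start_time - second.start_time) <= tolerance


original = Acquisition("LS5", "TM", (150, -34), datetime(1991, 1, 9, 23, 3, 58, 241019))
# Same pass after a re-ingestion that shifted the start time slightly.
reingested = original._replace(start_time=datetime(1991, 1, 9, 23, 3, 58, 241500))
# A different pass over the same cell on a later date (made-up time).
later_pass = original._replace(start_time=datetime(1991, 1, 25, 23, 4, 2))

print(same_acquisition(original, reingested))
print(same_acquisition(original, later_pass))
```

With this kind of check, an incremental update would treat the re-ingested tile as the existing acquisition rather than a new one.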

From: Steven Ring [mailto:notifications@github.com] Sent: Thursday, 9 July 2015 12:14 PM To: GeoscienceAustralia/agdc Subject: [agdc] Durable/invariant acquisition identifier (#69)

Geoscience Australia Disclaimer: This e-mail (and files transmitted with it) is intended only for the person or entity to which it is addressed. If you are not the intended recipient, then you have received this e-mail by mistake and any use, dissemination, forwarding, printing or copying of this e-mail and its file attachments is prohibited. The security of emails transmitted cannot be guaranteed; by forwarding or replying to this email, you acknowledge and accept these risks.

smr547 commented 9 years ago

I'm surprised orbit number cannot be precisely determined. Does that mean that the 'revolution number' in the published TLEs is unreliable?

(In reply to Wenjun Wu's comment of Jul 14, 2015, 10:09 AM.)

wenjunwu commented 9 years ago

For example, LS5 was parked once. When it came back into service, the orbit number had been reset. Some satellites have a different starting orbit number in the TLE vs. the ancillary data provided by the satellite owner (e.g. Radarsat-1). Furthermore, a TLE is a predicted orbit, not the actual one. The orbit number can be off by one when the satellite is close to the equator on an ascending pass.
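For reference, the TLE value under discussion is the revolution number at epoch, a five-character field in columns 64 to 68 of line 2. The sketch below only reads that field; the function name is hypothetical and the sample line is an ISS element set (catalogue number 25544) used purely for illustration, not Landsat data. As noted above, the number it returns is whatever the element set publisher counted, so it may not agree with the satellite owner's ancillary data.

```python
def revolution_number_at_epoch(tle_line2):
    """Return the revolution-number field (columns 64-68) of TLE line 2."""
    if len(tle_line2) < 69 or not tle_line2.startswith("2 "):
        raise ValueError("not a TLE line 2")
    # Columns are 1-based in the format description, hence the 63:68 slice.
    return int(tle_line2[63:68])


# Sample ISS line 2, for illustration only.
SAMPLE_LINE2 = "2 25544  51.6416 247.4627 0006703 130.5360 325.0288 15.72125391563537"

print(revolution_number_at_epoch(SAMPLE_LINE2))
```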

From: Steven Ring [mailto:notifications@github.com] Sent: Tuesday, 14 July 2015 11:24 AM To: GeoscienceAustralia/agdc Cc: Wu Wenjun Subject: Re: [agdc] Durable/invariant acquisition identifier (#69)

smr547 commented 9 years ago

Thanks Wenjun. Orbit number looks like a slippery eel. Thanks for enlightening me.

I'm happy for this suggestion/issue to be closed :D

(In reply to Wenjun Wu's comment of 14 July 2015 at 12:52.)
