USF-IMARS / imars-etl

:cloud: Tools for `extract` and `load` for IMaRS ETL (Extract, Transform, Load) operations

WV files with same `date_time`s #40

Open 7yl4r opened 5 years ago

7yl4r commented 5 years ago

Currently the database & imars-etl operate under the assumption that files are unique by their date_time (to the microsecond), satellite, & instrument. This seemed reasonable since each satellite instrument should only be able to create one file at a time.

Unfortunately this assumption may not be true for the worldview satellites. These two files have identical FIRSTLINETIME and STARTTIME defining two different "granules":

/srv/imars-objects/extra_data/WV02/2013.01/WV02_20130123163628_0000000000000000_13Jan23163628-M1BS-059048321010_01_P001.xml
/srv/imars-objects/extra_data/WV02/2013.01/WV02_20130123163628_0000000000000000_13Jan23163628-M1BS-059048321010_01_P002.xml

The GENERATIONTIME does differ between the two, but I am fairly confident that this time is even less likely to be unique. The Worldview documentation I have seen (DigitalGlobe, PCI) does not go into enough detail to determine whether any of these times should be unique, but I thought FIRSTLINETIME would be our best chance.

airflow cannot work without unique times for each granule. If there are indeed multiple granules that cannot be differentiated by timestamp, we need to fake it somehow.
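For illustration, a minimal Python sketch of how colliding timestamps could be detected from the metadata. It assumes each .xml file contains a FIRSTLINETIME element somewhere in its tree (the nesting is not stated here, so the code searches at any depth). The function names and the idea of passing XML text in a dict are hypothetical, not part of imars-etl.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def first_line_time(xml_text):
    """Return the text of the first FIRSTLINETIME element found, or None."""
    root = ET.fromstring(xml_text)
    for elem in root.iter("FIRSTLINETIME"):
        return (elem.text or "").strip()
    return None


def find_time_collisions(xml_by_path):
    """Map each FIRSTLINETIME shared by two or more files to those file paths.

    xml_by_path: dict of {file path: XML text} (hypothetical input shape).
    """
    groups = defaultdict(list)
    for path, xml_text in xml_by_path.items():
        groups[first_line_time(xml_text)].append(path)
    return {
        stamp: sorted(paths)
        for stamp, paths in groups.items()
        if stamp is not None and len(paths) > 1
    }
```

Any timestamp that comes back from `find_time_collisions` marks a set of granules that would need a tie-breaker (or de-duplication) before being loaded.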

7yl4r commented 5 years ago

@sebastiandig & @mjm8 : I thought the assumption that no two WV2 files would have the exact same datetime was a safe choice, but the two .xml files mentioned above seem to disprove this assumption.

Is there something I am overlooking or can the worldview 2 multispectral instrument really record two images simultaneously?

7yl4r commented 5 years ago

Note that multiple files within the same second are expected, and these are differentiated by P001, P002, etc. But this is the first time I have seen two images with the same datetime to the microsecond (2013-01-23 16:36:28.515950). This is a big problem for our airflow pipeline.

mjm8 commented 5 years ago

Where are these filenames? I'm surprised this happened so I want to check out the details.

7yl4r commented 5 years ago

The two xml files are:

extra_data/WV02/2013.01/WV02_20130123163628_0000000000000000_13Jan23163628-M1BS-059048321010_01_P001.xml
extra_data/WV02/2013.01/WV02_20130123163628_0000000000000000_13Jan23163628-M1BS-059048321010_01_P002.xml

I have not looked at the corresponding .ntf files or other files yet, but the xml files had several other differences despite sharing the same datetime. Pretty weird, right?

mjm8 commented 5 years ago

Sorry for the delay, just got back from a short vacation and finally pulled up those images. Strangely enough, neither contains the usual cluster of files, but instead each only has an NTF and XML file, which is fine for our purposes. Interestingly, these images are of Tampa Bay, but the second (P002) is just a sliver of P001, containing no additional information or geographic coverage. In other words, it can be deleted. I'd like to know if this is the case for more images - do you have any other examples?

-- Matt McCarthy, Ph.D. Biological Oceanography College of Marine Science University of South Florida 140 7th Avenue South St Petersburg, FL 33701-5016 727-553-1186

7yl4r commented 5 years ago

> Strangely enough, neither contains the usual cluster of files, but instead each only has an NTF and XML file, which is fine for our purposes.

This part of the weirdness is my doing - I have been cutting some corners to rush through the Azure credits. The rest of the files are compressed in an archive elsewhere.

Thanks for taking a look. I'm very relieved to know I can just delete one of them rather than re-tool the whole system.

I don't have any more examples but there are many files left to process. I will let you know if I come across any more like this.

-- Tylar Murray, Ph.D. (http://tylar.info) IMaRS (http://imars.marine.usf.edu/) Research Systems & Software Engineer, USF CMS (http://marine.usf.edu) - KRC 3119-B. Schedule: g-calendar https://calendar.google.com/calendar?cid=NWRuOHRubTBmczlmZjN0cTVhMGczbnBqbXNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ, youcanbookme https://7yl4r.youcanbook.me/

mjm8 commented 5 years ago

Sounds good!

7yl4r commented 5 years ago

Well... Here are ~400 more examples of this: http://imars-physalis.marine.usf.edu:3000/queries/14-unexpected-duplicates (email me if you need the login info).

All files there have the same datetime (to the ms despite that not showing) and area, but the content of the files is different. That query shows files in pairs with filepaths & hashes in one row. As an example, the following two files are in conflict:

/srv/imars-objects/west_fl_pen/ntf_wv2_m1bs/WV02_20130123163628_0000000000000000_13Jan23163628-M1BS-059048321010_01_P001.ntf
/srv/imars-objects/west_fl_pen/ntf_wv2_m1bs/WV02_20130123163628_0000000000000000_13Jan23163628-M1BS-059048321010_01_P002.ntf

I think those two are the same kind of issue as the previous example.

Another case seems to be where the pass ID is also the same, but the "catalog ID" (I think that is what it is called...) differs. An example of this case:

/srv/imars-objects/monroe/ntf_wv2_m1bs/WV02_20160507155056_0000000000000000_16May07155056-M1BS-058523213010_01_P002.ntf
/srv/imars-objects/monroe/ntf_wv2_m1bs/WV02_20160507155056_0000000000000000_16May07155056-M1BS-058523214010_01_P002.ntf

We need to find a way to resolve these conflicts before airflow can process these files.
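To make the two cases above concrete, here is a rough Python sketch that splits a filename into its parts and reports which parts differ between two conflicting files. The regex is inferred only from the example filenames in this thread, and the group names (`id_field` for the number I called "catalog ID", `part` for the P00x suffix) are labels made up for this example, not official field names.

```python
import re

# Pattern inferred from the example filenames only; treat it as an assumption.
WV_NAME = re.compile(
    r"^(?P<sat>WV\d{2})_(?P<stamp>\d{14})_\d+_"
    r"\d{2}[A-Za-z]{3}\d{8}-(?P<product>[A-Z0-9]+)-"
    r"(?P<id_field>\d+)_(?P<sub>\d+)_P(?P<part>\d+)\.(?P<ext>\w+)$"
)


def parse_name(path):
    """Split a WV filename (with or without directories) into named parts."""
    name = path.rsplit("/", 1)[-1]
    match = WV_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized filename: {name}")
    return match.groupdict()


def differing_parts(path_a, path_b):
    """Return the sorted names of the filename parts that differ."""
    a, b = parse_name(path_a), parse_name(path_b)
    return sorted(key for key in a if a[key] != b[key])
```

For the first pair above this returns `['part']`, and for the second pair it returns `['id_field']`, so the two kinds of conflict can be counted and handled separately.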

7yl4r commented 5 years ago

I am comparing the following 2 files in /srv/imars-objects/west_fl_pen:

WV02_20170512163422_0000000000000000_17May12163422-M1BS-059145537010_01_P010.ntf
WV02_20170512163422_0000000000000000_17May12163422-M1BS-058943203010_01_P010.ntf

They are the same size. Looking at them in QGIS, they look the same. The metadata output from gdalinfo is identical except for NITF_FDT, NITF_FTITLE, NITF_IID2, which all seem to be variations on the filename:

tylar@XT3:~$ gdalinfo WV02_20170512163422_0000000000000000_17May12163422-M1BS-059145537010_01_P010.ntf >> info1
tylar@XT3:~$ gdalinfo WV02_20170512163422_0000000000000000_17May12163422-M1BS-058943203010_01_P010.ntf >> info2
tylar@XT3:~$ diff info1 info2 
2c2
< Files: WV02_20170512163422_0000000000000000_17May12163422-M1BS-059145537010_01_P010.ntf
---
> Files: WV02_20170512163422_0000000000000000_17May12163422-M1BS-058943203010_01_P010.ntf
31c31
<   NITF_FDT=20190207231221
---
>   NITF_FDT=20190103163752
51c51
<   NITF_FTITLE=17MAY12163422-M1BS-059145537010_01_P010.NTF
---
>   NITF_FTITLE=17MAY12163422-M1BS-058943203010_01_P010.NTF
60c60
<   NITF_IID2=12MAY17WV021200017MAY12163422-M1BS-059145537010_01_P010
---
>   NITF_IID2=12MAY17WV021200017MAY12163422-M1BS-058943203010_01_P010

I am somewhat comfortable concluding that these two have identical data and deleting one of the files. The filename being embedded in the file itself makes this difficult to test for conclusively. Questions:

  1. Which file should we keep? I guess it doesn't matter.
  2. Are we 100% confident that NITF files with identical metadata excluding the three fields above and identical file sizes are the same?

Thoughts @mjm8 ?
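The manual diff above could be automated roughly like this. This is a Python sketch that compares two saved gdalinfo outputs while skipping the lines that embed the filename or file date (Files:, NITF_FDT, NITF_FTITLE, NITF_IID2). The function names are hypothetical, and it assumes each of those fields sits on a single line, as in the diff shown here.

```python
# Line prefixes that are expected to differ between re-downloaded copies.
IGNORED_PREFIXES = ("Files:", "NITF_FDT=", "NITF_FTITLE=", "NITF_IID2=")


def significant_lines(gdalinfo_text):
    """Return gdalinfo lines with the filename-derived fields removed."""
    kept = []
    for line in gdalinfo_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(IGNORED_PREFIXES):
            continue
        kept.append(stripped)
    return kept


def same_apart_from_filename_fields(info_a, info_b):
    """True if the two outputs match once the ignored fields are dropped."""
    return significant_lines(info_a) == significant_lines(info_b)
```

This only compares what gdalinfo prints, so it does not answer question 2: two files could pass this check and still differ in pixel data.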

mjm8 commented 5 years ago

Try as I might, I can't find a single meaningful difference between the two images. I pulled them up in ENVI, linked them, and checked spectral profiles - everything was identical. However, in the metadata it looks like maybe they were produced at different times, and the Date Modified column in the Windows folder indicates they were modified a day apart. Could the same image have been downloaded twice on consecutive days? In any case, I think it's safe to just delete the second of the two.

7yl4r commented 5 years ago

> Could the same image have been downloaded twice on consecutive days?

Yes I think this is exactly what happened. @sebastiandig warned me it was likely we would get some duplicates; I just wasn't expecting the files to be different. Since the metadata in the file differs, it is a bit harder to identify duplicates. I am going to move forward with removing the older file if the metadata differs only in this way and the files are the same size. I would like to have a more robust check, but hopefully this will be good enough.
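A minimal sketch of that rule in Python, under some assumptions: "older" is judged by NITF_FDT (read as YYYYMMDDhhmmss, which matches the values in the gdalinfo output above), and the metadata comparison has already been done elsewhere and is passed in as a boolean. `Candidate` and `file_to_remove` are hypothetical names, not what imars_dags actually uses.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Candidate:
    path: str
    size_bytes: int
    nitf_fdt: str  # e.g. "20190207231221", as printed by gdalinfo


def file_to_remove(a: Candidate, b: Candidate, metadata_matches: bool) -> Optional[str]:
    """Return the path of the older file, or None if the rule does not apply.

    metadata_matches: True when the metadata differs only in the
    filename-derived fields (that check is assumed to happen elsewhere).
    """
    if not metadata_matches or a.size_bytes != b.size_bytes:
        return None
    fmt = "%Y%m%d%H%M%S"
    time_a = datetime.strptime(a.nitf_fdt, fmt)
    time_b = datetime.strptime(b.nitf_fdt, fmt)
    if time_a == time_b:
        return None  # no way to say which one is older
    return a.path if time_a < time_b else b.path
```

Returning None whenever any condition fails keeps the default safe: a pair that does not clearly match the rule is left alone for manual review.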

mjm8 commented 5 years ago

Sounds like a good plan.

7yl4r commented 5 years ago

As of USF-IMARS/imars_dags/7d3d5bfb331bc30401182244537a7b11eb8d5167 the infrastructure to detect these and do something about them should be in place. There were some big changes involved so I am going to let it run on the test server for a while before merging into production and flipping the "actually delete files" switch.
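For anyone unfamiliar with the pattern, the "actually delete files" switch amounts to a dry-run flag. The sketch below is a generic Python illustration, not the code in imars_dags; `apply_removals` and its `actually_delete` parameter are made-up names.

```python
import os


def apply_removals(paths, actually_delete=False):
    """Delete the given files, or only report what would be deleted.

    Returns a list of (action, path) tuples so the run can be reviewed.
    """
    log = []
    for path in paths:
        if actually_delete:
            os.remove(path)
            log.append(("deleted", path))
        else:
            log.append(("would delete", path))
    return log
```

Running with the default first produces a list of "would delete" entries that can be checked against the duplicates query before anything is removed.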