LRO wea files not parsing correctly

Sierra-MC commented 2 months ago

From @gaelccc review

The output I get from reading a wea file (e.g., the ones here) is wrong. Somehow the decimal digits are not kept with the integer part but are assigned to the next column instead. For example, the number 0958.9 is read as 958, and the 9 is assigned to the next column. That column then reads "9 N", where N is the integer part of the number that belongs there (e.g., if the value is 23.4, the column reads "9 23"). The output type of that column is, in fact, str.

joss-review

cmillion commented 2 months ago

Hi @gaelccc. We are unable to reproduce the issue as you have described it.

Specifically using usps_2021279_0648.wea and usps_2021279_0648.lbl and the FMT file pulled from the archive, pdr produces the following table:

In [17]: d1 = pdr.read('data/usps_2021279_0648.wea')

In [18]: d1['WEAREC_TABLE']
Out[18]:
NAME  HOUR  MINUTE  TEMPERATURE  BAROMETRIC PRESSURE  RELATIVE HUMIDITY  WIND SPEED
0        6      49         15.9                987.5               39.6          29
1        6      59         20.4                985.4               28.0          24
2        7       9         20.4                985.1               28.5          18
3        7      19         20.7                985.1               27.4          25
4        7      29         20.4                985.1               28.3          19
5        7      39         20.5                985.1               28.2          27
6        7      49         20.5                985.1               27.8          17

This appears to agree completely with the values (aside from the padding zeroes) that I see if I simply open the same wea file in vi, e.g.:

20211006 279 USPS
06:49 015.9 0987.5 039.6 029
06:59 020.4 0985.4 028.0 024
07:09 020.4 0985.1 028.5 018
07:19 020.7 0985.1 027.4 025
07:29 020.4 0985.1 028.3 019
07:39 020.5 0985.1 028.2 027
07:49 020.5 0985.1 027.8 017

We confirmed this behavior with several other observations as well.

Can you please provide the specific sequences of commands that you used to generate this strange result and point us to copies of the specific WEA, LBL, and FMT files that you are using?

Cross-referencing: openjournals/joss-reviews#7256

gaelccc commented 2 months ago

Thank you for the prompt reply. Indeed, reading this file works well for me too. I ran into this issue while trying to read an older file: https://pds-geosciences.wustl.edu/lro/lro-l-rss-1-tracking-v1/lrors_0001/data/wea/lro_es_107/2021279/ku2s_2013070_0916.wea and its associated .lbl. I am using the same .FMT file (lro_wea_rec_ws3.fmt) in both cases. This is the output I get for this minimal example:

import pdr
fname = 'ku2s_2013070_0916.lbl'
data = pdr.read(fname)
data['WEAREC_TABLE']

OUT:
NAME  HOUR  MINUTE  TEMPERATURE  BAROMETRIC PRESSURE RELATIVE HUMIDITY  \
0        9      16         -8.0                  958             9 057   
1        9      21         -8.0                  958             9 057   
2        9      26         -8.0                  958             9 057   
3        9      31         -8.0                  958             8 056   
4        9      36         -8.0                  958             8 055   
5        9      41         -8.0                  958             8 056   
6        9      46         -8.0                  958             8 056   

NAME WIND SPEED  
0           1 0  
1           3 0  
2           0 0  
3           2 0  
4           7 0  
5           4 0  
6           NaN

This is what the .wea file looks like:

20130311 070 KU2S
09:16  -08.4  0958.9 057.1 011
09:21  -08.3  0958.9 057.3 008
09:26  -08.2  0958.9 057.0 007
09:31  -08.1  0958.8 056.2 011
09:36  -08.4  0958.8 055.7 011
09:41  -08.4  0958.8 056.4 008
09:46  -08.4  0958.8 056.4 010

Sierra-MC commented 2 months ago

@gaelccc I'm getting a "webpage is currently unavailable" error for that link and don't see a similarly named file in that folder here: https://pds-geosciences.wustl.edu/lro/lro-l-rss-1-tracking-v1/lrors_0001/data/wea/lro_es_107/2021279/

Is it possible the format file was updated since that file was available?

gaelccc commented 2 months ago

@Sierra-MC I somehow pasted the wrong link. Here is the correct one: https://pds-geosciences.wustl.edu/lro/lro-l-rss-1-tracking-v1/lrors_0001/data/wea/lro_es_06/2013070/

m-stclair commented 2 months ago

Thanks @gaelccc. We looked into this and discovered that the ku2s files in that directory have bad metadata. Like the usps files Sierra linked, these ku2s files specify the LRO_WEA_REC_WS3.FMT file for table structure, and state in their primary labels that they have 29 bytes per row. However, the ku2s files actually have 32 bytes per row, and their column boundaries do not match the column descriptions in the referenced FMT file.

We have not checked to see if this is consistent across all ku2s* files in the LRORS corpus. If it is, we could write a special case for this. However, when we discover data quality issues related to an ongoing mission, we prefer to first reach out to the data providers to let them know about the problem and see if they have the capability to fix it. Do you happen to know the appropriate person to contact about these files?
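The misread reported earlier in this thread can be reproduced by hand. The sketch below is not pdr's reader: the character offsets are hypothetical, inferred from the layout of the usps sample row quoted above rather than taken from LRO_WEA_REC_WS3.FMT, and both rows are copied as they were pasted in this thread. Slicing a ku2s row at those offsets gives the same split values that appear in the reviewer's output.

```python
# Hypothetical (start, stop) character offsets, inferred from the usps sample
# row in this thread. They are NOT read from LRO_WEA_REC_WS3.FMT.
ASSUMED_COLUMNS = {
    "HOUR": (0, 2),
    "MINUTE": (3, 5),
    "TEMPERATURE": (6, 11),
    "BAROMETRIC PRESSURE": (12, 18),
    "RELATIVE HUMIDITY": (19, 24),
    "WIND SPEED": (25, 28),
}


def slice_fixed_width(row, columns=ASSUMED_COLUMNS):
    """Cut one text row at fixed character offsets and strip padding."""
    return {name: row[start:stop].strip() for name, (start, stop) in columns.items()}


usps_row = "06:49 015.9 0987.5 039.6 029"    # 28 characters before the terminator
ku2s_row = "09:16  -08.4  0958.9 057.1 011"  # 30 characters: two extra spaces

good = slice_fixed_width(usps_row)
bad = slice_fixed_width(ku2s_row)

print(good["BAROMETRIC PRESSURE"], good["RELATIVE HUMIDITY"], good["WIND SPEED"])
print(bad["BAROMETRIC PRESSURE"], "|", bad["RELATIVE HUMIDITY"], "|", bad["WIND SPEED"])
```

With these assumed offsets the usps row splits cleanly, while the ku2s row yields "0958", "9 057", and "1 0", because its two extra spaces shift every column after the time.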

Sierra-MC commented 2 months ago

After looking into this further, I've confirmed that the number of bytes per row is not consistent across all the ku2s* files. We can implement a special case that simply ignores the format files and parses these products as delimiter-separated tables; our preference is still to bring this to the attention of the data providers and have them implement a fix.
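As a rough illustration of the delimiter-separated approach, here is a minimal sketch that ignores byte offsets entirely and splits each row on whitespace. It is not the code used in pdr; the function name `parse_wea_rows` and the assumption that the first line is a header followed by five whitespace-separated fields per row are mine, based only on the samples pasted in this thread.

```python
def parse_wea_rows(text):
    """Parse WEA-style text by splitting on whitespace instead of byte offsets.

    Assumes (from the samples in this thread) one header line followed by
    rows of the form 'HH:MM temperature pressure humidity wind_speed'.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:  # skip the header line
        time, temperature, pressure, humidity, wind = line.split()
        hour, minute = time.split(":")
        rows.append(
            {
                "HOUR": int(hour),
                "MINUTE": int(minute),
                "TEMPERATURE": float(temperature),
                "BAROMETRIC PRESSURE": float(pressure),
                "RELATIVE HUMIDITY": float(humidity),
                "WIND SPEED": int(wind),
            }
        )
    return rows


usps_text = "20211006 279 USPS\n06:49 015.9 0987.5 039.6 029\n"
ku2s_text = "20130311 070 KU2S\n09:16  -08.4  0958.9 057.1 011\n"

usps_rows = parse_wea_rows(usps_text)
ku2s_rows = parse_wea_rows(ku2s_text)
print(usps_rows[0])
print(ku2s_rows[0])
```

Because `str.split()` with no argument collapses runs of whitespace, the same function handles both the single-space and the double-space layout, which is the property that lets one special case cover affected and unaffected files.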

gaelccc commented 2 months ago

Indeed, I think it's better to inform the data provider. I sent a message to one of them; let's see. Thanks for digging into this. Generally speaking, I think it would be good to have some control mechanism that at least verifies that the parsed numbers make sense (e.g., that the int() or float() operation does not fail, although I'm afraid more complexity is required). I understand that adding this kind of control for all possible data formats would definitely be complex and require a lot of additional work, so I will not request it for this review. However, I suggest you consider some kind of control mechanism for future development.
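A minimal sketch of the kind of check suggested here, assuming the parsed fields are available as strings: try `float()` on each one and report the fields that fail. The helper name `non_numeric_fields` is hypothetical and is not part of pdr; the sample values are taken from the first row of the misread output posted above.

```python
def non_numeric_fields(record):
    """Return the names of fields whose text cannot be converted by float()."""
    failed = []
    for name, text in record.items():
        try:
            float(text)
        except ValueError:
            failed.append(name)
    return failed


# First row of the misread table posted earlier in this thread.
misread_row = {
    "TEMPERATURE": "-8.0",
    "BAROMETRIC PRESSURE": "958",
    "RELATIVE HUMIDITY": "9 057",
    "WIND SPEED": "1 0",
}

flagged = non_numeric_fields(misread_row)
print(flagged)  # ['RELATIVE HUMIDITY', 'WIND SPEED']
```

The check flags the two columns that came back as str, but the truncated temperature and pressure values still convert cleanly, which illustrates why a conversion test alone would not catch every misparse.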

m-stclair commented 2 months ago

You make a good point. We perform this kind of check on binary tables/arrays (it's fully built into how we read them), but not ASCII. Basically this is because we have found that people often play very fast and loose with data type specifications for columns/fields in ASCII files. However, it might be time to revisit this. This is not specific to the LRO products, so I have opened a separate issue if you would like to discuss further there.

Leaving this issue open in case Gael gets a response from the data provider.

Sierra-MC commented 2 months ago

@gaelccc I've created a full list of all affected ku2s files in the geo holdings (there are 293 of them). It is possible that some other wea files have this issue, but I have only checked the ku2s files, to avoid having to download the entire corpus to my laptop. I'm not sure whom you have contacted already, but perhaps this list would be helpful to them.

affected_files.csv
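For anyone who wants to run a similar check on other wea files, one possible approach is to compare the actual length of each data row, terminator included, with the row length declared in the label. This is only a sketch and not necessarily how the list above was produced: the function name is hypothetical, the first line is assumed to be a separate header, and the line terminators in the sample data are assumptions chosen so that the totals match the 29 and 32 bytes per row quoted earlier in this thread.

```python
DECLARED_ROW_BYTES = 29  # value the labels state, per the discussion above


def data_row_lengths(raw, skip_rows=1):
    """Return the set of byte lengths (terminators included) of the data rows."""
    rows = raw.splitlines(keepends=True)[skip_rows:]
    return {len(row) for row in rows}


# Sample contents built from rows pasted in this thread; terminators are assumed.
usps_like = (
    b"20211006 279 USPS\n"
    b"06:49 015.9 0987.5 039.6 029\n"
    b"06:59 020.4 0985.4 028.0 024\n"
)
ku2s_like = (
    b"20130311 070 KU2S\r\n"
    b"09:16  -08.4  0958.9 057.1 011\r\n"
    b"09:21  -08.3  0958.9 057.3 008\r\n"
)

usps_lengths = data_row_lengths(usps_like)
ku2s_lengths = data_row_lengths(ku2s_like)
print(usps_lengths == {DECLARED_ROW_BYTES})  # True: rows match the label
print(ku2s_lengths == {DECLARED_ROW_BYTES})  # False: rows are longer than declared
```

A file would be counted as affected whenever the set of observed lengths differs from the single declared value.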

Sierra-MC commented 1 month ago

Special case added in 64f5e6a7e9d80a1c60aad1e2f71b138713eda6dc. It is still preferable to have the affected files fixed by the data providers, but this special case works for both affected and unaffected files.

MillionConcepts / pdr

LRO wea files not parsing correctly #68