Error Loading Files with Repeated Station Identifiers

AtmosphericPhysicist commented 4 years ago

In the VSE16 and VSE17 datasets, some LMA stations share a single-letter station identifier because two different arrays were combined. This is not necessarily a problem for most analyses, but it prevents pyxlma from loading the data. Loading a .dat.gz file that contains two identical station identifiers raises the following error:

from pyxlma.lmalib.io.read import lmafile

lma = lmafile(r'C:\Users\tyjwe\Desktop\NALMA_170430_191000_0600.dat.gz').readfile()
Traceback (most recent call last):

  File "", line 1, in <module>
    lma = lmafile(r'C:\Users\tyjwe\Desktop\NALMA_170430_191000_0600.dat.gz').readfile()

  File "C:\Users\tyjwe\Desktop\Code_Repos\xlma-python\pyxlma\lmalib\io\read.py", line 104, in readfile
    lmad.insert(8,items,(mask_to_int(lmad["mask"])>>index)%2)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py", line 3473, in insert
    allow_duplicates=allow_duplicates)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals\managers.py", line 1149, in insert
    raise ValueError('cannot insert {}, already exists'.format(item))

ValueError: cannot insert Y, already exists
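For anyone who wants to see the pandas behavior without an LMA file, here is a minimal sketch. It assumes a toy frame with made-up values and is not pyxlma code: `DataFrame.insert` refuses a column label that already exists unless `allow_duplicates=True` is passed.

```python
import pandas as pd

# Toy stand-in for the parsed LMA data; the values are made up.
df = pd.DataFrame({"mask": [1, 2, 3]})

# The first station with symbol "Y" inserts cleanly.
df.insert(1, "Y", [0, 1, 0])

# A second station with the same symbol is rejected by pandas.
try:
    df.insert(1, "Y", [1, 0, 1])
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

The second `insert` call fails for the same reason as line 104 of read.py in the traceback: the station symbol is used directly as the column label, so a repeated symbol means a repeated label.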

How should we resolve this issue in pyxlma?

My initial thought would be to simply include an 'if' statement that automatically alters the problem station identifier to another letter. However, if the problem station is not the last in the file, then the issue and fix could repeat and many stations would have altered identifiers, making it confusing to discuss the results and compare to other data files/years. Thoughts?

An example trouble file has also been attached: NALMA_170430_191000_0600.dat.gz

vbalderdash commented 4 years ago

Is the mask order always the reverse of the station information? Since the logic pulls the station symbol from a list of objects, is there any reason to keep it limited to one character? We could append a digit to non-unique characters, or we could use the list of stations to replace the repeated values with the full station name string. If we go with the second option, maybe it makes more sense to change all of the identifiers to the full station names in these (or all) cases?
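The first option, appending a digit to non-unique characters, could look roughly like the sketch below. The helper name `dedupe_symbols` and the example symbols are hypothetical and are not part of pyxlma.

```python
from collections import Counter


def dedupe_symbols(symbols):
    """Append a running digit to every symbol that occurs more than once."""
    totals = Counter(symbols)  # how often each symbol appears overall
    seen = Counter()           # how many copies we have labeled so far
    unique = []
    for symbol in symbols:
        if totals[symbol] > 1:
            seen[symbol] += 1
            unique.append(f"{symbol}{seen[symbol]}")
        else:
            unique.append(symbol)
    return unique


print(dedupe_symbols(["A", "Y", "B", "Y"]))  # ['A', 'Y1', 'B', 'Y2']
```

Every copy of a repeated symbol gets a suffix, so the result does not depend on where in the file the repeated station sits, and stations with unique symbols keep their original letter.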

vbalderdash commented 4 years ago

That file was a good test. Not only are there two 'Y's, there are also two 'Water's. I am updating the read function to keep the columns in [symbol]_[station] format, which I hope will keep them unique.
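As a rough illustration of the [symbol]_[station] idea, here is a sketch that builds the column labels from (symbol, name) pairs. The function name and the station table are hypothetical, and the actual change to the read function may differ.

```python
def station_columns(stations):
    """Build one column label per station as '<symbol>_<name>'."""
    return [f"{symbol}_{name}" for symbol, name in stations]


# Hypothetical station table: two stations that both use the symbol "Y".
stations = [("A", "Alpha"), ("Y", "Yankee"), ("Y", "Yellow")]
labels = station_columns(stations)

# The labels are only unique if no two stations share both symbol and name,
# so a check like this can flag files where the scheme still collides.
is_unique = len(set(labels)) == len(labels)

print(labels, is_unique)
```

If a file ever lists the same symbol and the same station name twice, the combined label would still repeat, which is why the uniqueness check is worth keeping next to the naming step.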

deeplycloudy commented 4 years ago

Thanks, @vbalderdash!

On Sep 2, 2019, at 20:40, vbalderdash notifications@github.com wrote:

Closed #8.

