OpenDendro / dplPy

The Dendrochronology Program Library for Python
https://opendendro.org/python/
GNU General Public License v3.0

readers.py chokes on rwl when there are repeated series IDs #33

Open kanchukaitis opened 2 years ago

kanchukaitis commented 2 years ago

This is a common problem with dplR too: readers.py needs a way to deal with repeated sample IDs, either by issuing a verbose warning or by modifying the sample ID (e.g. appending an underscore). More generally, we need to test readers.py against a variety of .rwl files, not just the idealized test ones, and we need informative error messages.

Attached example: viet001.rwl.txt

AndyBunn commented 2 years ago

The error message should make the user feel shame about submitting a file with duplicated series

CosiMichele commented 2 years ago

With @ifeoluwaale back, we can address this. @kanchukaitis do you have an error message we can look at? Or is the file you uploaded here an example of an error-causing input file?

Edit: are there any other specific sample files from ITRDB that we can look at? So @ifeoluwaale can break readers.py some more (and fix all the problems)

kanchukaitis commented 2 years ago

Hi @CosiMichele @ifeoluwaale - yeah, viet001.rwl is giving an error. Ideally, we should now aim to be able to fetch ANY ITRDB rwl file and read it in (not just the 3 test files we have). Even when a read ultimately fails, we need verbose error output showing where the failure occurs. But yes, let's start with the attached viet001.rwl and go from there.
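To make "verbose error output of where the failure is occurring" concrete, here is a minimal sketch of a parse error that carries the file name, line number and offending line. Everything in it is an assumption for illustration: `RwlParseError` and `parse_data_line` are hypothetical names, not part of dplPy, and the layout (series ID in the first 8 characters, then whitespace-separated integers starting with the decade year) is a simplification of the Tucson format that many real files will violate.

```python
class RwlParseError(ValueError):
    """Hypothetical exception: reports where in the file parsing failed."""

    def __init__(self, path, lineno, line, reason):
        self.path = path
        self.lineno = lineno
        self.line = line
        self.reason = reason
        super().__init__(f"{path}, line {lineno}: {reason}\n    {line!r}")


def parse_data_line(line, lineno, path="<string>"):
    """Parse one data line into (series_id, decade_year, ring_widths).

    Assumes the series ID occupies the first 8 characters and the rest of
    the line is whitespace-separated integers (year first).
    """
    series_id = line[:8].strip()
    fields = line[8:].split()
    if not series_id:
        raise RwlParseError(path, lineno, line, "missing series ID")
    if not fields:
        raise RwlParseError(path, lineno, line, "no year or ring widths found")
    try:
        numbers = [int(field) for field in fields]
    except ValueError as exc:
        raise RwlParseError(path, lineno, line, f"non-integer field ({exc})") from exc
    return series_id, numbers[0], numbers[1:]


# Usage: a malformed year is reported with its file and line number.
try:
    parse_data_line("XYZ01A  19x0   123   145", lineno=7, path="example.rwl")
except RwlParseError as err:
    print(err)
```

The point of the sketch is only that the message names the file, the line number and the raw line, so a user can open the file and see the problem directly.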

kanchukaitis commented 2 years ago

@CosiMichele @ifeoluwaale - here are some others to try with various challenges:

https://www.ncei.noaa.gov/pub/data/paleo/treering/measurements/asia/th001.rwl
https://www.ncei.noaa.gov/pub/data/paleo/treering/measurements/northamerica/canada/cana157.rwl
https://www.ncei.noaa.gov/pub/data/paleo/treering/measurements/northamerica/canada/cana323.rwl

All three have some classic challenges typical of some of the LDEO rwl files (particularly in series names)

AndyBunn commented 2 years ago

Another one to code for is the rare (but real) case where years go back before 1 CE. The negative year subscript causes issues with series names, e.g.:

https://www.ncei.noaa.gov/pub/data/paleo/treering/measurements/northamerica/usa/ca667.rwl
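One way the negative-year case could be handled is sketched below. It rests on an assumption that has not been checked against ca667.rwl: that the ID and year share a 12-character header (ID in columns 1-8, year in columns 9-12) and that a negative year with four digits pushes its minus sign into column 8. `split_id_and_year` is a hypothetical helper, not a dplPy function, and the series IDs are made up.

```python
import re

# A trailing signed year of 1-4 digits, e.g. "-505" or "-1005".
_NEGATIVE_YEAR = re.compile(r"^(?P<id>.*?)\s*(?P<year>-\d{1,4})\s*$")


def split_id_and_year(header):
    """Split a 12-character header into (series_id, year).

    If the header ends in a negative year, split at the minus sign;
    otherwise fall back to fixed columns (ID in 1-8, year in 9-12).
    """
    match = _NEGATIVE_YEAR.match(header)
    if match:
        return match.group("id").strip(), int(match.group("year"))
    return header[:8].strip(), int(header[8:])


print(split_id_and_year("XYZ01A  1950"))  # positive year, fixed columns
print(split_id_and_year("XYZ01A -1005"))  # minus sign sits in column 8
```

This is a heuristic: a series ID that itself ends in a hyphen followed by digits can be ambiguous, so a real reader would probably want to warn when it takes the negative-year branch.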

kanchukaitis commented 1 year ago

@ifeoluwaale is looking into whether/how we solved this before we close it. The procedure would be to (1) identify repeated sample IDs, (2) warn/yell at the user, and then optionally (3) rename one or more series in a predictable way, but without using common core IDs (A, B, C, etc. risk making the problem worse). @AndyBunn, any opinions on the best way to deal with this?
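For discussion, the three steps could look roughly like the sketch below. It is not the current readers.py behaviour: `dedupe_series_ids` is a hypothetical helper, it assumes the reader has already collected one ID per series block in file order, and the numeric suffix (`_2`, `_3`, ...) is just one possible renaming scheme that avoids letter suffixes like A, B, C.

```python
import warnings
from collections import Counter


def dedupe_series_ids(ids, rename=False, sep="_"):
    """Warn about repeated series IDs and optionally rename the repeats.

    The first occurrence of an ID is kept; later occurrences get a numeric
    suffix that does not collide with any ID already in use.
    """
    counts = Counter(ids)
    repeated = sorted(name for name, n in counts.items() if n > 1)
    if not repeated:
        return list(ids)

    # Step 2: tell the user exactly which IDs are repeated.
    warnings.warn(
        "Repeated series IDs found: " + ", ".join(repeated),
        UserWarning,
        stacklevel=2,
    )
    if not rename:
        return list(ids)

    # Step 3: rename later occurrences in a predictable way.
    taken = set(ids)  # every ID already present in the file
    seen = set()
    result = []
    for name in ids:
        if name not in seen:
            seen.add(name)
            result.append(name)
            continue
        k = 2
        candidate = f"{name}{sep}{k}"
        while candidate in taken:
            k += 1
            candidate = f"{name}{sep}{k}"
        taken.add(candidate)
        result.append(candidate)
    return result


# Usage with made-up IDs: the second "VT01A" is renamed, skipping the
# suffix "_2" because "VT01A_2" already exists in the file.
print(dedupe_series_ids(["VT01A", "VT01B", "VT01A", "VT01A_2"], rename=True))
```

Renaming is off by default here so that the warning is the baseline behaviour and any change to the IDs is something the user opts into.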