LinkedEarth / pylipd

Development repository for Python LiPD utilities
https://pylipd.readthedocs.io/en/latest/
Apache License 2.0
2 stars 0 forks source link

archiveType parsing error #55

Open fzhu2e opened 2 months ago

fzhu2e commented 2 months ago

In certain cases there could be some parsing error. Not sure if it's only limited to archiveType.

For example, when load in MS07LSPG.lpd from Iso2k, the archiveType appears to be "stott.2007.md98-2176", which is actually the "collectionName" according to the raw metadata.

image

Please see attached notebook (data included) for the details: debug.ipynb.zip

varunratnakar commented 2 months ago

I've added in a Fix.. Please try it out

fzhu2e commented 2 months ago

I've added in a Fix.. Please try it out

I think it works now regarding archiveType. Thanks for the prompt response!

Another related issue: There are some records have "paleoData_values" while don't have "year".

image

IC066THPR.lpd is attached as an example IC06THPR.lpd.zip

varunratnakar commented 2 months ago

The issue seems to be that those variables have the name "year start".

We map variables to the standard variables based on the synonyms, in which case "year start" would be mapped to "year". So now I've changed the code to do that, and you should have "year" available

fzhu2e commented 2 months ago

Thanks Varun! With the update, IC066THPR is good now but the issue remains for the others (another one is attached).

image

CO17WUBO1B.lpd.zip

varunratnakar commented 2 months ago

Hmm, strange issue with this one..

According to https://lipdverse.org/vocabulary/paleodata_variablename/#uncertainty , which is what I used to create synonyms, The variable "uncertainty" has a synonym called "age". Not sure if this should be the case, but in any case, because of this.. the "age" variable mapping is overruled by the mapping to the "uncertainty" variable.

For now, I'm removing this synonym for uncertainty to fix the issue.

Also, you'll have to change the code to check for the column "age" as well.

fzhu2e commented 2 months ago

It's indeed strange. The "age" column works now, but just FYI, for certain records, even with variables not "uncertainty", they don't have "year" and only have "age".

(I'm not showing "age" below but they exist)

image
varunratnakar commented 2 months ago

I don't understand if this is still an issue ? If so, could we elaborate ?

fzhu2e commented 2 months ago

I don't understand if this is still an issue ? If so, could we elaborate ?

Well, it's practically inconvenient to get "age" instead of "year". Is it the metadata issue or should I expect that pylipd can do the inference?

varunratnakar commented 2 months ago

Hmm, we could add a function in pylipd that asks for a list of all temporal variables (age, year, etc). Any ideas @khider ?

khider commented 2 months ago

You should get either "age" or "year". "year" shouldn't be replaced by "age".

fzhu2e commented 2 months ago

You should get either "age" or "year". "year" shouldn't be replaced by "age".

The example data is Iso2k so "year" makes more sense than "age"? (The numbers are something like "1415.8" so definitely not age). Do you think this is just errors in the metadata when it was reported?

khider commented 2 months ago

We don't enforce it at the LiPD level. So maybe. Have you tried to open the file by unzipping it and looking at the JSON?

fzhu2e commented 2 months ago

We don't enforce it at the LiPD level. So maybe. Have you tried to open the file by unzipping it and looking at the JSON?

Yes, but I'm not familiar with the data and cannot judge. It seems that the metadata uses "Age" though, and I'm not sure whether it is correct or not. It will make a difference with two different interpretations (year vs age).

The raw data is attached: data.zip

The time axis starts from 1949.2917, which seems like "year" to me. They used units as "AD" which seems inconsistent with "Age"?

"paleoData": [
    {
      "measurementTable": [
        {
          "tableName": "paleo1measurement1",
          "missingValue": "NaN",
          "columns": [
            {
              "measurementTableName": "measurementTable1",
              "TSid": "WEBa97612cc2",
              "description": "carbonate",
              "hasMaxValue": 56.0417,
              "hasMeanValue": 28.375,
              "hasMedianValue": 28.375,
              "hasMinValue": 0.7083,
              "iso2kPrimaryTimeseries": "FALSE",
              "measurementTableMD5": "f49183051b8467faded7a7bfad4fb107",
              "units": "AD",
              "variableName": "age",
              "hasChron": "0",
              "hasPaleoDepth": "0",
              "iso2kCertification": "DMT (1/14/19)",
              "measurementMaterial": "coral",
              "QCCertification": "DMT (1/14/19)",
              "unitsOriginal": "years",
              "variableNameOriginal": "Age",
              "interpretation": [
khider commented 1 month ago

Yeah, we don't reason if inconsistent right now. The variable name should be year in the file itself.

fzhu2e commented 1 month ago

Yeah, we don't reason if inconsistent right now. The variable name should be year in the file itself.

Gotcha. Thanks!