International-Soil-Radiocarbon-Database / ISRaD

Repository for the development and release of ISRaD data and tools
https://international-soil-radiocarbon-database.github.io/ISRaD/

Treat synthesis #166

Closed: aahoyt closed this issue 1 year ago

aahoyt commented 5 years ago

This issue is to keep track of outstanding work to complete the Treat synthesis.

Big picture:
- Site lat/lon issue in progress.
- Many of the datasets are not found in file S1, which has the values necessary to fill the required column lyr_observation_y, so these datasets are failing QAQC.
- Errors/incompatibilities in the Treat data (will need manual curation).

Minor/technical fixes outstanding:
- Data based on Pb210 should be removed. Need to filter based on "Method comm" in S3: remove a data entry row if "Method comm" == "210Pb" (see the sketch after this list).
- "lyr_all_org_neg" = "yes" for all entries in this dataset.
- Is sample thickness included into the depth ranges? Do we care? (check with Claire)
- Issue with ages: conversion from ages in years to fraction modern got lost & needs to be updated/checked. Conversion is needed (see below), then data should be entered in the fraction_modern columns rather than 14c columns.
- Radiocarbon data matched to "sd" should be matched to "sigma". The newly filled columns should be: frc_fraction_modern, frc_fraction_modern_sigma, lyr_fraction_modern, lyr_fraction_modern_sigma.
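
For the Pb210 filter above, a minimal sketch of what that removal could look like, assuming Table S3 is read into a data frame with a column literally named "Method comm" (the file name here is hypothetical, not the actual script):

```r
# Sketch only: drop Pb210-dated rows before filling the ISRaD template.
library(readxl)

treatS3 <- read_excel("Treat_S3.xlsx")  # hypothetical file name

# Keep rows whose dating method is not 210Pb (NA methods are retained).
keep    <- is.na(treatS3[["Method comm"]]) | treatS3[["Method comm"]] != "210Pb"
treatS3 <- treatS3[keep, ]
```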

To convert from ages to fraction modern (see the FAQ under Contributing/Radiocarbon Data: https://international-soil-radiocarbon-database.github.io/ISRaD/template_faq/):

fraction modern = exp(-8033 * age)
fraction modern error (sigma) = fraction modern * measured age error / 8033

where ages are in years. Note that in the Treat synthesis the ages are in kyr, so first multiply by 1000.

greymonroe commented 5 years ago

Is sample thickness included into the depth ranges?

Yes, it is used to calculate lyr_top in S3

conversion from ages in years to fraction modern got lost & needs to be updated/checked. Conversion is needed (see below), then data should be entered in the fraction_modern columns rather than 14c columns.

@aahoyt can we set up a Skype call to talk about this? I want to make sure I do this correctly. Are you free anytime today (Friday, Feb 8)?

aahoyt commented 5 years ago

great on thickness!

And yes, @greymonroe, I'll message you on Skype.

greymonroe commented 5 years ago

Bhiry and Robert, 2006 > not in S1
Dredge and Mott, 2003 > not in S1
Jones et al., 2009 > not in S1
Jones et al., 2012 > elevation appears to have been entered incorrectly
Jones et al., 2014 > not in S1
Kremenetski, 1997 > depth missing
Kuhry, 1997 > not in S1
Lavoie and Payette, 1995 > not in S1
Loisel et al., 2014 > values greater than accepted max in lyr_bd_samp
MacDonald, 1983 > not in S1
Myers-Smith et al., 2008 > Duplicate layer row identified. I think it has to do with reporting two different labs, but I'm not sure...
Nichols, 1967 > not in S1
O'Donnell et al., 2011 > values smaller than accepted min in lyr_bd_samp column
Oksanen, 2002 > not in S1
Panova et al., 2010 > not in S1
Payette, 1988 > not in S1
Vardy et al., 1998; 2000 > values smaller than accepted min in lyr_bd_samp column
Werner et al., 2010 > not in S1
Zibulski et al., 2013 > not in S1
Zoltai and Johnson, 1985 > not in S1
Zoltai et al., 2000 > not in S1
Zoltai, 1993 > not in S1
Zuidhoff and Kolstrup, 2000 > not in S1

greymonroe commented 5 years ago

the fraction_modern formula is returning 0 or Inf for everything...

Also, why would there be negative values for treatS3$`Age dated [ka]`?

aahoyt commented 5 years ago

Negative ages will happen if you convert from fraction modern units to radiocarbon years for fraction modern values > 1 (i.e., values more modern than the standard), but the result becomes meaningless. Let's ask Claire what the intent was there.
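
For reference, the conventional relationship between radiocarbon age and fraction modern makes the sign flip easy to see. A minimal illustration (not code from the ISRaD scripts):

```r
# Conventional radiocarbon age from fraction modern (Stuiver & Polach convention).
# For Fm > 1 (post-bomb, "greater than modern") log(Fm) > 0, so the age is negative.
rc_age <- function(fm) -8033 * log(fm)

rc_age(0.78)  # about 1996 years BP
rc_age(1.10)  # about -766 years: negative, hence meaningless as a conventional age
```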

That's a lot missing from S1!

greymonroe commented 5 years ago

Ok. What about everything being equal to 0? Am I doing the formula correctly?

Here's how I'm calculating it, with an example for ka = 2:

treatS3_layers$lyr_fraction_modern <- exp(-8033 * (treatS3_layers$`Age dated [ka]` * 1000))

exp(-8033 * (2 * 1000))
[1] 0

aahoyt commented 5 years ago

Sorry, my typo! It should be Fm = exp(age / -8033), e.g. exp((2 * 1000) / -8033) = 0.78. See https://international-soil-radiocarbon-database.github.io/ISRaD/template_faq/

aahoyt commented 5 years ago

The error formula did not have a typo. It is still: error_Fm = Fm * error_age / 8033

So... error_Fm [goes into the sigma columns in the template] = Fm [just calculated above] * error_age [from Claire's column "Age std e"] / 8033
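
Putting the two corrected formulas together, a minimal R sketch (not the actual compilation script; the column names `Age dated [ka]` and `Age std e`, the toy values, and the kyr units for the error are assumptions):

```r
# Sketch only: hypothetical layer table mirroring the Treat S3 columns discussed above.
treatS3_layers <- data.frame(check.names = FALSE,
  `Age dated [ka]` = c(2, 5.5),
  `Age std e`      = c(0.03, 0.04)   # assumed: 1-sigma age uncertainty, also in kyr
)

age_yr <- treatS3_layers$`Age dated [ka]` * 1000   # kyr -> years
err_yr <- treatS3_layers$`Age std e` * 1000

treatS3_layers$lyr_fraction_modern       <- exp(age_yr / -8033)
treatS3_layers$lyr_fraction_modern_sigma <- treatS3_layers$lyr_fraction_modern * err_yr / 8033

exp((2 * 1000) / -8033)  # ~ 0.78, matching the example above
```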

greymonroe commented 5 years ago

OK, I fixed the fraction modern equation and the values make sense now. How do we want to proceed with the files that are passing?

aahoyt commented 5 years ago

Great! Let's have Claire do a final check of the passing files.

greymonroe commented 5 years ago

Btw, unfortunately I found some issues with the site naming that required me to use reference_lat_lon for the site name. I wrote the code to turn duplicated sites into profiles, but ran into some fatal errors caused by the data. Basically, some datasets had different site names (for the same site) in S2 and S3. When the code tried to turn these sites into profiles, QAQC not only failed but crashed. I can try another workaround, but for now this is where we are.
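
For illustration, a rough sketch of the kind of keying described here, with made-up column names and values (not the actual workaround code):

```r
# Sketch: key sites by reference + coordinates so the same physical site gets one
# site_name even when S2 and S3 spell its descriptive name differently.
sites <- data.frame(
  reference = c("Jones et al., 2012", "Jones et al., 2012"),
  site      = c("Bog A", "Bog A core 2"),   # inconsistent names for the same site
  lat       = c(65.12, 65.12),
  lon       = c(-147.50, -147.50)
)

sites$site_name <- paste(sites$reference, sites$lat, sites$lon, sep = "_")

# Rows sharing a site_name would then be written out as separate profiles
# (pro_name) nested under that single site.
split(seq_len(nrow(sites)), sites$site_name)
```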

aahoyt commented 5 years ago

bummer! thanks for the update

greymonroe commented 5 years ago

The latest update appears to have fixed most of the issues. Here are the last few:

Zoltai et al., 2000 > coring year not reported in S1
QAQC_Vardy et al., 1998; 2000 > values smaller than accepted min in lyr_bd_samp column
Panova et al., 2010 > found twice in S1, one of which has NA for coring year
Okassen 2002 > missing values where required: site_lat site_long
Odonnell 2011 > values smaller than accepted min in lyr_bd_samp column
Nichols 1967 > Duplicate layer row identified.
Myers smith 2008 > Duplicate layer row identified. ( row/s: 85 )
QAQC_Loisel et al., 2014 > values greater than accepted max in lyr_bd_samp column (rows 40 )
Kremenetski, 1997 > missing values where required: lyr_top lyr_bot
QAQC_Jones et al., 2012 > Duplicate profile row identified. ( row/s: 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 )

greymonroe commented 5 years ago

a thought that might be relevant for our discussion about including these data in the first DOI... do we want to have these datasets (the ones passing QAQC) go through some form of expert review before we compile them with everything?

aahoyt commented 5 years ago

Yes, that would be a good idea!

When we checked each of the He synthesis files manually we did find issues, so it would be worth doing something similar for the Treat synthesis before we merge them. So it seems like we shouldn't wait on the DOI after all.

cctreat commented 5 years ago

Zoltai et al., 2000 > coring year not reported in S1
Updated Table S1 to include the coring year.

QAQC_Vardy et al., 1998; 2000 > values smaller than accepted min in lyr_bd_samp column
Extracted this value from the paper; 0 seems to be real (or not measured).

Panova et al., 2010 > found twice in S1, one of which has NA for coring year
Revised Table S1.

Okassen 2002 > missing values where required: site_lat site_long
Revised in Table S3.

Odonnell 2011 > values smaller than accepted min in lyr_bd_samp column
The original datasets that I have show 0.

Nichols 1967 > Duplicate layer row identified.
Myers smith 2008 > Duplicate layer row identified. ( row/s: 85 )

QAQC_Loisel et al., 2014 > values greater than accepted max in lyr_bd_samp column (rows 40 )
Some funny problem with the references: it should be Tarnocai 2010, not Loisel. Updated in Tables S1-S3. The value is real in the datasheet, perhaps a typo from the original entry by the author. Not sure what to do, other than QC removal.

Kremenetski, 1997 > missing values where required: lyr_top lyr_bot
Revised data in Table S3.

QAQC_Jones et al., 2012 > Duplicate profile row identified. ( row/s: 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 )
I don't really understand why this is happening. Is there something going on in the manual data processing? I am updating Table S1 to include the profiles (it seemed to be missing many) and editing Tables S2 and S3 to match the coordinates in Table S1.

Treat_S1_edit.xlsx Treat_S2_edit.xlsx Treat_S3_edit.xlsx

cctreat commented 5 years ago

Still outstanding issues:

Nichols 1967 > Duplicate layer row identified.
Myers smith 2008 > Duplicate layer row identified. ( row/s: 85 )
QAQC_Jones et al., 2012 > Duplicate profile row identified. ( row/s: 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 )
Did I fix all the incorrect attributions to Loisel et al. 2014? (should only be 2 cores)

cctreat commented 5 years ago

Nichols 1967 > Duplicate layer row identified.
The sample had 2 replicate 14C analyses; the depth was incorrect on the 3rd. Revised Table S3 accordingly.

Myers smith 2008 > Duplicate layer row identified. ( row/s: 85 )
The bulk sample was run twice at different labs (UCI, Livermore).

QAQC_Jones et al., 2012 > Duplicate profile row identified. ( row/s: 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 )
Can we flag this for follow-up, to see if it still happens with the revised dataset?

Did I fix all the incorrect attributions to Loisel et al. 2014? (should only be 2 cores)
Ditto.

Treat_S3_edit.xlsx

greymonroe commented 5 years ago

Updated QAQC results using the newest datasheets...

Loisel et al., 2014.xlsx WARNING: missing values where required: lyr_obs_date_y ... (many). Not sure what's going on exactly, but it's probably related to the reference issue.

Zoltai et al., 2000.xlsx WARNING: Duplicate layer row identified. ( row/s: 8484 8486 ) Looks like these "layers" are actually fractions?

Issues that I think can be fixed at individual dataset level:

Myers-Smith et al., 2008.xlsx WARNING: Duplicate layer row identified. ( row/s: 85 )

Nichols, 1967.xlsx WARNING: Duplicate layer row identified. ( row/s: 31 37 42 44 )

O'Donnell et al., 2011.xlsx WARNING values smaller than accepted min in lyr_bd_samp column (rows 28 50 )

Tarnocai, 2010.xlsx WARNING values greater than accepted max in lyr_bd_samp column (rows 71 )

Vardy et al., 1998; 2000.xlsx WARNING values smaller than accepted min in lyr_bd_samp column (rows 107 )

cctreat commented 5 years ago

Loisel et al., 2014.xlsx WARNING: missing values where required: lyr_obs_date_y ... (many). Not sure what's going on exactly, but it's probably related to the reference issue.
The remaining cores weren't in Table S1. Updated Table S1.

Zoltai et al., 2000.xlsx WARNING: Duplicate layer row identified. ( row/s: 8484 8486 ) Looks like these "layers" are actually fractions?
Yes, it does. That's what the original dataset shows: two fractions. Not sure how they were determined, but okay. Also, manual check shows that the sampling year is incorrect throughout the sheet. (lyr_obs_date_y)

Issues that I think can be fixed at individual dataset level:
Yes, these are issues in the individual datasets. I've double-checked these and the data are processing correctly; these are the data that I have. I don't know that they need to be addressed in the underlying datasets. I think that if you want to keep your QC criteria strict, then these profiles (or layers) should be removed.

Myers-Smith et al., 2008.xlsx WARNING: Duplicate layer row identified. ( row/s: 85 )
They ran these radiocarbon dates twice on the same sample and report both dates.

Nichols, 1967.xlsx WARNING: Duplicate layer row identified. ( row/s: 31 37 42 44 )
They ran these radiocarbon dates twice on the same sample and report both dates, then discuss how they don't believe either one of them.

O'Donnell et al., 2011.xlsx WARNING: values smaller than accepted min in lyr_bd_samp column (rows 28 50 )
Issue in the underlying dataset with 1 profile.

Tarnocai, 2010.xlsx WARNING: values greater than accepted max in lyr_bd_samp column (rows 71 )
Issue in the underlying dataset with 1 profile.

Vardy et al., 1998; 2000.xlsx WARNING: values smaller than accepted min in lyr_bd_samp column (rows 107 )
Issue in the underlying dataset with 1 profile.

Treat_S3_edit.xlsx Treat_S1_edit.xlsx Treat_S2_edit.xlsx

greymonroe commented 5 years ago

Also, manual check shows that the sampling year is incorrect throughout the sheet. (lyr_obs_date_y)

So after checking into this, I realized that the code was matching coring_year to the reference (entry_name in ISRaD) in S1, rather than to 'ID (Auth-Site-CoreID)' (pro_name in ISRaD). This is a problem because I just found that many 'ID (Auth-Site-CoreID)' values are not in S1. Are all of the 'ID (Auth-Site-CoreID)' supposed to be in S1?
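
For context, a minimal sketch of the intended lookup, matching on the core ID rather than the reference. The table contents and exact column names are assumptions based on the thread, not the real script:

```r
# Sketch: pull coring_year from Table S1 by core ID (pro_name) rather than by
# reference (entry_name), so each profile gets its own year.
s1 <- data.frame(check.names = FALSE,
  `ID (Auth-Site-CoreID)` = c("Treat-SiteA-1", "Treat-SiteA-2"),
  reference               = c("Treat et al.", "Treat et al."),
  coring_year             = c(2009, 2011)
)
s3 <- data.frame(check.names = FALSE,
  `ID (Auth-Site-CoreID)` = c("Treat-SiteA-2", "Treat-SiteA-1", "Treat-SiteB-1")
)

idx <- match(s3$`ID (Auth-Site-CoreID)`, s1$`ID (Auth-Site-CoreID)`)
s3$lyr_obs_date_y <- s1$coring_year[idx]   # NA where the core ID is missing from S1
```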

cctreat commented 5 years ago

Are all of the 'ID (Auth-Site-CoreID)' supposed to be in S1?

That was my original idea. The fact that so many of the profiles are missing from S1 might mean that I had a reason to remove them (maybe they didn't make it into the peat property analysis, which was the original use of the file), but it doesn't make much sense to me now. I don't know why the layers stayed in Tables S2 and S3 but not in Table S1.

I will need my other computer to figure this out (why I took them out) and to add them back. It looks like the lat/long info for some of the sites that are in S3 but not in Table S1 will need to be updated (Holmquist), but many of the others are okay.

I will work on this during the afternoon and tomorrow morning. I was planning to work on getting some numbers out of this dataset anyways, so I will investigate further.

Sorry about this. Everything always seems fine until someone actually wants to do something with your datasets....

greymonroe commented 5 years ago

I will work on this during the afternoon and tomorrow morning.

OK sounds great. Let me know when you're ready with the new files

cctreat commented 5 years ago

Grey, this is going to be a bit more intensive than I realized. I can't easily see where these sites fell out, although I do have the list of them. So I will need to add these manually, which will take a bit of time (but hopefully not too long, except that things are a little chaotic here because I'm moving this week). Stay tuned!

aahoyt commented 5 years ago

Some general issues identified by the expert review that could be fixed in the scripting:

Definitely Fix:

Details: https://docs.google.com/spreadsheets/d/1SKNlsYIAgLapBfsG6bK4PdQQNLAHFDvBl8L5zi2Gr44/edit#gid=0
Summary: https://docs.google.com/spreadsheets/d/1SKNlsYIAgLapBfsG6bK4PdQQNLAHFDvBl8L5zi2Gr44/edit#gid=848513150

greymonroe commented 5 years ago

frc_input is filled with the frc_name, but should be lyr_name of the associated dummy layer

@aahoyt can you explain? Not sure I totally understand.

jb388 commented 5 years ago

frc_input should never be the same as frc_name. Unless there's a sequential fractionation scheme, frc_input will always be the same as lyr_name for a given record.
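
A minimal sketch of that rule for a single-step fractionation scheme, with invented names (not the actual template-filling code):

```r
# Sketch: frc_input points back to the layer the fraction came from,
# i.e. the lyr_name of the associated (dummy) layer, not the fraction's own name.
frc <- data.frame(
  lyr_name = c("core1_0-10cm", "core1_0-10cm"),
  frc_name = c("core1_0-10cm_frc_a", "core1_0-10cm_frc_b")
)
frc$frc_input <- frc$lyr_name   # not frc$frc_name
```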

greymonroe commented 5 years ago

frc_input should never be the same as frc_name. Unless there's a sequential fractionation scheme, frc_input will always be the same as lyr_name for a given record.

OK, I got it now. Fixed.

greymonroe commented 5 years ago

Should fill in the frc_obs_date_y where possible (e.g. should match the dummy layer date)

Is this different from what is done with ISRaD.extra?

greymonroe commented 5 years ago

frc_input is filled with the frc_name, but should be lyr_name of the associated dummy layer

Done

frc_property should say "macrofossil". Didn't we have this before? Seems to be missing on many (all?) templates

Done

"Dated Material" should go into frc_note . Didn't we have this before? e.g. frc_note: "Dated material: plant debris"

Done

Loss of descriptive site name "Site". Possible options: (1) use this + coords for the site name (instead of entry_name + coords), or (2) put the descriptive name into the "site_note" field.

Done

Similarly, "Core" could go into pro_note

Having trouble with this since "Core" isn't found in sheet 2 for some reason.

cctreat commented 5 years ago

Hi guys,

I started going through some of the issues with the Auth.Site.CoreID mismatch between the tables.

Fixing it is taking forever because many of the cores that have radiocarbon data in Table S3 were never entered into Table S1 (they didn't meet my study criteria, but I had already extracted the data). So I have to go back to the original papers to pull out and enter the data for Table S1, which is incredibly time intensive.

Most of these cores are not in the archived dataset on Pangaea that we were originally trying to pull from; I think they came from data that I sent Alison over email.

Should we limit the scope a bit more and just go with the cores in the Pangaea repository, and I update that dataset in the repository? There are about 6-7 of those sites. Then we can filter Table S2 in the repository to include only the cores that actually have radiocarbon?

Regardless, I think we should probably filter Table S2 to include only the cores that have radiocarbon data.

Thoughts?

Claire

aahoyt commented 5 years ago

Thanks for looking into this!

Yes, limiting the scope is fine. We can definitely limit to using the Pangaea files. We had a script pulling the files from the website but at some point switched to the offline version; I think we assumed they were the same. So we should be able to switch it back.

Would filtering S2 to only the cores that have radiocarbon data reduce the amount of work, or what is the motivation there? We can certainly do it, but it also doesn't hurt to include the additional data.

The goal in checking for discrepancies across S1/S2/S3 is to catch errors and make sure the data make sense, not to add a huge amount of extra work for you! :)

jb388 commented 1 year ago

Closing this, as the Treat data has been added.