PBrockmann / PANGAEA_Scraping

web scraping from the OA-ICC PANGAEA
MIT License

Figure of variables of the seawater carbonate system reported in the data sets #4

Closed Yan-yang35 closed 2 years ago

Yan-yang35 commented 2 years ago

Hi Patrick,

Any progress on the figure of the seawater carbonate system variables reported in the datasets, similar to the attached one? The histogram should have one bar for TA, one for pH (one color for the total scale, one color for other scales), one for DIC, one for pCO2, one for fCO2, one for CO2 and one for CO3. The number of datasets reporting each variable can be calculated from the CSC Flag. For example, with 10 datasets with CSC Flag [8] and 5 datasets with CSC Flag [9], the number for TA is 10, for DIC is 5 and for pH (total scale) is 15.
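The counting rule in this example can be sketched in Python as follows. This is only an illustration: `FLAG_VARIABLES` and `count_variables` are hypothetical names, and the mapping covers just flags 8 and 9, read as "TA and pH (total scale)" and "DIC and pH (total scale)" as the example implies.

```python
from collections import Counter

# Hypothetical mapping, inferred only from the example in this comment:
# flag 8 -> TA and pH (total scale), flag 9 -> DIC and pH (total scale).
FLAG_VARIABLES = {
    8: ("TA", "pH (total scale)"),
    9: ("DIC", "pH (total scale)"),
}

def count_variables(flag_per_dataset):
    """Count, per variable, how many datasets report it (one flag per dataset)."""
    counts = Counter()
    for flag in flag_per_dataset:
        # Flags missing from the mapping contribute nothing.
        counts.update(FLAG_VARIABLES.get(flag, ()))
    return counts

# 10 datasets with CSC Flag [8] and 5 datasets with CSC Flag [9].
flags = [8] * 10 + [9] * 5
counts = count_variables(flags)
print(counts["TA"], counts["DIC"], counts["pH (total scale)"])  # 10 5 15
```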

I am not sure whether this is clear to you. If it is still confusing, feel free to talk with Fred.

Many thanks and best regards, Yan

1637200991(1)

PBrockmann commented 2 years ago

Hi Yan,

Indeed, your request needs some clarification.

1) I have produced the following histogram without any feedback from you. Could you give me some?

image

It has been produced by inspecting the CSC Flag of each dataset.

There are many more CSC Flags than the 5 categories you have mentioned. Could you clarify which CSC Flags you want, using the vocabulary of the paper? DIC, TA and pH (total scale) alone are not terms that appear in the CSC Flags.

2) When multiple CSC Flags are detected, what do you want to do?

image

Please specify in which categories they have to be counted.

3) For datasets that have children, what do you want to do? Count the CSC flags of each child, or sum them up as one for the parent? In the latter case, the same problem arises as in question 2: what has to be done when the flags differ?


PROBLEM: Data set is of type parent, please select one of its child datasets
PANGAEA.901178 ['doi:10.1594/PANGAEA.901064', 'doi:10.1594/PANGAEA.901177', 'doi:10.1594/PANGAEA.901172', 'doi:10.1594/PANGAEA.901176']
PANGAEA.901064 ---> CSC flag: [8]
PANGAEA.901177 ---> CSC flag: [8]
PANGAEA.901172 ---> CSC flag: [8]
PANGAEA.901176 ---> CSC flag: [8]

====> 4 children with the same CSC flag. Counted as one at parent level or as 4 at children level?


PROBLEM: Data set is of type parent, please select one of its child datasets
PANGAEA.778456 ['doi:10.1594/PANGAEA.778451', 'doi:10.1594/PANGAEA.778449', 'doi:10.1594/PANGAEA.778453', 'doi:10.1594/PANGAEA.778450', 'doi:10.1594/PANGAEA.778454', 'doi:10.1594/PANGAEA.778448', 'doi:10.1594/PANGAEA.779688', 'doi:10.1594/PANGAEA.778455']
PANGAEA.778451 ---> CSC flag: [29]
PANGAEA.778449 ---> CSC flag: [29.0, nan]
PANGAEA.778453 ---> CSC flag: [29]
PANGAEA.778450 ---> CSC flag: [29.0, nan]
PANGAEA.778454 ---> CSC flag: [29]
PANGAEA.778448 ---> CSC flag: [29.0, nan, 15.0]
PANGAEA.779688 ---> CSC flag: Not available
PANGAEA.778455 ---> CSC flag: [29]

====> 8 children with different CSC flags, some of them with multiple flags. Tell me the rules you want me to use to count them.
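To make the question concrete, here is a minimal sketch of one possible rule, which this thread has not decided on at this point: clean each child's flag list (drop NaN, cast to integers) and take the union of the children's flags at parent level. The helper `clean_flags` and the `children` dictionary are hypothetical; only a few of the child entries from the log above are reproduced.

```python
import math

def clean_flags(raw_flags):
    """Return the set of integer flags, dropping NaN entries.

    None stands for a child whose CSC flag is "Not available".
    """
    if raw_flags is None:
        return set()
    return {
        int(flag)
        for flag in raw_flags
        if not (isinstance(flag, float) and math.isnan(flag))
    }

# A few children of PANGAEA.778456, copied from the log output above.
children = {
    "PANGAEA.778451": [29],
    "PANGAEA.778449": [29.0, float("nan")],
    "PANGAEA.778448": [29.0, float("nan"), 15.0],
    "PANGAEA.779688": None,  # "Not available"
}

per_child = {doi: clean_flags(flags) for doi, flags in children.items()}

# One option among others: the parent carries every flag seen in any child.
parent_flags = set().union(*per_child.values())
print(per_child)
print(sorted(parent_flags))  # [15, 29]
```

Counting at children level would instead use `per_child` directly, so the choice between the two levels changes the totals.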

All this log output comes from https://github.com/PBrockmann/PANGEAE_Scraping/blob/main/PANGAEA_getData_ex1.py.ipynb, which you can inspect as well.

Thanks, regards, Patrick

PBrockmann commented 2 years ago

See the latest notebook: https://github.com/PBrockmann/PANGEAE_Scraping/blob/main/PANGAEA_getData_ex1.py.ipynb

Yan-yang35 commented 2 years ago

The figure below looks great to me. There are just a few minor things I need your help to change:

  1. The title of the y axis should be "Percentage of datasets"
  2. Remove the black edge lines of the bars

Many thanks again!

image

PBrockmann commented 2 years ago

Here it is without edge lines.

image

This is the percentage of types of measurement, not the number of datasets. There are 1295 types of measurement from 1278 datasets where a CSC flag is available. Remember that some datasets have multiple CSC flags, so they are counted several times.
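The difference between the two denominators can be illustrated with a small sketch. The data below is invented for illustration and does not come from the notebook; `dataset_flags` is a hypothetical list holding one list of CSC flags per dataset.

```python
from collections import Counter

# Toy data (invented): one list of CSC flags per dataset.
# The last dataset has two flags, so it is counted twice.
dataset_flags = [[8], [8], [9], [15, 26]]

flag_counts = Counter(flag for flags in dataset_flags for flag in flags)
n_flag_entries = sum(flag_counts.values())  # 5 flag entries
n_datasets = len(dataset_flags)             # 4 datasets

# Dividing by the number of flag entries gives shares that sum to 100 %.
percent_of_entries = {f: 100 * c / n_flag_entries for f, c in flag_counts.items()}

# Dividing by the number of datasets gives shares that sum to more than 100 %
# as soon as one dataset carries several flags.
percent_of_datasets = {f: 100 * c / n_datasets for f, c in flag_counts.items()}

print(percent_of_entries)
print(percent_of_datasets)
```

With this toy input the second dictionary sums to 125 %, which is why labelling the axis "Percentage of datasets" would be misleading for counts obtained this way.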

Please discuss that with Fred.

Yan-yang35 commented 2 years ago

I have discussed this with Fred and we think it can be labelled "Percentage of measurements". Kindly let me know what you think.

PBrockmann commented 2 years ago

I think that "Percentage of types of measurement (%)" or "Percentage of measurement types (%)" would be clearer.

Yan-yang35 commented 2 years ago

I talked with Fred again and convinced him to show this figure as percentage of datasets. For datasets with multiple flags, please count them following the instructions in the attached file. For example, in data set 833621 with flags [15, 26], pH (other scale), TA and DIC were measured, so we count only 1 dataset for pH, 1 dataset for TA and 1 dataset for DIC. Many thanks again!

Mutiple flags count.xlsx

Yan-yang35 commented 2 years ago

Hi Patrick, any questions on the comment above?

PBrockmann commented 2 years ago

Yes, it is kind of you to have produced a spreadsheet, but it is not as useful as a proper algorithm. Here are the flags that Fred and you sent me.

# AT : flags 4, 8, 11, 13, 15, 24 and 26
# CT : flags 5, 9, 12, 14, 15, 25 and 27
# pHT : flags 1, 6, 7, 8, 9, 21
# pH (other scale) : flags 26, 27, 28, 29
# pCO2 : flags 21, 22, 23, 24, 25, 29

I have used them to categorize the datasets accordingly.

PBrockmann commented 2 years ago

Issue: I have found a 0 as a CSC flag in https://doi.pangaea.de/10.1594/PANGAEA.718250. All values are 15 except one at 0. I suppose it should be NaN.

PBrockmann commented 2 years ago

Another issue is the title of the histogram, which should not be "Number of datasets", since some datasets have multiple flags and so are counted in different categories. I would recommend staying with "Percentage of types of measurement (%)" or "Percentage of measurement types (%)".

I have taken into account datasets with multiple flags in the histogram plot seen in https://github.com/PBrockmann/PANGEAE_Scraping/blob/main/PANGAEA_getData_ex1.py.ipynb

Yan-yang35 commented 2 years ago

Issue: I have found a 0 as a CSC flag in https://doi.pangaea.de/10.1594/PANGAEA.718250. All values are 15 except one at 0. I suppose it should be NaN.

Yes, it should be NaN.

PBrockmann commented 2 years ago

Yes indeed. Shouldn't it be corrected (in the dataset)?

Yan-yang35 commented 2 years ago

Sorry, I know very little about programming, so it is not easy for me to express it as an algorithm. First, we should separate the datasets with only 1 CSC flag from those with multiple CSC flags. For datasets with only 1 CSC flag, there is no problem in following the rule below to categorize the datasets:

AT : flags 4, 8, 11, 13, 15, 24 and 26

CT : flags 5, 9, 12, 14, 15, 25 and 27

pHT : flags 1, 6, 7, 8, 9, 21

pH (other scale) : flags 26, 27, 28, 29

pCO2 : flags 21, 22, 23, 24, 25, 29

For datasets with multiple CSC flags, we need your help to find a way to count TA, pHT, pH (other scale) and pCO2 only once for each dataset. Maybe with some code like:

CSC_flag_2 = d2[d2['len'] >= 2]
For i in np.range(0, CSC_flag_2.size)
    If CSC_flag_2[i] = [4, 8, 11, 13, 15, 24, 26]
        Count 1 for TA and end

My code is incorrect, but the purpose is to avoid duplicates. pHT, pH (other scale), DIC and pCO2 would be counted in similar ways.

PBrockmann commented 2 years ago

I do not understand your point. As I wrote, I have already computed this with boolean operators, as below:

d2['CSC flag'].apply(lambda x: bool(set(x) & {4, 8, 11, 13, 15, 23, 24, 26}))

Datasets with multiple CSC flags are counted according to the categories that Fred and you described. Please inspect the code already written: https://github.com/PBrockmann/PANGEAE_Scraping/blob/main/PANGAEA_getData_ex1.py.ipynb

Yan-yang35 commented 2 years ago

Yes indeed. Shouldn't it be corrected (in the dataset)?

Yes, I will correct it in the dataset, but it will take some time. It is an old dataset that was not archived by me, so I have no access to reimport it; I will ask the PANGAEA staff to do it.

PBrockmann commented 2 years ago

Please test this snippet of code in any Python console. It uses a boolean test on set intersections to check whether a set of flags intersects another one (so a category is not counted twice, since I use the bool function).

# Flags that count towards each histogram category.
histoSets = {
    'AT':  {'values': {4, 8, 11, 13, 15, 23, 24, 26}},
    'pH':  {'values': {1, 6, 7, 8, 9, 21}},
    'pH (other scale)': {'values': {26, 27, 28, 29}},
    'CT': {'values': {5, 9, 12, 14, 15, 25, 27}},
    'pCO2': {'values': {21, 22, 23, 24, 25, 29}}
}

# Flags [8, 15]:
# Flag 8 is pH and AT
# Flag 15 is AT and CT
# Count 1 for pH, 1 for AT and 1 for CT.

# Different examples
x = [8, 15]
#x = [15, 8, 9]
#x = [26, 8]
#x = [9, 27]

# A category is True if at least one flag of x belongs to it;
# bool() ensures it counts once even when several flags match.
for key, category in histoSets.items():
    print(key, bool(set(x) & category['values']))

Yan-yang35 commented 2 years ago

Issue: I have found a 0 as a CSC flag in https://doi.pangaea.de/10.1594/PANGAEA.718250. All values are 15 except one at 0. I suppose it should be NaN.

It has now been corrected to NaN.

PBrockmann commented 2 years ago

Notebook and histogram plots updated: https://github.com/PBrockmann/PANGEAE_Scraping/blob/main/PANGAEA_getData_ex1.py.ipynb

1269 datasets.
1285 CSC flags.
3516 types of measurement (AT, pH, pH (other scale), CT, pCO2).
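For readers who want to see how the three kinds of totals relate, here is a minimal sketch under stated assumptions. It reuses the category sets from the snippet earlier in this thread, but the input `dataset_flags` is invented toy data and the variable names are hypothetical, so it does not reproduce the figures reported above and may differ from what the notebook actually does.

```python
# Category sets copied from the snippet earlier in this thread.
histo_sets = {
    "AT": {4, 8, 11, 13, 15, 23, 24, 26},
    "pH": {1, 6, 7, 8, 9, 21},
    "pH (other scale)": {26, 27, 28, 29},
    "CT": {5, 9, 12, 14, 15, 25, 27},
    "pCO2": {21, 22, 23, 24, 25, 29},
}

# Toy input (invented): one list of CSC flags per dataset.
# [15, 26] is the flag pair quoted for data set 833621 earlier in the thread.
dataset_flags = [[8], [15, 26], [9, 27], [29]]

n_datasets = len(dataset_flags)
n_flags = sum(len(flags) for flags in dataset_flags)

# A dataset counts at most once per category, whatever its number of flags.
datasets_per_category = {
    name: sum(bool(set(flags) & values) for flags in dataset_flags)
    for name, values in histo_sets.items()
}
n_measurement_types = sum(datasets_per_category.values())

percent_of_datasets = {
    name: 100 * n / n_datasets for name, n in datasets_per_category.items()
}

print(n_datasets, n_flags, n_measurement_types)
print(percent_of_datasets)
```

Because each dataset contributes at most one count per category, the per-category percentages here can legitimately be read as percentages of datasets, even though they sum to more than 100 %.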