dafenner / CrowdQCplus

Quality control (QC) for citizen weather station data.
GNU General Public License v3.0

Help with m5 #24

Closed HXP836 closed 2 years ago

HXP836 commented 2 years ago

Hi there,

I've been running CrowdQC+ on my undergraduate dissertation data and level m5 does not seem to be working for my data. Just wondering how I set keep_isolated = TRUE?

Best regards,

Harry

jkittner commented 2 years ago

Hi Harry, thanks for reaching out. It would be helpful to have a way to reproduce your issue. Could you provide some runnable example code and data and describe exactly what you expect to see? Then we can have a look at what the issue is and maybe suggest a solution or fix a potential bug.

HXP836 commented 2 years ago

Of course, no worries.

Here's the code I've been using to run the test; I've adapted it from the example workflow for CrowdQC+:

Year_data <- cqcp_padding(Year_data)
ok <- cqcp_check_input(Year_data)

if (ok) {
  Year_data_qc <- cqcp_qcCWS(Year_data) # QC
  n_data_qc <- cqcp_output_statistics(Year_data_qc) # output statistics
}

I'm not quite sure how to provide my data; it's a CSV file of Netatmo weather station readings but is quite large, as it covers a whole year.

When I run the CrowdQC+ test, all the levels work fine except m5, as I have only isolated stations with too few buddies. It suggests increasing the radius, decreasing the number of buddies, or setting keep_isolated = TRUE.

I was just wondering how to make any of these changes, as I'm not sure how to adapt my code to do so.

Thank you and hope this helps,

Harry

dafenner commented 2 years ago

Hi Harry, thanks for your question. It would be helpful if you could provide the output of cqcp_check_input(Year_data) to see the number of stations you have. Settings for QC level m5 somewhat depend on the number of stations you have and how far apart they are, i.e., the density of your network, and the default parameters might not be ideal for your case. In general, you can set the parameters for QC level m5 (actually, for all QC levels) in the function cqcp_qcCWS. So, e.g., to increase the radius in QC level m5 to 5000 m:

Year_data_qc <- cqcp_qcCWS(Year_data, m5_radius = 5000) #QC

Similarly, this works for the other parameters. Check out the help with

?cqcp_qcCWS

and

?cqcp_m5

Hope that helps!

HXP836 commented 2 years ago

Sure thing, here it is:

`Year_data_qc <- cqcp_qcCWS(Year_data) #QC
[CrowdQC+] QC level m5 could not meaningfully be performed with current configuration (only isolated stations with too few buddies). All flags m5 = FALSE. Consider increasing the radius, decreasing the number of buddies, or setting 'keep_isolated = TRUE'.

n_data_qc <- cqcp_output_statistics(Year_data_qc) # output statistics
++++++++++++++++++++++++++++++
+ CrowdQC+ output statistics +
++++++++++++++++++++++++++++++
Raw data: 1092058 values, 162 stations
QC level m1: 713640 values (= 65.35 % of raw data), 98 stations
QC level m2: 647941 values (= 59.33 % of raw data), 98 stations
QC level m3: 646142 values (= 59.17 % of raw data), 98 stations
QC level m4: 642037 values (= 58.79 % of raw data), 98 stations
QC level m5: 0 values (= 0.00 % of raw data), 0 stations
QC level o1: 0 values (= 0.00 % of raw data), 0 stations
QC level o2: 0 values (= 0.00 % of raw data), 0 stations
QC level o3: 0 values (= 0.00 % of raw data), 0 stations`

That's great, thank you so much. I'll check that out and see if it improves my outputs.

Cheers,

Harry

dafenner commented 2 years ago

Ok, I see that you have a maximum of 162 stations and that after QC level m1 you are already left with only 98. Interesting to see that so many stations have invalid/identical lat/lon values.
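
A rough way to look at this yourself, outside of the package, is to count how many distinct station IDs share each coordinate pair and to list stations with missing coordinates. This is only a sketch on toy data and not CrowdQC+'s own m1 code; the column names lat and lon are assumed, and the helper count_shared_coords is made up for illustration.

```r
# Toy data: one row per observation; stations "a" and "b" share coordinates,
# station "d" has a missing latitude.
obs <- data.frame(
  p_id = c("a", "a", "b", "c", "d", "e"),
  lat  = c(-37.81, -37.81, -37.81, -37.90, NA, -37.75),
  lon  = c(144.96, 144.96, 144.96, 145.10, 145.00, 145.20)
)

# Hypothetical helper: number of distinct stations per coordinate pair.
count_shared_coords <- function(d) {
  stations <- unique(d[, c("p_id", "lat", "lon")])
  complete <- stations[!is.na(stations$lat) & !is.na(stations$lon), ]
  counts <- aggregate(complete$p_id,
                      by = list(lat = complete$lat, lon = complete$lon),
                      FUN = length)
  names(counts)[names(counts) == "x"] <- "n_stations"
  counts
}

shared <- count_shared_coords(obs)

# Coordinate pairs used by more than one station
shared[shared$n_stations > 1, ]

# Stations with missing coordinates
missing_ids <- unique(obs$p_id[is.na(obs$lat) | is.na(obs$lon)])
missing_ids
```

On real data you would pass your own data set instead of obs; a large number of stations on one coordinate pair would be worth a closer look.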

To keep all (isolated) stations in QC level m5 you can always set:

Year_data_qc <- cqcp_qcCWS(Year_data, m5_keep_isolated = TRUE)

HXP836 commented 2 years ago

Yes that is interesting!

To investigate, I've switched around the lat and lon values because this data was collected in Melbourne, Australia, and I have a suspicion that the Netatmo weather stations record them the wrong way round.

I ran the test again and received this output:

`Year_data_qc <- cqcp_qcCWS(Year_data) #QC

n_data_qc <- cqcp_output_statistics(Year_data_qc) # output statistics
++++++++++++++++++++++++++++++
+ CrowdQC+ output statistics +
++++++++++++++++++++++++++++++
Raw data: 1048575 values, 154 stations
QC level m1: 722297 values (= 68.88 % of raw data), 102 stations
QC level m2: 655584 values (= 62.52 % of raw data), 102 stations
QC level m3: 653886 values (= 62.36 % of raw data), 102 stations
QC level m4: 649715 values (= 61.96 % of raw data), 102 stations
QC level m5: 184476 values (= 17.59 % of raw data), 43 stations
QC level o1: 191703 values (= 18.28 % of raw data), 43 stations
QC level o2: 122772 values (= 11.71 % of raw data), 43 stations
QC level o3: 44779 values (= 4.27 % of raw data), 24 stations`

Not sure why some stations go missing from the raw data though?

dafenner commented 2 years ago

Thanks.

> To investigate, I've switched around the lat and lon values because this data was collected in Melbourne, Australia, and I have a suspicion that the Netatmo weather stations record them the wrong way round.

That would be a little strange but easily identifiable via a GIS or other mapping application (by checking one or two stations).

> I ran the test again and received this output: Not sure why some stations go missing from the raw data though?

Me neither. If you simply renamed the columns, it shouldn't make a difference; the number of raw stations should stay the same.
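
Short of opening a GIS, a quick plausibility check is also possible in R: latitude can only lie between -90 and 90, and Melbourne sits at roughly 37.8° S, 145° E, so a "latitude" near 145 is a clear sign that the columns are swapped. A minimal sketch on toy coordinates, with lat and lon as assumed column names and looks_swapped as a made-up helper:

```r
# Hypothetical helper: TRUE if the "lat" values cannot be latitudes
# while the "lon" values would all be valid latitudes.
looks_swapped <- function(lat, lon) {
  lat <- lat[!is.na(lat)]
  lon <- lon[!is.na(lon)]
  any(abs(lat) > 90) && all(abs(lon) <= 90)
}

# Toy coordinates around Melbourne, once in the right order and once swapped
ok_order      <- data.frame(lat = c(-37.81, -37.90), lon = c(144.96, 145.10))
swapped_order <- data.frame(lat = c(144.96, 145.10), lon = c(-37.81, -37.90))

res_ok      <- looks_swapped(ok_order$lat, ok_order$lon)            # FALSE
res_swapped <- looks_swapped(swapped_order$lat, swapped_order$lon)  # TRUE
```

Note that this only works for places whose longitude lies outside the range -90 to 90, which is the case for Melbourne; elsewhere a look at a map is still needed.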

You could investigate the two data sets by checking:

p_id <- unique(Year_data$p_id)

Please check again and report back if the problem remains.
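
To go one step further than listing the IDs, the two sets can be compared directly. A minimal sketch on toy data; Year_data_orig and Year_data_swapped are placeholder names for the two versions of your data set:

```r
# Toy stand-ins for the two versions of the data set
Year_data_orig    <- data.frame(p_id = c("s1", "s1", "s2", "s3"))
Year_data_swapped <- data.frame(p_id = c("s1", "s2", "s2"))

ids_orig    <- unique(Year_data_orig$p_id)
ids_swapped <- unique(Year_data_swapped$p_id)

length(ids_orig)     # number of stations in the original data
length(ids_swapped)  # number of stations in the modified data

# Stations present in one data set but not in the other
only_in_orig    <- setdiff(ids_orig, ids_swapped)
only_in_swapped <- setdiff(ids_swapped, ids_orig)
only_in_orig
only_in_swapped
```

If both setdiff() results are empty, the two data sets contain the same stations and the difference must come from somewhere else.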

HXP836 commented 2 years ago

That's great, thank you.

I've checked the data sets with the original lat and lon values and then with them swapped round. The output reported 154 weather stations for each, suggesting that this is the correct value. I think the different number of weather stations may be due to some weather stations with the same ID having different lat and lon values in different months, which I found out in a previous analysis of the data.

Again, thank you for all your help. I think the CrowdQC+ package will be very beneficial for my dissertation!

Harry

dafenner commented 2 years ago

Glad to hear it is useful for you!

As an addition to your comment: Since Netatmo IDs stay the same even if the station moves to a new location (noticeable via changed/different coordinates), it is useful to assign a new ID (which you could, e.g., name p_id) to each unique set of Netatmo ID, lat & lon. If someone then moves the station and the coordinates change, that new time series gets a new unique ID. That way you keep track of changing metadata and do not mix, in the worst case, data of the same Netatmo ID from very different locations (within a city). When you do this check, you could check whether a station was moved within a radius of maybe 20-50 meters (GPS uncertainty or moving of a station on the same property) or more (to an entirely new location). Netatmo does not keep track of changing metadata, so it's our job to do so. ;-)
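
One possible way to do this in base R is sketched below on toy metadata. The column name netatmo_id, the helper haversine_m, and the 50 m threshold (taken from the upper end of the range mentioned above) are assumptions for illustration, not part of CrowdQC+.

```r
# Toy metadata: Netatmo ID "n1" shows up with three different coordinate sets.
meta <- data.frame(
  netatmo_id = c("n1", "n1", "n1", "n2"),
  lat = c(-37.8100, -37.8101, -37.8500, -37.9000),
  lon = c(144.9600, 144.9600, 145.0000, 145.1000)
)

# New ID for each unique combination of Netatmo ID, lat and lon.
key <- paste(meta$netatmo_id, meta$lat, meta$lon, sep = "_")
meta$p_id <- match(key, unique(key))

# Hypothetical helper: great-circle (haversine) distance in metres,
# using a mean Earth radius of 6371 km.
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Distance of each "n1" location from the first recorded "n1" location.
n1 <- meta[meta$netatmo_id == "n1", ]
n1$moved_m <- haversine_m(n1$lat[1], n1$lon[1], n1$lat, n1$lon)

# Within 50 m: probably GPS uncertainty or a move on the same property.
n1$same_site <- n1$moved_m <= 50
n1
```

Whether locations closer than the threshold should keep a common p_id or get separate ones is a judgment call that depends on your analysis.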

dafenner commented 2 years ago

Closing the issue (for now), let us know if you need additional support when using the package, @HXP836.