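The PronaSolos dataset contains many observations with exactly the same geographic coordinates, yet the soil property values differ from one observation to the next. This is a problem for model training: any covariate derived from location is identical for these observations, while the values to be predicted are not.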
The apparent duplication of observations results from the way the PronaSolos team handled observations that lacked coordinates. According to the project documentation, such observations were placed at the center of the polygon of the municipality in which they were reported to have been sampled. As a result, two or more observations sampled in the same municipality ended up at exactly the same location.
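To get a sense of how widespread the shared coordinates are, something like the following could be used. It is only a minimal sketch with pandas: the column names (`coord_x`, `coord_y`, `clay`), the helper `flag_shared_coordinates`, and the toy values are all hypothetical and do not come from the PronaSolos files.

```python
import pandas as pd


def flag_shared_coordinates(df, x_col="coord_x", y_col="coord_y"):
    """Return a copy of df with a boolean column marking rows whose
    coordinate pair also occurs in at least one other row.

    The column names are assumptions made for this illustration.
    """
    out = df.copy()
    # keep=False marks every member of a group of repeated coordinates,
    # not only the second and later occurrences.
    out["shared_location"] = out.duplicated(subset=[x_col, y_col], keep=False)
    return out


# Toy data (made up): three observations share one coordinate pair
# but have different clay contents.
obs = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "coord_x": [-47.93, -47.93, -47.93, -43.21, -51.10],
        "coord_y": [-15.78, -15.78, -15.78, -22.90, -30.03],
        "clay": [350, 520, 410, 280, 600],
    }
)

flagged = flag_shared_coordinates(obs)

# Number of observations per coordinate pair, largest groups first.
group_sizes = (
    flagged.groupby(["coord_x", "coord_y"]).size().sort_values(ascending=False)
)
print(flagged)
print(group_sizes)
```

Note that a flag like this only finds locations shared by two or more observations. An observation that was moved to a municipality center but happens to be the only one in its municipality would not be flagged, so the check cannot separate relocated observations from the rest.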
I think the best approach is to remove the PronaSolos dataset from the analysis, as we cannot easily identify which observations went through this process.