BiologicalRecordsCentre / ABLE

Assessing ButterfLies in Europe project repository

15-minute count KML downloads updates #374

Open DavidRoy opened 3 years ago

DavidRoy commented 3 years ago

divide the kml into two files:

johnvanbreda commented 3 years ago

@DavidRoy before I do this, the occurrences KML download will need to be driven from PostgreSQL rather than Elasticsearch, as we don't currently have an Elasticsearch -> KML option in the code. There is therefore a performance implication to this change. Implementing Elasticsearch -> KML is possible, I'm sure, but would not be a trivial task.

DavidRoy commented 3 years ago

@johnvanbreda how much work to implement an ES option for this? This is not needed immediately

johnvanbreda commented 3 years ago

1 to 2 days.

DavidRoy commented 3 years ago

Let's leave this job for the time being. Can you action the other changes to the existing KML download? Does this KML download come from ES?

johnvanbreda commented 3 years ago

@DavidRoy No, the KML only comes from PostgreSQL at the moment.

On the user download page, I've renamed the id column to visit_sample_id in the samples KML file and added an occurrences KML file (will have to hope the performance is OK as data volumes grow for now). Are these changes OK? If so I'll apply them to the main downloads page.

DavidRoy commented 3 years ago

Thanks John. @CrisSevilleja can you test please

CrisSevilleja commented 2 years ago

Sorry, I didn't test this improvement. @chrisvanswaay downloaded 15-min count data and the occurrence kml file is good, with all the needed attributes. However, he is missing a few attributes in the track kml of 15-min counts:

Another question, @DavidRoy: with the kml of the 15-min counts, are we downloading all the routes done (with or without occurrences)? It is important to know which routes have zero values.

chrisvanswaay commented 2 years ago

Hi all, on 1 Oct last year I emailed: probably the problem would be solved if the .csv would only contain the butterfly data (so it would get smaller), and there would be a separate csv with weather, recorder etc. per sampleid (as the sampleid is already in the kml as well as in the csv), including such data on transects without butterflies, simply for all transects.

Let me try to explain. I need two files for the assessment of the 15min counts:

  1. A file with the transect information: sampleID, date, time, user, geom, weather. This can be a kml, shape, csv with the geom info as json-string or whatever. This should also contain all transects where nothing was seen.
  2. A csv with the butterfly records, linking to the file above with a sampleID, and then species, date, time, lat, long, number and the other details.

I think that with these two files I can do everything. Hope this is doable.
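A minimal sketch of how the two files described above could be combined, assuming hypothetical column names (`sample_id`, `species`, `number`) rather than the actual download headers. It illustrates why the transect file has to list every count: totals of zero only appear when the samples, not the occurrences, drive the loop.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature versions of the two files; real column names may differ.
SAMPLES_CSV = """sample_id,date,recorder
101,2019-07-16,A
102,2019-07-16,A
103,2019-07-30,B
"""

OCCURRENCES_CSV = """sample_id,species,number
101,Pieris rapae,3
101,Aglais io,1
103,Pieris rapae,2
"""


def totals_per_sample(samples_text, occurrences_text):
    """Return {sample_id: total individuals}, keeping samples without records as 0."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(occurrences_text)):
        totals[row["sample_id"]] += int(row["number"])
    # Drive the result from the samples file so that empty counts are kept.
    return {
        row["sample_id"]: totals.get(row["sample_id"], 0)
        for row in csv.DictReader(io.StringIO(samples_text))
    }


print(totals_per_sample(SAMPLES_CSV, OCCURRENCES_CSV))
```

Sample `102` has no rows in the occurrences file, yet it still appears in the result with a total of 0.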

DavidRoy commented 2 years ago

@chrisvanswaay don't we have this already? Via Scheme Admin - Downloads. You have access to everything and a role for the Netherlands. So you need to select 'Dataset' first via the drop-down option. The 'sample' file includes 15-minute counts with no butterflies counted.

In our terminology, number 1 on your list is samples and number 2 is occurrences.

chrisvanswaay commented 2 years ago

Hi David, I downloaded three (out of five) files (for some analysis on the Madeira data). These were:

timed count occurrences: Contains 19 fields. They look like observations to me, but they also contain data on the whole transect, such as start and end time (I guess of the transect, not of the individual observation) and weather data:

timed count sample data: This contains the sample data I was looking for. However, I am not sure that all transects (including the ones with NO butterfly observations) are included, as I remember discovering in earlier trials that they were missing. Can someone confirm whether 'empty' samples are also in?

tracks and areas.kml: In Madeira the kml file was the only one that had all the tracks. However, it only has three fields, and the second one (Description) is always empty. The first field makes no sense and is not connected to anything else (records.1). There is a geometry, however.

I was not interested in the single-species transect or the gpx.

If we can be sure that all transects are in timed count sample data, and that timed count occurrences has all observations, then there is no direct function for the kml (I think). Is that correct?

DavidRoy commented 2 years ago

@chrisvanswaay some users prefer .kml and .gpx downloads. I confirm the 'empty' samples are included - you can see that there are samples with 0 taxa and 0 individuals. @johnvanbreda can you confirm whether the .kml includes all samples (including those with no occurrences)? Can you also check that the included fields make sense? If these GIS downloads don't include samples with no occurrences then we should remove them.

chrisvanswaay commented 2 years ago

Thanks David, you are right, gpx and kml will be very useful for many. I now know I can restrict myself to downloading the two csv files (packed in a zip), as they have all the data I need.

johnvanbreda commented 2 years ago

@DavidRoy the .kml file does include all samples - there is no link to the occurrences data in the report used. The fields included are the sample ID of the visit, the date and the grid ref or lat/long point, so they do make sense though are very limited. The one quirk is that the date information includes separate columns for the date start, end and type as well - these are hidden fields in the report but the KML downloader does not seem to respect the hidden flag on a report column.

I cannot see a description attribute in the KML download - I wonder if that is added by whatever tool is being used to view the KML file?
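The hidden-column quirk mentioned above can be illustrated with a small sketch. This is not the report or downloader code used by the site (which is not shown in this thread). It is a hypothetical Python illustration, with made-up column definitions, of an exporter that skips columns flagged as hidden when writing KML `ExtendedData`. Geometry is left out to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical report column definitions; "hidden" mimics the report's hidden flag.
COLUMNS = [
    {"name": "visit_sample_id"},
    {"name": "date"},
    {"name": "date_start", "hidden": True},
    {"name": "date_end", "hidden": True},
    {"name": "date_type", "hidden": True},
]


def build_placemark(row, columns):
    """Build a KML Placemark element, skipping columns flagged as hidden."""
    placemark = ET.Element("Placemark")
    extended = ET.SubElement(placemark, "ExtendedData")
    for column in columns:
        if column.get("hidden"):
            continue  # respect the hidden flag instead of exporting every column
        data = ET.SubElement(extended, "Data", name=column["name"])
        ET.SubElement(data, "value").text = str(row[column["name"]])
    return placemark


# Hypothetical row; the ID and dates are invented for illustration.
row = {
    "visit_sample_id": 111,
    "date": "2019-07-16",
    "date_start": "2019-07-16",
    "date_end": "2019-07-16",
    "date_type": "D",
}
print(ET.tostring(build_placemark(row, COLUMNS), encoding="unicode"))
```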

chrisvanswaay commented 2 years ago

Just to be sure: are the times in the sample file local times? That would mean that a sample starting at 06:00:00 really is very early, and not 06:00:00 GMT (which is 08:00:00 on the European mainland in summer).
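The arithmetic behind this question, as a short sketch. It assumes Central European Summer Time as a fixed UTC+2 offset and uses an arbitrary summer date. It only shows what 06:00:00 would mean if the stored value were GMT, and says nothing about what the download actually contains.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Central European Summer Time modelled as a fixed UTC+2 offset.
CEST = timezone(timedelta(hours=2), "CEST")


def utc_to_cest(clock_time, day="2019-07-16"):
    """Interpret a HH:MM:SS string as UTC on the given day and return the CEST clock time."""
    utc_dt = datetime.strptime(f"{day} {clock_time}", "%Y-%m-%d %H:%M:%S")
    utc_dt = utc_dt.replace(tzinfo=timezone.utc)
    return utc_dt.astimezone(CEST).strftime("%H:%M:%S")


print(utc_to_cest("06:00:00"))  # 08:00:00
```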

chrisvanswaay commented 2 years ago

I notice that in the download file timed count occurrences.zip (for all the data), which contains >100000 records of butterflies, the Visit Sample ID is missing for 20000 records or so, meaning I cannot link them to the length and location of a 15 minute count (from timed count sample data.zip). These are 'old' records. I always thought that for the 'old' records we only collected the total transect (as geom) and the butterflies on them, not the exact location of the records, but this seems to be the other way around. Actually then the solution could have been to add the geom for the whole transect to the observation ("we don't know where exactly this butterfly was seen, but it was somewhere on this transect"). But no Visit Sample ID, but exact locations for the butterfly, is new to me. Now I understand things have changed a few times, but we should find a way to deal also with the 'old' records. Or am I missing something?

DavidRoy commented 2 years ago

@johnvanbreda can you investigate this point about missing SampleID from earlier data. Is this because we switched from a sample->occurrences data structure to a sample->sub-samples->occurrences structure?

johnvanbreda commented 2 years ago

@DavidRoy it seems there is a mixture of data where we have a parent sample ID (visit) containing a list of precise samples and where the occurrences are attached to a sample with no parent. I don't think it's as simple as old vs new data though as I can see data from all versions of the app which have no parent sample, though it seems the use of parent samples only appears in data from version 1.10 onwards. Is there something about the app which causes it to behave in 2 different ways?

johnvanbreda commented 2 years ago

@chrisvanswaay the times in the download are as reported by the mobile application. I believe this to be the time on the device but @kazlauskis can confirm?

DavidRoy commented 2 years ago

@johnvanbreda yes, the app functionality changed. The original survey design was for each occurrence to be attached to a sample (with an associated GPS track or a user-drawn polygon). The current survey design is for each occurrence to have its own sub-sample (and point location), and for these sub-samples to be within a sample (with GPS track or polygon).

johnvanbreda commented 2 years ago

There do seem to be some samples without a parent sample even in the data generated by the latest app version, e.g. sample IDs 18083646, 18108270, 18089546, 18136744. Though, I think what is happening is the structure is still 2 levels, but there are occurrences that are attached to both the parent and the child samples. So, some of the occurrences only have the polygon/GPS track of the visit as their locality, whereas some have a full precision point.

Currently the timed count samples download is designed to only include samples where there is a parent sample. That means occurrences which point to a top-level sample ID from new versions of the app, or occurrences that point to a single level structure from old versions of the app, will have a sample ID that is not in the samples file. Therefore I propose including all samples in the samples download, so including parent (visit) + child (sample point), or single-level samples from old data. Is that OK @DavidRoy?

The tracks and areas KML/GPX downloads only include samples where there is not a parent sample. This means the file includes the early single level data as well as the visit parent samples from the newer data, so I think that is OK. @chrisvanswaay note that if you are matching from the occurrences file to the tracks and areas file, if the Visit Sample ID is missing for an occurrence, you should still be able to find the Point Sample ID's sample in the tracks and areas file.

kazlauskis commented 2 years ago

@johnvanbreda If the sub-sample fails to capture a precise location (timeout, under a bridge etc.), the occurrence is attached directly to the top parent sample on submission. This can result in a survey where some occurrences have the 2 sample levels above them and some have only 1.

DavidRoy commented 2 years ago

@johnvanbreda I agree with your suggestion to extend the download to include parent (visit) + child (sample point) etc. The key thing is that users can download everything from the survey, even if they have to join across download files (samples and occurrences).

chrisvanswaay commented 2 years ago

Thanks to all, I think we are getting closer. @johnvanbreda the kml is still unclear to me. It looks like:

      Name      Description  geometry
    1 records.1              LINESTRING Z (-1.354157 53....
    2 records.2              POLYGON Z ((-1.111025 51.60...
    3 records.3              LINESTRING Z (-1.115607 51....
    4 records.4              LINESTRING Z (-1.110211 51....

and simply has 7332 records. How can I link these to butterfly records?

I noticed that in the butterfly record table the 'old' data (where there is no Visit Sample ID) the records are summarized at a point (I guess the centroid or so). Now this fits with the fact that in those days the separate observations were not recorded with exact location, but I would like to link these to the LINESTRING. How should I do that?
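One possible way to look for linkable attributes in the KML is to read the file directly rather than through a GIS reader that only surfaces Name and Description. The sketch below assumes the download stores attributes such as `visit_sample_id` as KML `ExtendedData`/`Data` elements, which this thread does not confirm. The embedded KML and its ID are invented for illustration.

```python
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# Invented example document; the real download may be structured differently.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>records.1</name>
      <ExtendedData>
        <Data name="visit_sample_id"><value>111</value></Data>
      </ExtendedData>
      <LineString><coordinates>-1.35,53.10,0 -1.36,53.11,0</coordinates></LineString>
    </Placemark>
  </Document>
</kml>
"""


def placemark_attributes(kml_text):
    """Return one dict per Placemark holding its name and any ExtendedData values."""
    root = ET.fromstring(kml_text)
    rows = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        row = {"name": placemark.findtext("kml:name", default="", namespaces=KML_NS)}
        for data in placemark.iterfind("kml:ExtendedData/kml:Data", KML_NS):
            row[data.get("name")] = data.findtext("kml:value", default="", namespaces=KML_NS)
        rows.append(row)
    return rows


print(placemark_attributes(SAMPLE_KML))
```

If the real file turns out to contain no `ExtendedData`, the function simply returns the names, which would confirm that the KML carries no ID to join on.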

johnvanbreda commented 2 years ago

I've made a change so that the downloaded attribute columns (e.g. time) are filled in from the child sample or the parent sample depending on the structure, so there will be fewer blanks.

@DavidRoy I'd read the code incorrectly - the samples download and the track downloads both enforce the same filter, including only samples which do not have a parent. This means the following:

  1. For old data, the occurrence won't have a Visit Sample ID, but the Point Sample ID should link to a Visit Sample ID in the samples or tracks download file.
  2. For new data where the occurrence has a point sample within the visit sample, the Visit Sample ID column should be populated and will point to a visit sample in the samples or tracks download file.
  3. For new data where an occurrence fails to capture a precise location, then it is similar to point 1 for old data - the Point Sample ID will actually point to the Visit Sample ID.

In effect, although the column title is Point Sample ID, this will be the same thing as the visit sample ID where the occurrence is attached at the top level. So all rows can be linked across to the samples/tracks files by using the Visit Sample ID (if available) or the Point Sample ID (if Visit Sample ID not available). @DavidRoy does this mean there is no need to expand the samples download to include all the child samples? If we do that it will expand the size of the file quite a lot and it's not necessary if the user understands this logic.
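The linking rule above (use the Visit Sample ID if available, otherwise the Point Sample ID) can be sketched as follows. The column titles `Visit Sample ID` and `Point Sample ID` come from this thread. The IDs, species rows and the shape of the `samples` lookup are invented for illustration.

```python
def link_key(occurrence):
    """Return the sample ID to look up in the samples/tracks download for one occurrence row."""
    visit_id = (occurrence.get("Visit Sample ID") or "").strip()
    if visit_id:
        return visit_id
    # Old data, or new data without a precise location: the point ID is the visit ID.
    return (occurrence.get("Point Sample ID") or "").strip()


# Hypothetical lookup built from the samples (or tracks) file, keyed by sample ID.
samples = {"500": "visit with point sub-samples", "600": "single-level visit"}

# Hypothetical occurrence rows.
occurrences = [
    # New data: point sample 501 sits inside visit 500.
    {"species": "Pieris rapae", "Visit Sample ID": "500", "Point Sample ID": "501"},
    # Old data or failed GPS fix: the occurrence is attached at the top level.
    {"species": "Aglais io", "Visit Sample ID": "", "Point Sample ID": "600"},
    # An occurrence whose sample is missing from the samples file.
    {"species": "Vanessa atalanta", "Visit Sample ID": "", "Point Sample ID": "700"},
]

for occ in occurrences:
    key = link_key(occ)
    print(occ["species"], "->", key, "->", samples.get(key, "no matching sample"))
```

Rows that end up with "no matching sample" are the ones worth reporting back, since under this logic every occurrence should resolve to a visit.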

@chrisvanswaay this should explain how to make the link from the occurrences to the KML file but let me know if still not clear.

chrisvanswaay commented 2 years ago

Thanks @johnvanbreda, I think I get it. I will now try to implement this; let's see if I manage.

chrisvanswaay commented 2 years ago

@johnvanbreda I gave it a check and downloaded the timed-count occurrences and the timed-count sample data (All data).

When I checked my own data, I noticed my first sample is from 30-7-2019, so after my holiday in Italy, when I did quite a lot of 15 min counts. Those transects now seem not to be available (but they have existed). I also can't find older data on https://butterfly-monitoring.net/mydata/samples, also there 30/7/2019 seems to be the oldest transect.

When I check the occurrences, however, I can find the observations from the trip to Italy; the oldest are from 16/7/2019. There is a pointid (e.g. 6172680), but there is no transect linked to that in the sample file. The observations seem to be summarized (at the centroid or starting point of the transect?).

So not everything seems solved.

PS As far as I am concerned, I can live with these transects missing, but there will be more. It would be good to get as much data back as possible.