BiologicalRecordsCentre / ABLE

Assessing ButterfLies in Europe project repository

Downloads exports - % Sun #493

Closed xaviermestdagh closed 1 year ago

xaviermestdagh commented 1 year ago

Hi, what is the meaning of, for instance, the value "20; 20", and why is it sometimes just "20" for some records (those with a missing Transect Code value)? [image]

JimBacon commented 1 year ago

Hi,

I found some documentation which says

Note that if requesting an event attribute value, the parent event's attribute values will also be included in the output, so when requesting an attribute value it is not necessary to know if the value will be stored at the event or parent level.

This leads me to believe that "20; 20" represents the percentage sun for the transect section combined with the overall percentage sun for the walk. I haven't yet worked out which order they are in.

For the anomalous records, I find they have two entries in the ElasticSearch index, e.g. [image]

This is wrong. Once this has been fixed (which I also don't know how to do yet) I think all records will show the two sun values.

JimBacon commented 1 year ago

No, it was me that was wrong. There are two versions of the record because it is marked as sensitive. I think one version will contain full details of the record while the other may hide some. Whether it is intentional to hide the parent attributes, I am not sure.

JimBacon commented 1 year ago

I'm thinking this is probably a bug. Both list_for_elastic_all.xml and list_for_elastic_sensitive_all.xml provide parent_sample_attrs_json to LogStash. However, while indicia_support_files\Elasticsearch\logstash-config\occurrences-http-indicia.conf processes this into a JSON field which ends up in event.parent_attributes, occurrences-http-indicia-sensitive.conf does nothing with it, so it ends up as text in a field called parent_sample_attrs_json of the ElasticSearch index.

Similarly, the Transect code is missing from the download because occurrences-http-indicia-sensitive.conf does nothing with recorded_parent_location_code so it ends up as text in a field of the same name rather than being known as location.parent.code. Likewise recorded_location_code is also not renamed.
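To make the missing step concrete, here is a minimal Python sketch of the transformation that the sensitive pipeline appears to skip. It is not the LogStash configuration itself. The input field names come from the description above, while the function name and the shape of the JSON payload are assumptions made for illustration. The target name for recorded_location_code is not stated here, so that field is left untouched.

```python
import copy
import json


def normalise_sensitive_doc(doc):
    """Return a copy of doc with the parent attributes parsed and the parent code renamed.

    Hypothetical helper: it only mirrors the behaviour described for the
    non-sensitive pipeline, it is not taken from the Indicia code base.
    """
    out = copy.deepcopy(doc)

    # Parse the JSON text into a structured event.parent_attributes field.
    raw = out.pop("parent_sample_attrs_json", None)
    if raw:
        out.setdefault("event", {})["parent_attributes"] = json.loads(raw)

    # Rename recorded_parent_location_code to location.parent.code.
    code = out.pop("recorded_parent_location_code", None)
    if code is not None:
        out.setdefault("location", {}).setdefault("parent", {})["code"] = code

    return out


# The payload shape below is an assumption, used only to exercise the helper.
example = {
    "id": 1,
    "parent_sample_attrs_json": '[{"id": "1387", "value": "20"}]',
    "recorded_parent_location_code": "T1",
}
fixed = normalise_sensitive_doc(example)
```

After this step a download that reads event.parent_attributes and location.parent.code would find values for sensitive records too, which is what the re-index would need to achieve.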

The download is designed to extract records at full precision so no details of an EBMS record should be hidden.

@johnvanbreda This is unfamiliar territory for me. I'm pretty confident I've found the issue: the LogStash configuration for sensitive records has not been kept up to date with changes elsewhere. We've picked up 3 fields here which need correcting. Maybe there are others you will know about. I guess we will then need to re-index all sensitive records.

CrisSevilleja commented 1 year ago

Getting back to this issue (previously described in #489): this %Sun value causes problems for users downloading the CSV files, because it includes a semicolon between the two sun values. I don't fully understand the bug that you found, @JimBacon, and in any case I don't understand why two sun values are included in the same column.

I would like to separate the two values in "20; 20" into two columns, one with the percentage of sun for the transect section and another with the overall percentage of sun for the walk.

Working around the semicolon when opening the downloaded file may look simple, but typical eBMS users don't have much data-management experience. If we can remove the semicolon from the download files, users will be able to open files downloaded from eBMS directly in Excel.

@johnvanbreda was mentioned before regarding this issue.

CrisSevilleja commented 1 year ago

The Slovenian coordinator is checking data from 2022 and found the way %Sun is presented in the download file strange:

"In the Excel file (you downloaded the data for me), for sun (%) it's written "60;80" and for cloud (%) "40". But if I take a look into the database, the data show 80 % of sun for sections 1-5 and 60 for section 6. So, what does the Excel file show in the columns "sun" and "cloud" - how can one read the data from Excel without being able to see the data entry in the eBMS app?" [image]

In this case, where the value differs between sections, it is not reported well and it is confusing. I would like to put the correct value against each section and put the overall percentage of sun in a separate column, to avoid two values in the same column. At the moment users don't know what the two values mean. Any idea how I can answer the last question from the Slovenian coordinator?

Thank you in advance. @JimBacon @DavidRoy

larspett commented 1 year ago

I totally support this comment; Harriet and I discussed it yesterday. The %sun and %cloud values are very confusing to our users. I understand it could be that the transect is sunny while most of the sky is cloudy, but is the general cloudiness important relative to how sunny the actual transect walk is? Getting two values that don't necessarily sum to 100 is one part of the confusion. If they are designed to sum to 100, do we need both of them?

DavidRoy commented 1 year ago

@JimBacon The requirement here is:

  1. For the counts download the %sun should be for the sub-sample (section) and be a single value
  2. For the visit download the %sun field value is for the sample (transect). Again, a single value.

@larspett we have both %sun and %cloud because schemes have different guidance on which to record! They auto calculate depending on what’s entered and sum to 100. So we need to keep for ebms but can simplify the same form on the spring website

larspett commented 1 year ago

@DavidRoy having both is rather confusing. I would prefer a toggle between them or scheme-dependent visibility, depending on which is the easiest solution. You could lock the visibility in a scheme-dependent fashion in Drupal.

xaviermestdagh commented 1 year ago

Hello @JimBacon and @kazlauskis (sorry, I don't know whether this is directly related to this issue). An observer (Youri Martin) using the mobile app on iPhone pointed out that the %Cloud at sample level (not section level) is not kept after uploading to the web app (the %Sun field is empty when editing). On my side, when I make an export with "Scheme admin > Downloads", neither the %Sun nor the %Cloud column is included. In the export through "My annual report", the %Sun column is empty but values are provided in the %Cloud column (which means the data exists somewhere...). See for instance the 2022 walks on EBMS:Luxembourg:166 (Haardt sud).
Thinking the info was lost, the observer has already opened many other walks one by one on the web app to add the missing value...

kazlauskis commented 1 year ago

@Vilius-Stankaitis can you check if the app uploads the Cloud values and if it does, then add the warehouse attribute ID here for Jim?

Vilius-Stankaitis commented 1 year ago

Yes, the app uploads only Cloud values

Section level seems to be working: [image: Screenshot 2023-05-08 at 16.11.45.png]

Sample level not working, but the cloud value was uploaded: [image: Screenshot 2023-05-08 at 16.13.18.png]

cloud warehouse attribute id: 1457

DavidRoy commented 1 year ago

@Vilius-Stankaitis can the app also submit the %sun value (= 100 - %cloud)? That seems simpler than a process that runs on the warehouse.
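As a minimal sketch of that derivation, assuming both values are percentages on a 0 to 100 scale (the function name is hypothetical and this is not the app's code):

```python
def sun_from_cloud(cloud_pct):
    """Return % sun as the complement of % cloud (both on a 0-100 scale)."""
    if not 0 <= cloud_pct <= 100:
        raise ValueError(f"% cloud must be between 0 and 100, got {cloud_pct}")
    return 100 - cloud_pct
```

Deriving one value from the other at submission time means the two can never disagree in what the app sends.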

kazlauskis commented 1 year ago

Yes, it would be easy to upload the sun value too. It's a bit strange that the survey needs both values, though.

DavidRoy commented 1 year ago

Agreed, the difficulty is that some schemes use %cloud and some use %sun. The UK use sun as they are happier reporting sun; Spain use cloud as they are happier reporting cloud :-)

JimBacon commented 1 year ago

Still need to resolve the download issues, I think.

JimBacon commented 1 year ago

There is an additional problem to resolve before we can output %Sun and %Cloud at the top level (I'll call it walk level, like the website does).

The website and the warehouse have been set up to record %Sun. It looks like, for a couple of years, the app has been submitting %Cloud, although it is now submitting %Sun and %Cloud as requested above.

What this means, as noted previously, is that if you edit an older app record on the website, the %Sun field is empty and the %Cloud is not shown. If you enter a value for %Sun there is a good chance it won't be (100 - %Cloud) and we get an inconsistency in our data. (It was a bad idea to store both sun and cloud - we should have stored one and deduced the other so there would never be a consistency issue.)

I can see three possible resolutions.

  1. We could stop bothering with recording %Sun/%Cloud at the walk level and calculate values for output as an average of the values recorded at the section level. This is actually what is attempted by the My Data > My Transect Reports > My Annual Report > Downloads > Samples in the 'Mean % Sun' column. (That was broken but I just fixed it.) The input fields could be removed from app and website and we could ignore the data recorded in them to date. By only having the attributes at the section level, the problem with the download showing %Sun like 60;80 (Section %Sun; Walk %Sun) would go away.
  2. I could update the database to create a %Sun value for every walk where there is only a %Cloud. The website could remain the same and %Cloud could be calculated as (100 - %Sun) where it is needed. Values of %Cloud in the database would be deleted as redundant. The app would stop sending %Cloud at the walk level (though retaining it in the user interface).
  3. I could update the database to create a %Sun value for every walk where there is only a %Cloud. I could then update the website to add a %Cloud field and add code to ensure that the two values are always consistent in the same way as happens at the section level. If we wanted extra work, we could allow the user to hide either %Sun or %Cloud, according to their preference, to tidy up the appearance. The app would be unchanged. By having %Cloud for both walk and section we would start to see the download showing both values in one column, like 40;20, although I'd obviously have to fix that like I currently need to for %Sun.
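A rough Python sketch of the calculations behind options 1 and 2, to make the trade-off concrete. The function names and the dictionary keys "sun" and "cloud" are assumptions for illustration, not the warehouse schema, and rounding to one decimal place is an arbitrary choice.

```python
def mean_section_sun(section_values):
    """Option 1: walk-level % sun as the mean of the recorded section values."""
    values = [v for v in section_values if v is not None]
    if not values:
        return None  # nothing recorded at section level
    return round(sum(values) / len(values), 1)


def backfill_walk_sun(walk):
    """Option 2: return a copy of walk with 'sun' derived where only 'cloud' exists."""
    filled = dict(walk)
    if filled.get("sun") is None and filled.get("cloud") is not None:
        filled["sun"] = 100 - filled["cloud"]
    return filled


# Sections 1-5 at 80 % sun and section 6 at 60 %, as in the Slovenian example.
walk_mean = mean_section_sun([80, 80, 80, 80, 80, 60])  # 76.7
old_app_walk = backfill_walk_sun({"sun": None, "cloud": 40})
```

The sketch leaves existing %Sun values alone and does not delete %Cloud, so the clean-up step described in option 2 would still be a separate decision.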

@DavidRoy, what is your preference?

By the way, I guess this has gone undetected because %Sun and %Cloud are not present in either Scheme Admin > Downloads > Download Sample Information from Transects or My Data > My Downloads > Download Sample Information from Transects so I could ask whether it is needed at all!

DavidRoy commented 1 year ago

@chrisvanswaay @CrisSevilleja what is your view on this? An alternative is to remove the %sun (or %cloud) from the transect sections data entry and just have it at the walk level. This was inherited from the UKBMS but I've never seen this data analysed, so I wonder what the point of collecting it is! What is done in the Netherlands and in the eBMS manual?

chrisvanswaay commented 1 year ago

@DavidRoy We only collect weather data at the visit level for the whole transect, and that sounds like the logical thing for the 15 min counts.

DavidRoy commented 1 year ago

@chrisvanswaay we already collect at the sample/visit level for 15 min counts. This discussion is about the transects. Do you agree that we should remove the %sun requirement at the section (sub-sample) level (after checking with the schemes)?

chrisvanswaay commented 1 year ago

@DavidRoy Sorry for the misunderstanding, I had just arrived at the hotel (in Bayreuth) and did not have enough time to go through the whole discussion.

JimBacon commented 1 year ago

Added a fix to allow parent (Walk) and child (Section) attributes to be output independently rather than in the same column.

Changed configuration of occurrence download for Scheme Admin>Downloads and My Data>My Downloads, replacing

  {"caption":"% Sun","field":"#attr_value:event:1387#"},
  {"caption":"% Cloud","field":"#attr_value:event:1457#"},

with

  {"caption":"Walk % Sun","field":"#attr_value:parent_event:1387#"},
  {"caption":"Walk % Cloud","field":"#attr_value:parent_event:1457#"},
  {"caption":"Section % Sun","field":"#attr_value:event:1387:noparent#"},
  {"caption":"Section % Cloud","field":"#attr_value:event:1457:noparent#"},

This resolves one part of this issue, meaning that %Sun will no longer contain two values like "20;20".
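The effect of the new column definitions can be sketched in Python as follows. This is an illustrative resolver written for this thread, not the Indicia implementation: the function names and the attribute dictionaries are hypothetical, and the "section; walk" order for the old combined value follows the description given earlier.

```python
COLUMNS = [
    {"caption": "Walk % Sun", "field": "#attr_value:parent_event:1387#"},
    {"caption": "Walk % Cloud", "field": "#attr_value:parent_event:1457#"},
    {"caption": "Section % Sun", "field": "#attr_value:event:1387:noparent#"},
    {"caption": "Section % Cloud", "field": "#attr_value:event:1457:noparent#"},
]


def resolve_field(field, section_attrs, walk_attrs):
    """Resolve one field token against section-level and walk-level attributes."""
    _, entity, attr_id, *flags = field.strip("#").split(":")
    if entity == "parent_event":
        values = [walk_attrs.get(attr_id)]      # walk value only
    elif "noparent" in flags:
        values = [section_attrs.get(attr_id)]   # section value only
    else:
        # Old behaviour: the section value followed by the parent (walk) value.
        values = [section_attrs.get(attr_id), walk_attrs.get(attr_id)]
    return "; ".join(str(v) for v in values if v is not None)


def build_row(columns, section_attrs, walk_attrs):
    """Build one download row as a caption -> value mapping."""
    return {c["caption"]: resolve_field(c["field"], section_attrs, walk_attrs)
            for c in columns}


row = build_row(COLUMNS, {"1387": 60, "1457": 40}, {"1387": 80, "1457": 20})
```

With the old definition `#attr_value:event:1387#` the same inputs would give "60; 80" in a single column, whereas each of the four new columns holds exactly one value.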

CrisSevilleja commented 1 year ago

That is awesome, Jim, because that was generating a lot of problems with the downloads. Thanks!

Regarding what to do with the % of sun or clouds: I quickly asked the regional coordinators in Spain, and they don't think the value per section is needed. In general, we recommend that volunteers take the starting and ending values and average them (this is also in the manual). If we can reduce the complexity by keeping only the average value, let's go for that, but we will need to change the website and app data entry.

As for preference, the majority of schemes in Europe prefer % cloud rather than % sun. Cheers

xaviermestdagh commented 1 year ago

I agree %Cloud at walk level is great for Luxembourg!

JimBacon commented 1 year ago

I have fixed the way that ElasticSearch indexes sensitive records as proposed above. This fills in the missing transect code and walk %sun/cloud values of records marked as sensitive.

For comparison with the original screenshot which raised this issue, here is what the first few lines of the "download species occurrence from transects" now look like: [image]

JimBacon commented 1 year ago

I have added the walk-level %Sun and %Cloud to "download sample (visit) information from transects" for both My Data > My Downloads and Scheme Admin > Downloads as requested above.

I am going to close this issue as its original problem has been resolved, and I will raise a new issue for sorting out just how we want to record %Sun.