Notes on the logic and data structures for NPMS Website, Plant Portal NPMS+, and Plant Portal Standard Mode

andrewvanbreda commented 5 months ago

Squares:Plots:Plot Groups Structure Notes.txt

andrewvanbreda commented 2 months ago

@sacrevert I don't have the data to be able to test it well effectively locally. It won't take long to re-arrange the report as a new version (assuming it allows me to re-arrange in this way). Then can compare on the multidev site attached to the warehouse. It would be good if it is faster (for whatever reason) as there are some reports such as the UKPoMS counting which is slow at moment.

sacrevert commented 2 months ago

Great, would be nice if this could help other projects too 👍👍

kazlauskis commented 2 months ago

Hi both, once the discussions are finished, please send me a summary of the reports to use. I will then add them to the app 🙂 Thank you!

andrewvanbreda commented 2 months ago

Hi @BirenRathod, Would you be able to pull the Plant Portal reports folder when you get a chance please. Cheers.

andrewvanbreda commented 2 months ago

Hi @kazlauskis @sacrevert ,

Report is now almost completely ready. (I just realised I didn't include Standard Mode that actually uses different taxon groups. That will only be a few minutes to change which I will do now, and let you know once it has been pulled onto the Warehouse).

Other than that, the report can still be used as follows

The species_list param can be set to wildflower indicator inventory additional

(The additional mode is the species returned to Plant Portal additional species list which is found on the NPMS+ Widflower/Indicator pages)

count_website_list parameter can be provided as before to limit websites for the count (leaving it out default to NPMS only count)

There is a new parameter occ_row_limit which limits the number of rows we use for the frequency count, when not supplied this is 50000.

An example call to the report could be this

https://warehouse1.indicia.org.uk/index.php/services/rest/reports/projects/plant_portal/app_get_species_lists_4.xml?species_list=wildflower&count_website_list=32,23&occ_row_limit=30000

This would return Wildflower and use 30000 rows from NPMS (website 32),iRecord (website 23) to do a frequency count.

In reality, wildflower and indicator actually return the same species (the list defined by organism keys) and likewise inventory and additional provide the same species (the list is defined by taxon groups)

The common names and synonyms columns include JSON lists of ID/names pairs that need to be included in the species search.

andrewvanbreda commented 2 months ago

The following are some optional notes about the ChatGBT version of the report if you are interested.

I have changed the report structure as per Oli's suggestion from ChatGBT. However, I am not actually noticing a speed difference (although I haven't timed it). I am not sure why this is, I will test it on other reports. It may simply be my report has changed over time, so those changes may have soaked up any speed advantage.

I will still experiment with other reports to see if we can get some speed imrovement.

Either way, one advantage of this alternative report structure is that the new structure allows for the new occurrence count row limiter which wouldn't have worked before.

If you are curious, Chat GBT is doing is the following.

It is taking the conditions that only rely on one table at a time (e.g. preferred = false, taxon_list_id = 15) and these are applied to the table first, and then temporary tables are created using these, and then the joining is done on these afterwards.

It isn't actually doing any rocket science, I am surprised the SQL executor doesn't just do this re-arrangement behind the scenes of its own accord at runtime. However, as it doesn't, this is a useful to know about.

sacrevert commented 2 months ago

It is taking the conditions that only rely on one table at a time (e.g. preferred = false, taxon_list_id = 15) and these are applied to the table first, and then temporary tables are created using these, and then the joining is done on these afterwards.

Is this what is referred to as a Common Table Expression (CTE) approach?

Anyway, glad it has some use!

Just a couple of things: (I) When one has a limit to the frequency count like 30000, is this structured in any way? For example, last 30000 records, or is it random in some way? Just concerned that we don't introduce some important bias in the counts. (II) @kazlauskis I am happy for you to use the website and count limit parameters in Andy's example, assuming that the 30000 rows is not expected to introduce biases. Although, maybe given that we only need to do this very occasionally, perhaps you could use a higher number (e.g. 500000)

andrewvanbreda commented 2 months ago

Hi @sacrevert Yes if you want to give it a name, it does match the definition of CTE.

My code doesn't specify an ordering to the occurrence rows. I just tried on my computer an unordered select and the resulting ordering was not obvious, however it was always the same.

There looks like a couple of ways to make it random.

Use order by random(), but it sounds like this is slow.
Something else, which looks like adds quite a lot of complexity I am not keen to add to a report that is already complex and has been tested.

I guess the problem with random if it does make it slower, then increasing the row count may have the same improvement in avoiding bias for the same or better performance.

Another problem with random is every-time you run the report you might get a different result, although I guess if it is doing being erratic it might be a sign to increase the row count.

I will see if I can make random optional and add it to the report as an on/off param.

sacrevert commented 2 months ago

@andrewvanbreda Thanks for checking. I wouldn;t bother implementing a specific randomiser, as I have little doubt that it will seriously slow the code. I think that given that we don't need to derived these counts very often, then just running it when we do with a reasonably large number (e.g. ~100,000) will be sufficient.

andrewvanbreda commented 2 months ago

Hi again @sacrevert @kazlauskis I have added standard mode. Call it with species_list=standard

This will be available next time someone pulls reports onto the Warehouse, you will know this has been done because the report will return 0 rows in that mode until then.

Before then, feel free to use the other modes as described in the first comment from today, just a few comments prior to here.

andrewvanbreda commented 2 months ago

@sacrevert @kazlauskis The report appears to have been pulled onto the warehouse, so is now complete including Standard Mode.

I guess the only thing left will be for you to close this issue once you are happy, unless you have further comments.

Also @sacrevert Once this issue is closed, I will be deleting that avb-report multidev site (as multidev sites are intended to be temporary). Let me know if you want a version of that page where you (and admins) can view the app reports put on the live site for the future.

sacrevert commented 2 months ago

I dont think admins need access the report, but maybe there is something that can be added to this issue for future reference/ info after closing? Kind of like a manual page for future us? Or maybe there is enough info here and we can just add the "info" tag?

I'm happy for @kazlauskis to say when this can be closed anyway

andrewvanbreda commented 2 months ago

Hi @sacrevert,

I am not really sure we need extra info because the Github report definition shows how to use parameters in the report.

https://github.com/BiologicalRecordsCentre/indicia-reports/blob/master/reports/projects/plant_portal/app_get_species_lists_4.xml

Also the advantage of just checking the report definition directly is that we know it is up to date and the latest information.

andrewvanbreda commented 2 months ago

Hi @BirenRathod Would you be able to pull the Plant Portal app_get_species_lists_4.xml report please if you get a chance. Thanks

BirenRathod commented 2 months ago

@andrewvanbreda I pulled this report now.

andrewvanbreda commented 2 months ago

Hi @BirenRathod I have made an extra adjustment, if you get a chance would be really useful if you could pull the report again. Thanks

andrewvanbreda commented 2 months ago

@BirenRathod Actually also useful to pull EBMS/UKBMS reports if you are able to while you are at it. Thank you

BirenRathod commented 2 months ago

@andrewvanbreda It has only pulled the report of the plantportal. not see any update of EBMS/UKBMS. have you already pushed the commit?

andrewvanbreda commented 2 months ago

@BirenRathod That is OK, it is useful to know it is up to date. I will have a look later at the report I changed, you are probbably right and I haven't pushed it or something. No worries for now. Thanks for the Plant Portal pull.

andrewvanbreda commented 2 months ago

Hi @sacrevert @kazlauskis, I believe this is now working as discussed in our meeting. Oli, do you want to check it, in particular the frequencies? You can change the params in the "Preset parameter values" box on the edit tab. At the time of writing, the report on that page is setup to return a 30000 count of relevant plant occurrence data from NPMS, PP, and iRecord. If those params are not included it will do a 50000 count from NPMS. https://avb-report-brc-plantportal-d10.pantheonsite.io/karolis-species-list-report

andrewvanbreda commented 2 months ago

Hi @BirenRathod Sorry to be really really annoying. Could i get you to pull the Plant Portal report again, I just realised there is a slight improvement to the frequency counting that can be made. Thank you.

BirenRathod commented 2 months ago

@andrewvanbreda This report has been updated now.

andrewvanbreda commented 2 months ago

@BirenRathod Brilliant thanks.

@sacrevert Proceed however you wish, from my perspective the report is ready

sacrevert commented 2 months ago

~~@andrewvanbreda how do this report work on the website?~~ ~~https://avb-report-brc-plantportal-d10.pantheonsite.io/karolis-species-list-report~~ ~~Is the box expecting a species name, a comma-separated list, other? it doesn't return anything if I enter a single species name and click filter.~~

Don't worry, I see that it expects something like "Wildflower"

sacrevert commented 2 months ago

The results below are from the standard settings, Wildflower list, reverse ordered by frequency. Whilst this seems to be returning sensible results on the whole, and much better than previously, there is still something odd going on with the aggregate entities. Things like the Juncus aggregate and the Rumex pairing should have counts greater than 0 if the NPMS dataset is being queried. I expect that this is something to do with the shift to using organism keys -- perhaps the current version of the UKSI does not contain a link between the organism key and the TVK in these cases? I don't think that this is something that we will be able to solve in advance of the app going live, and in some ways it doesn't matter as these aggregates are most important for the Wildflower and Indicator lists where the species pool is already restricted, and so it is not like these NPMS specific aggs will be lost below lots of other rare taxa in the dropdown.

However, @andrewvanbreda can you open a non-essential GitHub on the NPMS website GitHub to check the reason behind this, probably the action will be updating the UKSI, or checking that it is fixed when the UKSI is next updated.

@kazlauskis I am happy that the report is now returning sensible frequency results for the ordering.

andrewvanbreda commented 2 months ago

Hi @sacrevert I already added instructions below the box on the report page before you scrubbed previous comment. I will take a look and raise in Github if reason not immediately obvious.

sacrevert commented 2 months ago

I also admit that I don't understand why these aggregates have common names as synonyms, but don't return a default common name. Looking at the UKSI sandbox is unenlightening (e.g. comparing Juncus inflexus and the aggregate). Perhaps you can add this to the issue too.

sacrevert commented 1 month ago

This is just a note to say that the missing default common name issue has been dealt with here: https://github.com/BiologicalRecordsCentre/iRecord/issues/1617

As far as I understand it, these updates mean that we should regenerate the species report for @kazlauskis before launching the alpha version of the app, in order to pick up any such changes

andrewvanbreda commented 1 month ago

Hi @sacrevert Yes. You should just rerun the report to get the latest data, but no further coding should be required. It is a good idea to double-check any problems you were previously concerned about have gone when you rerun the report. My report takes its data from the cache tables, so is reliant on those being up to date.

NERC-CEH / npms-app

Notes on the logic and data structures for NPMS Website, Plant Portal NPMS+, and Plant Portal Standard Mode #78