NVC floristic import observations

andrewvanbreda commented 7 years ago

Nested in #46

Again similarly to the thread I just opened for the Plant Att import, I am opening a thread for the floristic import. I haven't taken a look at it yet to refresh my memory but there looks to be some committed code for it. I am less sure that this is complete though in comparison to the Plant Att importer. Will take a look and let you know if it is ready for a run, however my guess is there are some outstanding questions for it

andrewvanbreda commented 7 years ago

Something to note about this is I don't have an nvc_floristic import file with preferred_tvks in, they are all species names. I have copied in the preferred_tvks from the Plant Att data into my file so I have preferred_tvks to test my code with (it doesn't matter too much too me that they are the wrong keys), that is OK for testing, but we need the real ones for the final run if you have them, or we can extract them from the DB with some SQL. We don't really want to be matching using species name if we can avoid it. I have just managed to persuade the importer to run without error, although I have yet to examine the data. I suspect some of it won't have imported until I change the code

sacrevert commented 7 years ago

Sorry to be dim, but are you saying that the user will need to provide TVKs?

andrewvanbreda commented 7 years ago

Hi Oli, As far as I am aware the user doesn't deal with the nvc floristic import and it is a one-off import of data performed by BRC. I have a data file David provided a while back. This would preferably have TVKs in instead of species names. The matching could be done with species names, but it is less reliable than an exact TVK. From a coding point of few it doesn't make it more difficult either way, the benefit is really for BRC to get the best import. David was able to provide a file with TVKs for Plant Att so I though nvc floristic might be similar. It is up to you, to my coding difficulty it doesn't make much difference

sacrevert commented 7 years ago

Got you. Sorry, I didn't initially twig that you were talking about the NVC tables. Yes, it would be better if we labelled the species with TVKs. I'll get on to this.

andrewvanbreda commented 7 years ago

Note (mainly to myself): The Community/Sub-community automatic termlist generation has passed testing with random selection of data. One observation is that there is a community "#N/A", but am unsure if this should be an actual community to attach to, or whether we just ignore the community completely in this case. I expect the latter, but the importer currently uses an '#N/A' community. Easy to change if required

andrewvanbreda commented 7 years ago

Note (mainly to myself): The Community/Sub-community attributes (Vegetation height cm) import into the new termlist_term_attribute_values has passed testing

andrewvanbreda commented 7 years ago

Hi, I have now implemented the taxa/Community/Sub-community attributes. These was some debate about this as it was hard to implement as it is a taxon/community combination the attribute is for. Originally we were going to use JSON encoding. In the end I have fallen back on another idea that I think has worked and that is to use the source id in the taxa_taxon_list_attribute_value to point at the community. This seems to work. If we still need the source field to say something like plant portal, the taxa_taxon_list_attribute (not value) source id is still available to use.

The NVC floristic importer is now ready for a test run at BRC. I will now include Biren in this thread.

andrewvanbreda commented 7 years ago

@BirenRathod this importer is ready for a test run.

We will probably do a similar thing to last time (Plant Att) and run it and then get Oli to examine the results. So perhaps when Oli is back you two can decide when you are both available to give it a run.

If you give me half a day notice I can send you the SQL in one file to make it easier to run (and put in the ids like taxa taxon list to make it easier to run for you)

sacrevert commented 7 years ago

@BirenRathod Biren - do you just want to go ahead with the test when you have a moment, and report the results to Andy and myself? I think my time would be better spent adding in the TVKs for the species in the NVC table.

BirenRathod commented 7 years ago

Yes, I can do it.

When @andrewvanbreda is ready to send me the next batch. I think tomorrow (Tuesday 13/6) first day in the morning will be fine.

andrewvanbreda commented 7 years ago

Actually that is a good point, I have forgotten we don't have the required TVKs yet (my test file is just full of fake ones). So we have 3 choices

I change the code to use the species name instead, purely for the purposes of the test. I don't think I would recommend this as we want to test the same code that would be run on the final warehouse.
We wait until you have finished getting the tvks
Or we could just get the tvks for a few rows, then perform a test, then perform the rest of the test later.

I think I might recommend idea 3.

Oli, is it a lot of work to fill in the TVKs?. Could I just write biren a database query and try to fill in the tvk column like that? It might be worth at least giving that a go, even if it doesn't work fully especially if it takes you hours

sacrevert commented 7 years ago

Andy - if you send me the current file, I will try and do it this afternoon. I think that the database query route will need sense-checking regardless, so probably easier for me to do it from scratch.

andrewvanbreda commented 7 years ago

That is very true, it would be no more reliable than if I had used the name to match in the first place, so my idea makes little sense actually. Just replying to your email now

andrewvanbreda commented 7 years ago

I have sent the email now, can you let me know if it arrives. There should be a spreadsheet attached to it.

andrewvanbreda commented 7 years ago

Sending failed, will try and attach to this thread in a sec

andrewvanbreda commented 7 years ago

NVC_floristic_tables_26_3_2008.xls.zip

andrewvanbreda commented 7 years ago

No, that must be it. It is now attached to this thread also. And that is the roughly size of it (unzipped). My computer is a reporting a very slight difference in size, but I think as I am running a mac that is probably due to the way the disc is formatted.

Once you are done with it, send it back to me as I will need to make slight adjustments to the way the file is formatted before importing.

sacrevert commented 7 years ago

NVCtables_v2.zip

Here is the new version of the file. I have stripped the occasional trailing white spaces from the species names, and entered the (currently) preferred TVK for vascular plants and bryos. Rows not representing species have #N/A in that column; species (lichens and seaweeds) which do not yet have a TVK are given a 0. @BirenRathod @andrewvanbreda

andrewvanbreda commented 7 years ago

Hi, I am downloading this now and will run on my own machine this morning. Hopefully won't take too long, just make sure it runs fine with the new TVKs, the code SHOULD already work, but you never know I suppose with the new file Will get back to you as soon as I can.

andrewvanbreda commented 7 years ago

Hi, Sorry for the delay, what I was trying to do on my machine ended up being quite a faff for a lot of reasons. Performance and file size issues trying to import the species/tvks from the import file, also needed to change code to take into account file changes, also needed to adjust the format of the file anyway. Then some of my tests failed with the altered code. Anyway, I think it is done now. @BirenRathod, in a second I am going to attach the files. One is the import data file you will need to use. The other is the sql, the only change you will need to make to the import file is file path, same as Plant Att. This needs to be run on the dev warehouse, same as the Plant Att test.

Two further things to note. I think we need to examine these results very carefully, compared to Plant Att this importer is doing a lot of entirely new things, also on my own box it has been impossible to reproduce the exact database conditions the importer will find when running at BRC. For this latter reason I am also very unsure of what the performance will be like. I couldn't import the species very easily on my own box and there were lots of duplicates causing a very slow import. Fingers crossed that a species list without duplicates will perform better.

One last thing, the species without a tvk (marked with TVK 0 currently) will not have any attributes imported during this test

andrewvanbreda commented 7 years ago

NVCtables_v2_avb_formatted.csv.zip

andrewvanbreda commented 7 years ago

Obviously u will need to unzip file

sacrevert commented 7 years ago

Thanks Andy. I look forward to the test results. I'm not sure I understand the point about duplicates. Species that appear in more than one community type are an essential part of this dataset. This dataset will only be imported once - so, is performance that much of an issue?

andrewvanbreda commented 7 years ago

nvc_import_for_biren.txt

andrewvanbreda commented 7 years ago

All yours Biren, just let me know if you have any questions. You will need to check with David/Oli (or whoever) the priority level of running this, no rush from my end. Needs running on Dev warehouse

sacrevert commented 7 years ago

@BirenRathod Not a priority, but good to keep the momentum as we are making good progress.

BirenRathod commented 7 years ago

I have now uploaded data from CSV. CSV took just 1 sec to upload data.

Hear is the status of each script.

First script took 16.6 secs. Second script took 43.2 secs. Third script took 62 msecs. Fourth script took 1:59 mins Fifth script took 1:56 mins

andrewvanbreda commented 7 years ago

Hi Biren,

Performance was a complete unknown, those numbers are fine. Especially for something that only runs once. Thanks for running that Biren The next step will be to inspect the results. @sacrevert would you like me to write some SQL again? This one is actually more complex than the Plant Att, so I will need to explain what I have done so we are all happy with it. There are lots of new things we haven't done before.

sacrevert commented 7 years ago

@andrewvanbreda Yes, that would be useful. Thanks.

andrewvanbreda commented 7 years ago

Hi Oli,

OK sure, I will do that soon. David said NPMS is currently higher priority, however I also expect to have that finished shortly anyway.

andrewvanbreda commented 7 years ago

@sacrevert

I will write you some sql for the NVC import test in a separate message (not tonight). However here is a quick overview of the data structure. I won't spend too much time explaining as it is obviously time that can be spent on other things, however it is worth explaining to some extent for a few reasons

A one off import is important to get right at the start
My interpretation of the data structure might not be correct, you might say "oh no that is completely wrong"
It will aid your understanding of Indicia also.

The basic data structure is this

The Communities and Sub Communities are held in a termlist hierarchy.

The Codes are held as synonyms.

The community level attributes such as Number of species per sample are held in the new termlist terms attribute values. The termlist term attribute tables are new but work in the same way as other attributes

The problematic attributes are the Species Constancy Value and Maximum Abundance of Species. These are held against a combination of the community and species. To get round this issue we save it as a taxa_taxon_list_attribute_value as usual.....but we use the source_id field on the taxa_taxon_list_attribute_value to point to the termlist term item for the community/sub-community This way each taxa taxon list attribute's value points to a community

OK that is it. I have tried to put it as simply as possible. Let me know if you have questions However keep in mind the importer is logically quite a lot more complex than the plant att one even though there less data and possibly less code. For instance, we have to extract community level attributes from the species column. Also we are using new things like the termlist term attributes. For this reason I would suggest we are even more careful with it than the plant att importer (although I have tested it with 100% success on my own box)

andrewvanbreda commented 7 years ago

@sacrevert seeing as the community level attributes such as "Number of species per sample" work in the same way as other attributes (but are instead just termlist terms attributes). Do you want to see if you can test the data for those without assistance from me, should be similar to Plant Att. Then I can just write you the SQL for the community/taxon combination attributes

sacrevert commented 7 years ago

NVCtables_v3.zip

Here is what should be the final version of the file. Again, I have stripped the occasional trailing white spaces from the species names, and entered the (currently) preferred TVK for vascular plants, bryos, lichens and seaweeds. Rows not representing species have NA in the TVK column. There should now be no species (or genera) without a TVK. @BirenRathod @andrewvanbreda

andrewvanbreda commented 7 years ago

@sacrevert Just loading the data onto my new laptop so I can write that sql for you.

I noticed however the supplied file won’t work with the importer without alteration.

For future reference (for me) the alterations I seem to have needed to make are as follows

All instances of NA replaced with empty string
Near the end of the file there are a few rows where I had to put empty cells in, these are indicated by empty commas like ,,, at the end of rows. Some of the empty cells were missing entirely from existing rows
Removal of headerless first column with numbers

I am pretty sure I also had to make these changes on the previous file NVCtables_v2 file before recent changes. So I am hopeful the alterations I have made aren’t significant.

But this still leaves a slight issue, as we have already done the dev warehouse run on the previous version of the file, so the next step after testing that would of been a live run.

Running the importer on live with a new version of the file for the first time clearly won't be an acceptable situation. However me spending time testing it on my own machine when we believe the risk is small also isn't good given budget constraints.

So my suggested course of action would is

Me supply a newly formatted version of the new file in a minute
Me write sql for Oli on my own box, and to test that SQL run it against the import I have just run with the new file on my machine (Although these tests will not be lengthy like they were with the previous version of the file)
Oli to continue with testing the Dev Warehouse with the SQL I will supply. Note down all tests run on the dev warehouse
Do the final run eventually, which will consist of a) Confirmation run on test warehouse b) Rerun the dev warehouse tests on the test warehouse (quickly hopefully as we will already know where what the tests to do are) c) If they pass then run immediately on live

andrewvanbreda commented 7 years ago

SQL to follow here. Notes on use. For simplicity of programming (actually simplicity of testing the sql works) I have supplied separate sql for Maximum abundance of species and species constancy value. You need need to replace the tag with the tvk you wish to use. You need to replace the tag with the main community (if there isn't a parent) or the sub-community (if there is a parent). Do not specify the main community in the case of a sub-community, as you will get several results when you run the SQL and you can just choose the applicable one (as I display the parent community in a column for you). Again this is to simplify the programming This might sound confusing but hopefully you will see if is straightforward when you still running it, there are examples in the sql comment on the lines in question

andrewvanbreda commented 7 years ago

select ttla.caption as Attribute,ttlav.int_value as maximum_abundance_of_species ,t_parent_com_code.term as parent_community_code_if_applicable,t_parent_com.term as parent_community_if_applicable,t_source_code.term as community_or_sub_community_code,t_source.term as community_or_sub_community, t.taxon as taxon, t.external_key as tvk from indicia.taxa_taxon_list_attribute_values ttlav join indicia.taxa_taxon_lists ttl on ttl.id = ttlav.taxa_taxon_list_id and ttl.deleted=false join indicia.taxa t on t.id = ttl.taxon_id AND t.deleted=false AND t.external_key = -- e.g 'NBNSYS0000003159' join indicia.taxa_taxon_list_attributes ttla on ttla.id = ttlav.taxa_taxon_list_attribute_id AND ttla.caption = 'Maximum abundance of species' AND ttla.deleted=false join indicia.termlists_terms tt_source on tt_source.id = ttlav.source_id AND tt_source.deleted=false join indicia.terms t_source on t_source.id = tt_source.term_id and t_source.deleted=false AND t_source.term = -- e.g. 'Polygonum aviculare-Ranunculus repens sub-community'. join indicia.termlists_terms tt_source_code on tt_source_code.meaning_id = tt_source.meaning_id and tt_source_code.id != tt_source.meaning_id and tt_source_code.deleted=false join indicia.terms t_source_code on t_source_code.id = tt_source_code.term_id and tt_source_code.deleted=false left join indicia.termlists_terms tt_parent_com on tt_parent_com.id = tt_source.parent_id and tt_parent_com.deleted=false left join indicia.terms t_parent_com on t_parent_com.id = tt_parent_com.term_id and t_parent_com.deleted=false left join indicia.termlists_terms tt_parent_com_code on tt_parent_com_code.meaning_id = tt_parent_com.meaning_id and tt_parent_com_code.id != tt_parent_com.id and tt_parent_com_code.deleted=false left join indicia.terms t_parent_com_code on t_parent_com_code.id = tt_parent_com_code.term_id and tt_parent_com_code.deleted=false;

andrewvanbreda commented 7 years ago

select ttla.caption as Attribute,t_species_con_val.term as species_constancy_value ,t_parent_com_code.term as parent_community_code_if_applicable,t_parent_com.term as parent_community_if_applicable,t_source_code.term as community_or_sub_community_code,t_source.term as community_or_sub_community, t.taxon as taxon, t.external_key as tvk from indicia.taxa_taxon_list_attribute_values ttlav join indicia.termlists_terms tt_species_con_val on tt_species_con_val.id = ttlav.int_value and tt_species_con_val.deleted=false join indicia.terms t_species_con_val on t_species_con_val.id = tt_species_con_val.term_id and t_species_con_val.deleted=false join indicia.taxa_taxon_lists ttl on ttl.id = ttlav.taxa_taxon_list_id and ttl.deleted=false join indicia.taxa t on t.id = ttl.taxon_id AND t.deleted=false AND t.external_key = -- e.g 'NBNSYS0000003797' join indicia.taxa_taxon_list_attributes ttla on ttla.id = ttlav.taxa_taxon_list_attribute_id AND ttla.caption = 'Species constancy value' AND ttla.deleted=false join indicia.termlists_terms tt_source on tt_source.id = ttlav.source_id AND tt_source.deleted=false join indicia.terms t_source on t_source.id = tt_source.term_id and t_source.deleted=false AND t_source.term = -- e.g. 'Lolium perenne-Plantago lanceolata grassland'. join indicia.termlists_terms tt_source_code on tt_source_code.meaning_id = tt_source.meaning_id and tt_source_code.id != tt_source.meaning_id and tt_source_code.deleted=false join indicia.terms t_source_code on t_source_code.id = tt_source_code.term_id and tt_source_code.deleted=false left join indicia.termlists_terms tt_parent_com on tt_parent_com.id = tt_source.parent_id and tt_parent_com.deleted=false left join indicia.terms t_parent_com on t_parent_com.id = tt_parent_com.term_id and t_parent_com.deleted=false left join indicia.termlists_terms tt_parent_com_code on tt_parent_com_code.meaning_id = tt_parent_com.meaning_id and tt_parent_com_code.id != tt_parent_com.id and tt_parent_com_code.deleted=false left join indicia.terms t_parent_com_code on t_parent_com_code.id = tt_parent_com_code.term_id and tt_parent_com_code.deleted=false;

andrewvanbreda commented 7 years ago

Oli, you should be able to write SQL for the community level attributes yourself.

In the database I have these listed as follows

"Bryophyte cover %" "Bryophyte height mm" "Ground layer cover %" "Ground layer height mm" "Herb cover %" "Herb height cm" "Mean total cover" "Number of species per sample" "Shrub cover %" "Shrub height cm" "Shrub height m" "Shrub/herb cover %" "Shrub/herb height cm" "Tree cover %" "Tree height" "Tree/shrub cover %" "Vegetation cover %" "Vegetation height cm" "tree/shrub height"

Obviously that list is only correct if the code is correct. Don’t forget these are termlist_term attributes, but the your sql should work in much the same way as any other kind of attribute

I think that is it from me hopefully. Actually I just noticed we might want a source_id on the attributes like PlantAtt has. However I am also noticing the termlists_terms_attributes doesn’t even have this column in the database table. I don’t want for us to get confused with the situation. Please continue with the tests, ignore what I just said about the sources and I will put that on my to do list for further discussion

andrewvanbreda commented 7 years ago

NVCtables_v3_avb_formatted.txt

Newly formatted file for the importer @sacrevert I have just noticed that actually the first column (which was removed from this file) is not the same as the row number column by the time you reach the end of the file. Neither of these columns is imported. The headerless column just seems to display the row count in the import file, but I have actually only just noticed the "row number" column itself is not an actually count of rows in the file. Is this column significant as we aren't importing it? e.g. "Row number" 57780 is actually on file row file 47907

sacrevert commented 7 years ago

@andrewvanbreda sorry Andy, I missed this question. No, the row.number column is not important to us. I was a bit concerned that there were 'missing' rows however (e.g. between 40 and 53), but checking back to the original NVC communities table available from JNCC shows that the discrepancy is also present in their table, and thus must result from some historic formatting issue. For the row 40/53 example I the checked the original printed book, and there is no missing information for this community. So, there does not appear to be any missing data, and the row number is not important anyway.

sacrevert commented 7 years ago

FYI I am now looking at the SQL/understanding the data structure and will get back to you later.

andrewvanbreda commented 7 years ago

@sacrevert OK, just let me know when you are ready. I think there is still a question about whether we need to do something with sources, but we'll worry about your findings first before discussing that

sacrevert commented 7 years ago

@andrewvanbreda The maximum abundance of species SQL works as expected on dev, however, I can’t get the ‘Species constancy value’ example to work, and I’ve tried several TVK/community name combinations, including ones that work in the maximum abundance SQL. Here is an example of what is not working (with comments removed):

Correction: it does work, but does not return results.

select ttla.caption as Attribute,t_species_con_val.term as species_constancy_value ,t_parent_com_code.term as parent_community_code_if_applicable,t_parent_com.term as parent_community_if_applicable,t_source_code.term as community_or_sub_community_code,t_source.term as community_or_sub_community, t.taxon as taxon, t.external_key as tvk from indicia.taxa_taxon_list_attribute_values ttlav join indicia.termlists_terms tt_species_con_val on tt_species_con_val.id = ttlav.int_value and tt_species_con_val.deleted=false join indicia.terms t_species_con_val on t_species_con_val.id = tt_species_con_val.term_id and t_species_con_val.deleted=false join indicia.taxa_taxon_lists ttl on ttl.id = ttlav.taxa_taxon_list_id and ttl.deleted=false join indicia.taxa t on t.id = ttl.taxon_id AND t.deleted=false AND t.external_key = 'NBNSYS0000002544' join indicia.taxa_taxon_list_attributes ttla on ttla.id = ttlav.taxa_taxon_list_attribute_id AND ttla.caption = 'Species constancy value' AND ttla.deleted=false join indicia.termlists_terms tt_source on tt_source.id = ttlav.source_id AND tt_source.deleted=false join indicia.terms t_source on t_source.id = tt_source.term_id and t_source.deleted=false AND t_source.term = 'Polygonum aviculare-Ranunculus repens sub-community' join indicia.termlists_terms tt_source_code on tt_source_code.meaning_id = tt_source.meaning_id and tt_source_code.id != tt_source.meaning_id and tt_source_code.deleted=false join indicia.terms t_source_code on t_source_code.id = tt_source_code.term_id and tt_source_code.deleted=false left join indicia.termlists_terms tt_parent_com on tt_parent_com.id = tt_source.parent_id and tt_parent_com.deleted=false left join indicia.terms t_parent_com on t_parent_com.id = tt_parent_com.term_id and t_parent_com.deleted=false left join indicia.termlists_terms tt_parent_com_code on tt_parent_com_code.meaning_id = tt_parent_com.meaning_id and tt_parent_com_code.id != tt_parent_com.id and tt_parent_com_code.deleted=false left join indicia.terms t_parent_com_code on t_parent_com_code.id = tt_parent_com_code.term_id and tt_parent_com_code.deleted=false;

sacrevert commented 7 years ago

hi @andrewvanbreda

On our dev warehouse the species constancy term ids are as follows (ids = 11450-11454); see:

select * from terms order by created_on desc limit 20;

In your SQL this is joined to termlists_terms (tt_species_con_val) as folllows

join indicia.terms t_species_con_val on t_species_con_val.id = tt_species_con_val.term_id and t_species_con_val.deleted=false

So the termlists_terms.id's are 12486-12490 (see the results of following SQL)

select * from termlists_terms where term_id between 11450 and 11454;

In your SQL the termlists_terms are then joined to the taxa_taxon_lists_attribute_values (ttlav) as follows:

join indicia.termlists_terms tt_species_con_val on tt_species_con_val.id = ttlav.int_value and tt_species_con_val.deleted=false

so the values 12486-12490 should appear in the ttlav field int_value. However, they do not appear to:

select * from taxa_taxon_list_attribute_values where int_value between 12486 and 12490;

returns no results at the moment.

andrewvanbreda commented 7 years ago

@sacrevert OK, those termlist_term ids are definitely right as I checked the warehouse UI.

What does the following do? (this is a slightly different checks as the attribute values could be there but the int_value might not be populated)

select count (id) from indicia.taxa_taxon_list_attribute_values where taxa_taxon_list_attribute_id = 82;

sacrevert commented 7 years ago

That returns 0. So I suppose that means it is not populated?

andrewvanbreda commented 7 years ago

it does, I will have a think why that might of happened. Don't forgot that apart from on my own machine, this was the first run. I will get back to you about it. I will look at the sql and see how it might of failed to work

sacrevert commented 7 years ago

Fine, no problem.

andrewvanbreda commented 7 years ago

OK, I think I can see what is wrong. That attribute seems to have been added as a text attribute instead of a lookup. What is causing this is actually only one character in the code. I am sure I saw this working fine on my box, so maybe it accidentally got changed after, or perhaps my machine has a setting that is causing it not to be an issue on my box. I think what will happen is I will apply the fix on my box, have a quick test. Then we can just rerun that bit of the code again on the dev warehouse. I don't think it is worth doing a full rerun as we will do a full test warehouse rerun anyway just before live.

Just to check one thing, did you manage to check the community level attributes were OK?

sacrevert commented 7 years ago

Just spotted a separate issue:

select * from terms order by created_on desc limit 100;

Note here that at some point you have done a find-and-replace on 'na's on the spreadsheet, and this has also deleted the string 'na' from all community names

e.g. see

select * from terms where id = 9876;

This should be 'Festuca ovina ...', but is "Festuca ovi-Avenula pratensis grassland" on the dev warehouse at the moment.

BiologicalRecordsCentre / BSBI-Card-and-PlantPortal-DEPRECATED-

NVC floristic import observations #56