AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

Index new layers in a single reindex #315

Closed. ansell closed this issue 5 years ago

ansell commented 5 years ago

Adding new layers to biocache currently requires two reindexes before they are usable by spatial and other systems that search occurrence records:

https://github.com/AtlasOfLivingAustralia/biocache-store/issues/312#issuecomment-456239319

It would be useful to instead compute and store the full list of spatial layers during resampling, when the full list is known, and use that rather than relying on a scan through Cassandra to discover the layers again.

Fixing this should be a priority, given that we expect to add new layers fairly often in the upcoming period.

This may also be causing user assertions to partially fail while the fields are in Cassandra but not in SOLR, so shortening that period would be ideal:

https://github.com/AtlasOfLivingAustralia/biocache-hubs/issues/140#issuecomment-455042690

adam-collins commented 5 years ago

This is now achieved by using the SOLR instance schema instead of the schema at the solrConfigXmlPath (default: /data/solr/biocache/conf/solrconfig.xml).

To use the SOLR instance schema during 'index-local-node', delete the solrConfigXmlPath directory before running it.
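The deletion step above could be wrapped in a small guard so it is safe to run repeatedly. This is only a sketch: the function name remove_local_solr_conf is hypothetical, and the default directory is assumed to be the parent of the default solrConfigXmlPath mentioned in this comment.

```shell
#!/bin/sh
# Hypothetical helper (not part of biocache-store): remove the local SOLR
# config directory so that index-local-node falls back to the schema held
# by the SOLR instance.
# Assumption: the directory to remove is the parent of solrConfigXmlPath,
# i.e. /data/solr/biocache/conf when the default path is in use.
remove_local_solr_conf() {
    conf_dir="${1:-/data/solr/biocache/conf}"
    if [ -d "$conf_dir" ]; then
        echo "Removing $conf_dir so indexing uses the SOLR instance schema"
        rm -rf -- "$conf_dir"
    else
        echo "$conf_dir not present; nothing to remove"
    fi
}

# Usage, before starting index-local-node:
#   remove_local_solr_conf                # default location
#   remove_local_solr_conf /other/conf    # non-default solrConfigXmlPath parent
```

The function is only defined here, not called, so sourcing the snippet does not delete anything by itself.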

New fields are now added to the SOLR instance schema:

Still outstanding on this issue is adding the new fields that are found in additionalFields.list to the SOLR instance schema so they will be available in a second reindex. This will only apply to new misc fields records that are not indexed online.

ansell commented 5 years ago

We deployed the commit above to the aws-bstore nodes yesterday and ran a reindex overnight, but it didn't pick up the new fields: neither raw_sampling_protocol, which we need for https://github.com/AtlasOfLivingAustralia/biocache-service/issues/317, nor the new cl10906/etc. layers that we need to get layer loading started again. Both have been present in the additionalFields.list file for weeks, yet indexing is not picking them up.

djtfmartin commented 5 years ago

I'm unclear at the moment how SOLR schema changes are being propagated automatically with index-local-node.

To test raw_sampling_protocol indexing, I ran SOLR 7.x locally and made the changes to the schema explicitly:

https://github.com/AtlasOfLivingAustralia/ala-install/blob/master/ansible/roles/solrcloud_config/files/biocache/conf/schema.xml#L864

There isn't a need to include raw_sampling_protocol in additionalFields.list.
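One way to confirm that a field such as raw_sampling_protocol is actually present in the SOLR instance schema is to query the SOLR Schema API for it. The sketch below only builds the request URL; the function name schema_field_url is hypothetical, and the host, port and the collection name biocache in the usage comment are assumptions about the local setup.

```shell
#!/bin/sh
# Hypothetical helper: print the SOLR Schema API URL for a single field.
# Arguments: SOLR base URL, collection name, field name.
schema_field_url() {
    solr_base="${1%/}"   # drop one trailing slash, if present
    collection="$2"
    field="$3"
    printf '%s/%s/schema/fields/%s\n' "$solr_base" "$collection" "$field"
}

# Usage (assumed host, port and collection name; requires a running SOLR):
#   curl -fsS "$(schema_field_url http://localhost:8983/solr biocache raw_sampling_protocol)"
# With -f, curl exits non-zero when SOLR answers with an HTTP error,
# for example when the field is not defined in the schema.
```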

ansell commented 5 years ago

@sat01a Not sure why this has been closed on your board. It is definitely still not fixed.

djtfmartin commented 4 years ago

@adam-collins is there any reason why deleting the solrConfigXmlPath directory isn't the default for index-local-node?

Noting that in our Jenkins jobs we do this:

echo 'Removing old directories first.'
rm -rf /data/solr/merged_*
rm -rf /data/solr/solr-create/biocache/data*

echo 'Removing old config so indexing fetches current config from SOLR Cloud'
rm -rf /data/solr/biocache/conf

This has caused some confusion in the LA community when folks have tried to add cl* fields to the index, so I'm thinking we should just delete those directories by default for index-local-node.

cc @vjrj

vjrj commented 4 years ago

Thanks indeed Dave for your support as usual.

After rm -rf /data/solr/biocache/conf I get my index with the layer fields in our new solr7 server.

Also, because of this step we were suffering from other related issues like https://github.com/AtlasOfLivingAustralia/biocache-service/issues/368, which I'm gonna close now.

I'm trying to document this process here https://github.com/AtlasOfLivingAustralia/documentation/wiki/SOLR-Admin-Tasks so other nodes don't have the same problems. Can anybody improve and verify this page?

Maybe just by getting that info from your Jenkins jobs.

Thanks!