AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

ExportFromIndexStream with filter file, filters locally rather than on the server #206

Open ansell opened 7 years ago

ansell commented 7 years ago

ExportFromIndexStream when run with a filter file does all filtering locally rather than on the server. The key line seems to be:

https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/export/ExportFromIndexStream.scala#L373

This prevented me from splitting the single monolithic bulk downloads regeneration job into separate jobs: each separate job would still need to download every record, which makes them unviable at the current records-per-second performance. See https://github.com/AtlasOfLivingAustralia/maintenance/issues/26

The workaround is to keep running the 47-hour bulk downloads regeneration job monthly.
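
To illustrate the difference between the two approaches, here is a minimal sketch in Python (not the Scala used in biocache-store, and not the actual ExportFromIndexStream code). It assumes the filter file holds one value per line and that the index accepts Solr-style filter queries (`fq`). The field name `data_resource_uid` and all function names are hypothetical, chosen only for illustration. Values are split into chunks, with each chunk sent as a separate request, so a single query stays under any boolean clause limit the server enforces.

```python
from typing import Dict, Iterable, Iterator, List


def read_filter_values(lines: Iterable[str]) -> List[str]:
    """Return the non-empty, de-duplicated values from a filter file, in order."""
    seen = set()
    values = []
    for line in lines:
        value = line.strip()
        if value and value not in seen:
            seen.add(value)
            values.append(value)
    return values


def quote(value: str) -> str:
    """Quote a value as a phrase, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def server_side_filters(field: str, values: List[str],
                        chunk_size: int = 500) -> Iterator[str]:
    """Yield one fq clause per chunk; each clause belongs in its own request,
    so the index returns only matching records."""
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        alternatives = " OR ".join(quote(v) for v in chunk)
        yield f"{field}:({alternatives})"


def local_filter(records: Iterable[Dict[str, str]], field: str,
                 values: List[str]) -> Iterator[Dict[str, str]]:
    """The behaviour described above: every record is streamed to the client,
    and non-matching records are discarded only after transfer."""
    wanted = set(values)
    return (r for r in records if r.get(field) in wanted)


# Hypothetical filter file contents: two distinct values, one blank line, one repeat.
example_values = read_filter_values(["dr1\n", "dr2\n", "\n", "dr1\n"])
example_filters = list(server_side_filters("data_resource_uid", example_values))
```

With server-side clauses like these, each split job would transfer only its own records instead of the whole index, which is what would make separate jobs viable.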

djtfmartin commented 6 years ago

I've added the 1.9x label to this as the filter file isn't supported currently in 2.x (we aren't using the ByteOrderPartitioner so key ranges aren't possible).
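
As a rough illustration of why key ranges depend on the partitioner, the sketch below (Python, a simplified model rather than the store's real behaviour) compares an order-preserving layout with a hashing one. It assumes a Cassandra-style hashing partitioner that places rows by a hash of the key; MD5 is used here as a stand-in, and the row key format is hypothetical. The thread does not say which partitioner 2.x actually uses.

```python
import hashlib
from typing import List


def md5_token(key: str) -> int:
    """Simplified token for a hashing partitioner: the MD5 digest as an integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def key_range_is_contiguous(storage_order: List[str], start: str, end: str) -> bool:
    """True if every key in [start, end] sits in one unbroken run of storage_order."""
    positions = [i for i, k in enumerate(storage_order) if start <= k <= end]
    if not positions:
        return True
    return positions == list(range(positions[0], positions[0] + len(positions)))


# Hypothetical row keys, not the real biocache key format.
keys = [f"dr{n:04d}|record{n:04d}" for n in range(1000)]

byte_ordered = sorted(keys)                # storage order follows key order
hashed = sorted(keys, key=md5_token)       # storage order follows token order

ordered_ok = key_range_is_contiguous(byte_ordered, "dr0100", "dr0200")
hashed_ok = key_range_is_contiguous(hashed, "dr0100", "dr0200")
```

In the order-preserving layout the requested key range is a single contiguous slice, so it can be scanned directly. In the hashed layout the same keys are scattered across the token space, so a key range cannot be turned into one range scan.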

ansell commented 6 years ago

The filter file will be required when we switch to 2.x, to avoid pressure on the downloads service.

@M-Nicholls this feature, generating offline downloads so users can avoid using the live downloads service, is not yet implemented for 2.x. If we switch over before it is implemented, the downloads on downloads.ala.org.au will stop being updated until it is.