ExportFromIndexStream performance improvements required

AtlasOfLivingAustralia / biocache-store

Occurrence processing, indexing and batch processing

ExportFromIndexStream performance improvements required #216

Open ansell opened 7 years ago

ansell commented 7 years ago

The performance of ExportFromIndexStream needs to be improved to reduce the time required for the monthly regeneration of the downloads.ala.org.au archives. It currently takes about 46 hours. Ideally this would be reduced enough that it could run more often than once a month, keeping the downloads up to date with the biocache.

For reference, Generate GBIF Archives completes in under 4 hours, and it also hits every record.

adam-collins commented 6 years ago

To avoid CSV parsing issues, either change the output format to TSV or add TSV as an additional output format.
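A minimal sketch of what TSV output could look like, written in Python purely for illustration (the exporter itself is not Python). The `write_tsv` and `clean` names are hypothetical, and replacing embedded tabs and line breaks with spaces is an assumption about how such values would be handled, not the project's behaviour.

```python
def clean(value):
    # Assumption: embedded tabs and line breaks are replaced with spaces
    # so that every record stays on exactly one line.
    text = "" if value is None else str(value)
    return text.replace("\t", " ").replace("\r", " ").replace("\n", " ")


def write_tsv(rows, out):
    # No quoting or escaping is applied, so quotes and commas in the
    # data pass through unchanged and cannot confuse a CSV parser.
    for row in rows:
        out.write("\t".join(clean(v) for v in row) + "\n")
```

Because fields are split only on tabs, a consumer can read each line with a plain split and never has to deal with quoted fields.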

ansell commented 6 years ago

@djtfmartin What qualifies this for the "idea" label? It is a serious enough issue that I have more than once considered using another codebase to do the exporting so that exporting does not interrupt the other data management activities.

djtfmartin commented 6 years ago

Thanks @ansell. I guess for me it was just a little vague as to what to do here. I'm sure there is a problem, but we need detail (and a plan) to action something.

ansell commented 6 years ago

The current functionality is that once a month we create three archives of different sizes (small/medium/large) for each of a range of large taxonomic groups (Animals/Fish/Birds/Fungi/etc.), so that people can avoid using the downloads system.

The only issue right now (as far as I know) is that it takes an unusually long time and has been noticed multiple times by the data team as interfering with our other operations.

Exporting a single small/medium/large taxonomic group archive currently takes about as long as migrating the entire Cassandra database from one location to another over a network using concurrent record readers and concurrent record writers (4 hours or so).

ansell commented 6 years ago

[Screenshot, 2018-05-01 at 11:20:22 am]

For reference, it is running right now and using a consistent 40% CPU on cass-b4, counting both the Jenkins/biocache-store and the Cassandra CPU usage.

djtfmartin commented 6 years ago

I had a little look at the SOLR streaming API. Looks like we need to use docValues to make use of this feature.
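As a rough sketch of how the streaming export could be requested: Solr's `/export` request handler needs both a `sort` and an `fl` parameter, and every field named in them must be indexed with docValues. The Python below only builds the request URL and copies the response to a file. The base URL, core name, and field names in the usage note are hypothetical placeholders, not the actual biocache configuration.

```python
import shutil
from urllib.parse import urlencode
from urllib.request import urlopen


def build_export_url(core_url, query, fields, sort):
    # /export requires both "fl" and "sort"; all fields used in either
    # must have docValues enabled in the Solr schema.
    params = {"q": query, "fl": ",".join(fields), "sort": sort}
    return core_url.rstrip("/") + "/export?" + urlencode(params)


def stream_export(url, out_path):
    # Copy the response to disk in chunks instead of loading the whole
    # result set into memory.
    with urlopen(url) as response, open(out_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

For example, `build_export_url("http://localhost:8983/solr/biocache", "*:*", ["id", "species"], "id asc")` gives a URL that could be passed to `stream_export`, assuming a core at that (hypothetical) address with docValues enabled on `id` and `species`.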