ansell opened 7 years ago
To avoid CSV parsing issues, change the output format to TSV, or add TSV as an additional output format.
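The advantage of TSV here is that, if tab and newline characters are stripped or replaced inside field values, no quoting or escaping rules are needed at all, so downstream consumers can split on tabs naively. A minimal sketch of that idea (the `TsvWriter` class and `toTsvLine` helper are hypothetical, not part of biocache-store):

```java
import java.util.List;

public class TsvWriter {
    // Flatten a record's fields into a single TSV line.
    // Tabs, carriage returns, and newlines inside values are replaced with
    // spaces, so the output never needs the quoting rules that make CSV
    // parsing fragile.
    static String toTsvLine(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append('\t');
            String value = fields.get(i) == null ? "" : fields.get(i);
            sb.append(value.replaceAll("[\\t\\r\\n]", " "));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A value containing an embedded tab is sanitised rather than quoted.
        System.out.println(toTsvLine(
                List.of("urn:lsid:example:1", "Acacia dealbata", "a note\twith a tab")));
    }
}
```

Whether silently replacing embedded tabs/newlines is acceptable would need to be confirmed against how the current CSV exports are consumed.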
@djtfmartin What qualifies this for the "idea" label? It is a serious enough issue that I have more than once considered using another codebase to do the exporting so that exporting does not interrupt the other data management activities.
Thanks @ansell. I guess for me it was just a little vague as to what to do here. I'm sure there is a problem, but we need detail (and a plan) to action something.
The current functionality is that, once a month, we create three archives of different sizes (small/medium/large) for each of a range of large taxonomic groups (Animals/Fish/Birds/Fungi/etc.) to allow people to avoid using the downloads system.
The only issue right now (as far as I know) is that it takes an unusually long time, and the data team has noticed it interfering with our other operations on multiple occasions.
Exporting a single small/medium/large taxonomic group archive currently takes about as long as migrating the entire Cassandra database from one location to another over a network using concurrent record readers and concurrent record writers (4 hours or so).
As a reference, it is running right now, and using a consistent 40% CPU on cass-b4, including the Jenkins/biocache-store and the Cassandra CPU usage.
I had a quick look at the SOLR streaming API. It looks like we need to enable docValues on the exported fields to make use of this feature.
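For reference, Solr's streaming/export functionality can only return fields that are stored as docValues, so any field we want in the export would need its schema definition updated along these lines (field names here are illustrative, not taken from the actual biocache schema):

```xml
<!-- Hypothetical sketch of a schema change; actual field names and types
     must match the biocache SOLR schema. Enabling docValues requires a
     full reindex before the streaming/export API can use the field. -->
<field name="id" type="string" indexed="true" stored="true" docValues="true"/>
<field name="scientificName" type="string" indexed="true" stored="true" docValues="true"/>
```

Note that docValues are only supported for non-analyzed field types (string, numeric, date), which may constrain which fields can go through the streaming path.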
The performance of ExportFromIndexStream may need to be improved to reduce the time required for the monthly regeneration of the downloads.ala.org.au archives. Currently it takes about 46 hours, which hopefully could be improved to allow it to be run more regularly than once a month to keep the downloads up to date with the biocache.
For reference, Generate GBIF Archives completes in under 4 hours, and it also hits every record.