gbif / gbif-common

Utility classes
Apache License 2.0

Information Request: Resource usage (esp. disk) for `sortInJava`? #32

Open csbrown opened 1 year ago

csbrown commented 1 year ago

Description:

We are contemplating using dwca-io to read DWCA files. However, our file-processing pipeline is limited in processing power/time, memory and disk availability (AWS Lambda has a max 15-minute runtime, 10 GB RAM and 10 GB disk). It looks like the dwca-io package sorts the files in the archive to facilitate a "join" operation, which we will need. Preferably the sort would happen in Java (`sortInJava`), since vanilla Lambda doesn't have access to GNU sort. Reading the method, it seems it may make up to 2 copies of a file on disk. As an example, the Taxon.tsv file in the DWCA for the GBIF Backbone is currently 2.1 GB unzipped; 3 of these would then be 6.3 GB, which is within the 10 GB limit but doesn't leave a lot of wiggle room.
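For concreteness, this is the back-of-the-envelope budget I'm working with; the 2-extra-copies figure is my own assumption from reading the method, not a measured value:

```java
// Rough disk budget for sorting Taxon.tsv inside a Lambda.
// All figures are estimates: 2.1 GB uncompressed input plus up to
// 2 temporary copies assumed to be written during the sort.
public class DiskBudget {
  public static void main(String[] args) {
    double inputGb = 2.1;         // uncompressed Taxon.tsv
    int extraCopies = 2;          // assumed intermediate/output copies
    double peakGb = inputGb * (1 + extraCopies);
    double lambdaDiskGb = 10.0;   // Lambda ephemeral storage cap
    System.out.printf("Peak disk: %.1f GB of %.1f GB (%.1f GB headroom)%n",
        peakGb, lambdaDiskGb, lambdaDiskGb - peakGb);
  }
}
```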

I see that there is logging information in this method that reports on disk usage. I was hoping that perhaps you all might be able to provide more information on disk usage from existing logs that you might have.

Request:

1) Is my 3x disk usage estimate approximately accurate?
2) In your experience, are there other DWCA core files significantly larger than the 2.1 GB GBIF Backbone that I should be worried about?
3) If you have any other statistics about runtime, memory usage and disk usage, those would also be greatly appreciated.

I understand if you don't have readily available information, and I can run my own experiments. I was just hoping that maybe y'all had vast troves of data on this subject just lying around already. :) Thanks in advance for any information at all.

mdoering commented 1 year ago

Occurrence archives can be significantly larger - eBird with > 1 billion records and Artportalen with > 100 million for example.

Note also that the Java sorting is not as efficient as GNU sort, and we do not use it at GBIF, so there is no (debug) logging we can provide AFAIK. The Java sorting was initially created for the IPT, which is required to run on non-*nix machines as well. Other than the Java sorting, dwca-io has a very low footprint.
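For anyone reading along: on *nix, delegating the sort to GNU sort from Java can be as simple as the sketch below. This is only an illustration of the shell-out approach, not the gbif-common code; the tab delimiter and first-column key are assumptions about the file being sorted.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

// Sketch of delegating the sort to GNU sort; not the gbif-common implementation.
// Assumes a tab-delimited file sorted on its first column.
public class GnuSortSketch {
  public static void sortFile(File in, File out) throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder(List.of(
        "sort", "-t", "\t", "-k", "1,1",
        "-o", out.getAbsolutePath(), in.getAbsolutePath()));
    pb.inheritIO(); // surface sort's stderr in our own output
    int exit = pb.start().waitFor();
    if (exit != 0) {
      throw new IOException("GNU sort exited with status " + exit);
    }
  }
}
```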

timrobertson100 commented 1 year ago

Just to add to what Markus wrote - my intuition is that budgeting for 3x the size of the uncompressed core file would be about right for sorting. (Edited to add: looking at the code, 3x is indeed required. It splits the input into N parts, sorts each in memory writing the output to intermediate files, and then merges those into the output using multiple cursors, each reading a sorted part.)
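For anyone estimating the footprint themselves, that split/sort/merge shape looks roughly like the sketch below. It is a minimal illustration of an external merge sort over plain line-oriented files with natural string ordering, not the actual `sortInJava` code; the chunk size and temp-file handling are simplified assumptions.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Minimal external merge sort: spill sorted chunks to disk, then merge them
// with one open reader ("cursor") per chunk. Peak disk is roughly the input
// plus the chunks plus the merged output, i.e. about 3x the input size.
public class ExternalSortSketch {

  public static void sort(Path input, Path output, int linesPerChunk) throws IOException {
    List<Path> chunks = splitAndSortChunks(input, linesPerChunk);
    mergeChunks(chunks, output);
    for (Path chunk : chunks) {
      Files.deleteIfExists(chunk); // intermediate copies released after the merge
    }
  }

  // Phase 1: read the input in chunks, sort each chunk in memory, spill to temp files.
  private static List<Path> splitAndSortChunks(Path input, int linesPerChunk) throws IOException {
    List<Path> chunks = new ArrayList<>();
    try (BufferedReader reader = Files.newBufferedReader(input)) {
      List<String> buffer = new ArrayList<>(linesPerChunk);
      String line;
      while ((line = reader.readLine()) != null) {
        buffer.add(line);
        if (buffer.size() == linesPerChunk) {
          chunks.add(writeSortedChunk(buffer));
          buffer.clear();
        }
      }
      if (!buffer.isEmpty()) {
        chunks.add(writeSortedChunk(buffer));
      }
    }
    return chunks;
  }

  private static Path writeSortedChunk(List<String> lines) throws IOException {
    Collections.sort(lines);
    Path chunk = Files.createTempFile("sort-chunk-", ".tmp");
    return Files.write(chunk, lines);
  }

  // Phase 2: k-way merge, keeping the smallest current line of each chunk in a heap.
  private static void mergeChunks(List<Path> chunks, Path output) throws IOException {
    PriorityQueue<Cursor> heap = new PriorityQueue<>();
    List<BufferedReader> readers = new ArrayList<>();
    try (BufferedWriter writer = Files.newBufferedWriter(output)) {
      for (Path chunk : chunks) {
        BufferedReader reader = Files.newBufferedReader(chunk);
        readers.add(reader);
        String first = reader.readLine();
        if (first != null) {
          heap.add(new Cursor(first, reader));
        }
      }
      while (!heap.isEmpty()) {
        Cursor smallest = heap.poll();
        writer.write(smallest.line());
        writer.newLine();
        String next = smallest.reader().readLine();
        if (next != null) {
          heap.add(new Cursor(next, smallest.reader()));
        }
      }
    } finally {
      for (BufferedReader reader : readers) {
        reader.close();
      }
    }
  }

  // One cursor per chunk: the current line plus the reader it came from.
  private record Cursor(String line, BufferedReader reader) implements Comparable<Cursor> {
    @Override
    public int compareTo(Cursor other) {
      return line.compareTo(other.line());
    }
  }
}
```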

There are not many, if any, checklists that would be bigger than the backbone. Occurrence datasets would likely blow your limit, though, either now or in the near future.

I think you’re probably safe to start with Lambda, knowing you could add something later to handle occasional large datasets via a special path (e.g. an SQS queue that triggers a temporary VM to come up).

Curious to know what you’re building?

Thanks

csbrown commented 1 year ago

I'm at the USGS, and we're in the early stages of building out an invasive species information system [1]. We decided pretty early on that we didn't want to be in the business of adjudicating phylogenetic concepts, and that we would lean on the GBIF backbone and ITIS instead (a lot of the data we take in already have ITIS ids).

We are currently also ingesting limited Occurrence records (from North American invasives-specific datasets such as NAS). We do not have a need, AFAIK, to ingest eBird, but this is good to know. We've been collaborating with folks at INBO who are trying to orchestrate a DWC extension related to invasive species management, which will hopefully make DWCA Occurrence-core records with the invasives extension the primary vehicle for our data partners to share data and for us to ingest it. One day. :)

Anyhoo, thanks kindly for all of the feedback on dwca-io, this has been very informative.

[1] https://geonarrative.usgs.gov/nationaledrrframework/