Open timrobertson100 opened 4 years ago
@timrobertson100 Nice! How about a second command:
$ cat input.dwca | dwca-tools --format JSON > output.json
I'm not entirely sure how a `cat input.dwca` would/could work here @jhpoelen
The input is a zip file, that on opening requires file sorting in order to implement joins without reading the whole thing into memory. The reader will also disregard supplementary files not of interest from the zip manifest. Streaming the output would be no issue.
I think it'd need to be:
dwca-tools --format JSON /input.dwca > output.json
Or am I missing something, please?
I can see your point about the auxiliary files. However, I figure you can stream the content of the dwca into temporary files instead of keeping it in memory. If the meta.xml occurs first, then you can already start ignoring auxiliary files and even start building / sorting that star schema model. My main point is that, especially given the size of most datasets, streaming processing is pretty important to keep storage/memory overhead low, as you noticed in the Jenkins/Spark setup you have now.
Perhaps worth a hacking session . . . ; )
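To make the "stream into temporary files" idea concrete, here is a minimal Python sketch, not code from dwca-io. It assumes the archive arrives on a non-seekable stream such as stdin; the function name `open_archive_from_stream` is hypothetical. Because the zip central directory sits at the end of the file, the sketch copies the stream to a temporary file on disk before opening it, so memory use stays flat regardless of archive size.

```python
import shutil
import tempfile
import zipfile


def open_archive_from_stream(stream):
    """Copy a (possibly non-seekable) byte stream to a temp file and open it as a zip.

    zipfile needs a seekable source because the central directory is stored at
    the end of the archive, so the bytes are staged on disk rather than in memory.
    """
    staged = tempfile.TemporaryFile()
    shutil.copyfileobj(stream, staged)  # copies in fixed-size chunks
    staged.seek(0)
    return zipfile.ZipFile(staged)


# Hypothetical usage for `cat input.dwca | some-tool`:
#   import sys
#   archive = open_archive_from_stream(sys.stdin.buffer)
#   print(archive.namelist())
```

The cost of this approach is one full copy of the archive on disk before any record can be read, which is the trade-off discussed in this thread.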
Thanks @jhpoelen
> can stream the content of the dwca into temporary files instead of keeping in memory
You may not be aware, but this is what the library already does internally when there are extensions (it uses a sort to temporary files followed by a streaming join). When no extensions exist it just streams directly from the file. Large archives will run on minimal memory and I still don't see an obvious way that this can be improved.
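For readers unfamiliar with the pattern, the following Python sketch illustrates a sort to temporary files followed by a streaming join. It is an illustration of the general technique only, not the dwca-io implementation; the function names, the tab-separated line format, and the assumption that core ids are unique are all choices made for this example.

```python
import heapq
import itertools
import tempfile


def _read_run(run):
    # Yield lines from one sorted run without their trailing newline.
    for line in run:
        yield line.rstrip("\n")


def external_sort(lines, key, chunk_size=100_000):
    """Sort lines by writing sorted chunks to temp files and merging them lazily."""
    runs = []
    lines = iter(lines)
    while True:
        chunk = sorted(itertools.islice(lines, chunk_size), key=key)
        if not chunk:
            break
        run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        run.writelines(line + "\n" for line in chunk)
        run.seek(0)
        runs.append(run)
    # heapq.merge pulls one line at a time from each run.
    return heapq.merge(*(_read_run(run) for run in runs), key=key)


def merge_join(core, ext, core_key, ext_key):
    """Yield (core_line, [extension lines]) for two inputs sorted on their keys."""
    groups = itertools.groupby(ext, key=ext_key)
    current = next(groups, None)
    for line in core:
        k = core_key(line)
        # Skip extension groups that sort before this core record.
        while current is not None and current[0] < k:
            current = next(groups, None)
        if current is not None and current[0] == k:
            yield line, list(current[1])
            current = next(groups, None)
        else:
            yield line, []
```

Only one chunk is held in memory while sorting, and only one group of extension rows while joining, which is why large archives can be processed with little memory.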
The main issues I'm aware of for efficient DwC-A reading are 1) the sorting and 2) the deflate compression step, as it is applied to individual files and not to chunks of the files. Both of these are inherently single-threaded operations, and they are where the processing overhead comes from. Sorting is better in Linux environments if there is access to a native `gnuSort` process.
These limitations are why GBIF converts DwC-A into Avro as a first step, after which we can run processing in parallel - e.g. in Spark. I suspect this is what you're referring to, but I don't think that is an issue with this library.
(Where a DWCA contains extensions, extracting and sorting the data files in parallel would give some improvement to the overall speed.)
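As a rough illustration of the extraction half of that idea, the Python sketch below decompresses several archive members concurrently. It is an assumption-laden example rather than anything in dwca-io: the function names are hypothetical, each worker opens its own zip handle, and the speed-up depends on the runtime releasing its interpreter lock during decompression, which CPython's zlib generally does for larger buffers. Each individual member is still inflated by a single thread.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_member(archive_path, member, target_dir):
    # Each worker opens its own handle so workers do not share a file position.
    with zipfile.ZipFile(archive_path) as zf:
        return zf.extract(member, target_dir)


def extract_in_parallel(archive_path, members, target_dir, workers=4):
    """Extract the named members concurrently and return their paths in input order."""
    # Create the target directory up front so workers do not race to create it.
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(
            lambda member: extract_member(archive_path, member, target_dir),
            members,
        ))
```

Sorting the extracted files in parallel would be a separate step, for example one sort process per file.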
I think we could take some inspiration from Avro tools. At least `cat`, `getmeta`, `getschema`, `random`, `tojson`, `totext` seem like the kind of things that would be useful in a CLI.
java -jar avro-tools-1.8.2.jar
Version 1.8.2
of Apache Avro
Copyright 2010-2015 The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
----------------
Available tools:
cat extracts samples from files
compile Generates Java code for the given schema.
concat Concatenates avro files without re-compressing.
fragtojson Renders a binary-encoded Avro datum as JSON.
fromjson Reads JSON records and writes an Avro data file.
fromtext Imports a text file into an avro data file.
getmeta Prints out the metadata of an Avro data file.
getschema Prints out schema of an Avro data file.
idl Generates a JSON schema from an Avro IDL file
idl2schemata Extract JSON schemata of the types from an Avro IDL file
induce Induce schema/protocol from Java class/interface via reflection.
jsontofrag Renders a JSON-encoded Avro datum as binary.
random Creates a file with randomly generated instances of a schema.
recodec Alters the codec of a data file.
repair Recovers data from a corrupt Avro Data file
rpcprotocol Output the protocol of a RPC service
rpcreceive Opens an RPC Server and listens for one message.
rpcsend Sends a single RPC message.
tether Run a tethered mapreduce job.
tojson Dumps an Avro data file as JSON, record per line or pretty.
totext Converts an Avro data file to a text file.
totrevni Converts an Avro data file to a Trevni file.
trevni_meta Dumps a Trevni file's metadata as JSON.
trevni_random Create a Trevni file filled with random instances of a schema.
trevni_tojson Dumps a Trevni file as JSON.
@MattBlissett @timrobertson100 Am pretty excited about all this. I think having a swiss army knife for dwca would be neat. Sort of like the ffmpeg for dwc.
Re: streaming vs. files - I am somewhat aware of the internals of the current dwca-io - like you say, it does a lot: expanding files, sorting, schema interpretation with some special magic (merging synonymous terms), and merging the various related files to populate a data model.
Re: avro tools inspiration - yes, this looks great. I think there are a lot of similarities between avro and dwca, in the way that both formats contain structured data.
Here's some rough ideas -
dwca-cli | description | avro equivalent |
---|---|---|
schema | prints meta.xml in some format | getschema |
meta | prints eml.xml if available in some format | getmeta |
occurrences | print occurrences if available | n/a |
taxa | print checklist if available in some format | n/a |
media | print media if available in some format | n/a |
data | print entire populated schema in some format | n/a |
... | ... | ... |
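One way the table above could map onto a command-line interface is sketched below in Python with `argparse`. This is purely illustrative: no such tool is described as existing in this thread, and the program name, the positional `archive` argument, and the `--format` choices are assumptions drawn from the table and the formats floated in this comment.

```python
import argparse

# Subcommand names and descriptions copied from the rough ideas table.
SUBCOMMANDS = {
    "schema": "prints meta.xml in some format",
    "meta": "prints eml.xml if available in some format",
    "occurrences": "print occurrences if available",
    "taxa": "print checklist if available in some format",
    "media": "print media if available in some format",
    "data": "print entire populated schema in some format",
}


def build_parser():
    """Build a parser with one subcommand per row of the table (hypothetical CLI)."""
    parser = argparse.ArgumentParser(prog="dwca-cli")
    sub = parser.add_subparsers(dest="command", required=True)
    for name, help_text in SUBCOMMANDS.items():
        cmd = sub.add_parser(name, help=help_text)
        cmd.add_argument("archive", help="path to a Darwin Core Archive (zip)")
        cmd.add_argument("--format", choices=["tsv", "json", "avro"], default="json")
    return parser
```

With this layout, `dwca-cli occurrences --format tsv input.dwca` would parse to `command="occurrences"`, `format="tsv"`, `archive="input.dwca"`.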
You can stream dwca `occurrences` provided (1) the meta.xml occurs before the data files or (2) the schema is provided up front via `cat archive.zip | dwca getschema`.
Supported formats: tsv, json, avro ?
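The "schema first, then stream records" flow can be sketched in Python as follows. This is a simplified illustration, not the dwca-io or preston implementation: it reads `meta.xml` through random access to the zip (which sidesteps the ordering requirement rather than solving it), handles only the core file, assumes UTF-8 and unquoted fields, and falls back to a comma delimiter when `fieldsTerminatedBy` is absent. The function names are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
import zipfile

NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


def read_core_descriptor(meta_xml):
    """Return (location, {column index: name}, header lines to skip, delimiter)."""
    core = ET.fromstring(meta_xml).find("dwc:core", NS)
    location = core.findtext("dwc:files/dwc:location", namespaces=NS)
    columns = {}
    id_el = core.find("dwc:id", NS)
    if id_el is not None:
        columns[int(id_el.get("index"))] = "id"
    for field in core.findall("dwc:field", NS):
        if field.get("index") is not None:
            # Use the last path segment of the term URI as a short column name.
            columns[int(field.get("index"))] = field.get("term").rsplit("/", 1)[-1]
    skip = int(core.get("ignoreHeaderLines", "0"))
    raw = core.get("fieldsTerminatedBy", ",")  # comma fallback is an assumption
    delimiter = {"\\t": "\t"}.get(raw, raw)    # meta.xml spells tab as backslash-t
    return location, columns, skip, delimiter


def stream_core_as_json(zf, out):
    """Write one JSON object per core record to `out`, reading the file lazily."""
    location, columns, skip, delimiter = read_core_descriptor(zf.read("meta.xml"))
    with zf.open(location) as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        reader = csv.reader(text, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader):
            if line_no < skip:
                continue
            record = {name: row[i] for i, name in sorted(columns.items()) if i < len(row)}
            out.write(json.dumps(record) + "\n")
```

Passing `sys.stdout` as `out` gives line-delimited JSON that downstream tools can consume record by record.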
Note that I ended up implementing my own cli for handling dwca ; ( I guess sometimes you just have to roll your own to fit a specific use case. With the cmd, you can now stream all of gbif using a command like the one described in https://github.com/bio-guoda/preston/issues/148 . Holler if you change your mind on building a cli, happy to give it a spin.
Thanks @jhpoelen. If your implementation is something that could be reused and you think it would be useful to others to have in this library, then PRs are always welcome.
@timrobertson100 for sure.
For now, the dwca streaming functionality is part of the preston cli. A small part of your dwca-io library is re-used (e.g., reading meta.xml / record iterator). Works great! Would be neat to have a small focused library that only does that. Right now, all these other dependencies are pulled in.
Is dwc-io still the library you use in the gbif infrastructure?
@timrobertson100 thanks for pointing out that you are no longer using the https://github.com/gbif/dwc-io module, but are using a copied version of it embedded in the "core" module of the https://github.com/gbif/pipelines instead.
Thanks for pointing out the "frictionless" data experiments. Does that mean that GBIF is departing from DwC? How are you planning to transition? Would I be correct to assume that this "frictionless" data experiment is related to the big splash you made earlier this month re: https://discourse.gbif.org/t/use-case-biotic-interactions-sottunga-island-melitaea-cinxia-population-study/3312 and other use cases?
Also, just wondering - why didn't you refactor and re-use the https://github.com/gbif/dwca-io module in the pipelines project? And, how are you planning to keep them in sync?
I think you've misread that - pipelines use dwc-io so consistency is not an issue.
GBIF is not departing from Darwin Core as it's the core standard for much of what GBIF does. We're exploring richer data exchange formats as part of the work to diversify the data model and frictionless looks like a reasonable packaging format to explore.
We also have a simple DWCA->AVRO CLI implementation here: https://github.com/gbif/pipelines/blob/dev/tools/archives-converters/src/main/java/org/gbif/converters/DwcaToAvroConverter.java#L28
@timrobertson100 I am glad I misread that and thanks for clarifying. Great to see you are re-using existing libraries. I was confused by what I thought were cloned implementations of the DwcaReader classes. Am still hoping I can convince you and your colleagues to publish to maven central to make it easier to discover and use your valuable libraries (#55). Free coffee and cookies? An ice cream? What will it take?
And great to hear that you are extending support for additional schemas beyond dwc. I've very much enjoyed using W3C CSV for many years now. I am curious to see how you'll end up managing the schemas (e.g., versioning) and data (e.g., provenance).
First command: