gbif / dwca-io

Darwin Core Archive IO
Apache License 2.0

Implement a CLI #53

Open timrobertson100 opened 4 years ago

timrobertson100 commented 4 years ago

First command:

dwca-tools --format JSON /input.dwca /tmp/output.json

jhpoelen commented 4 years ago

@timrobertson100 Nice! How about a second command:

$ cat input.dwca | dwca-tools --format JSON > output.json

timrobertson100 commented 4 years ago

I'm not entirely sure how a cat input.dwca would/could work here @jhpoelen

The input is a zip file that, on opening, requires file sorting in order to implement joins without reading the whole thing into memory. The reader will also disregard supplementary files not of interest from the zip manifest. Streaming the output would be no issue.

I think it'd need to be:

dwca-tools --format JSON /input.dwca > output.json

Or am I missing something, please?

jhpoelen commented 4 years ago

I can see your point about the auxiliary files. However, I figure you can stream the content of the dwca into temporary files instead of keeping it in memory. If the meta.xml occurs first, then you can already start ignoring auxiliary files and even start building / sorting that star schema model. My main point is that, especially given the size of most datasets, streaming processing is pretty important to keep storage/memory overhead low, as you noticed in the jenkins/spark setup you have now.
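
A minimal sketch of what reading an archive from stdin could look like, using only the JDK's java.util.zip.ZipInputStream, which walks the zip's local entry headers in order instead of seeking to the central directory at the end of the file. The class name DwcaStdinSpool and both helper methods are hypothetical and not part of dwca-io; the sketch only spools entries to temporary files and reports whether meta.xml came first, and it does not parse meta.xml, sort, or join.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only; not an API of dwca-io.
final class DwcaStdinSpool {

    /** Copies each file entry of a zip stream to its own temp file, keeping archive order. */
    static Map<String, Path> spool(InputStream in, Path tmpDir) throws IOException {
        Map<String, Path> spooled = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Temp file names are generated, so entry names never become paths.
                Path target = Files.createTempFile(tmpDir, "dwca-", ".part");
                // Reads until the end of the current entry, not the end of the archive.
                Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
                spooled.put(entry.getName(), target);
            }
        }
        return spooled;
    }

    /** True when meta.xml was the first entry, i.e. the schema was known before any data file. */
    static boolean metaComesFirst(Map<String, Path> spooled) {
        return !spooled.isEmpty() && "meta.xml".equals(spooled.keySet().iterator().next());
    }

    public static void main(String[] args) throws IOException {
        Path tmpDir = Files.createTempDirectory("dwca-spool");
        Map<String, Path> spooled = spool(System.in, tmpDir);
        spooled.forEach((name, path) -> System.err.println(name + " -> " + path));
        System.err.println("meta.xml first: " + metaComesFirst(spooled));
    }
}
```

With Java 11 or later this can be tried as: cat input.dwca | java DwcaStdinSpool.java. When meta.xml is not the first entry, a streaming reader would have to spool every entry before it knows which ones can be ignored.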

jhpoelen commented 4 years ago

Perhaps worth a hacking session . . . ; )

timrobertson100 commented 4 years ago

Thanks @jhpoelen

can stream the content of the dwca into temporary files instead of keeping in memory

You may not be aware, but this is what the library already does internally when there are extensions (it uses a sort to temporary files followed by a streaming join). When no extensions exist it just streams directly from the file. Large archives will run on minimal memory and I still don't see an obvious way that this can be improved.
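
For readers unfamiliar with the pattern, here is a rough sketch of a streaming join over two inputs that are already sorted by core id. It is not the dwca-io implementation: the Row type, the join method, and the in-memory lists standing in for sorted temporary files are all assumptions made for illustration. Only the extension rows of the current core record are held in memory at any time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiConsumer;

// Illustrative sketch only; names and types are hypothetical, not dwca-io classes.
final class SortedStreamJoin {

    /** A record reduced to the core id it belongs to plus its raw line. */
    static final class Row {
        final String coreId;
        final String line;

        Row(String coreId, String line) {
            this.coreId = coreId;
            this.line = line;
        }
    }

    /**
     * Joins core rows with extension rows. Both iterators must be sorted by
     * coreId using String.compareTo ordering.
     */
    static void join(Iterator<Row> core, Iterator<Row> ext, BiConsumer<Row, List<Row>> sink) {
        Row pending = ext.hasNext() ? ext.next() : null;
        while (core.hasNext()) {
            Row c = core.next();
            // Drop extension rows that sort before this core id (no matching core record).
            while (pending != null && pending.coreId.compareTo(c.coreId) < 0) {
                pending = ext.hasNext() ? ext.next() : null;
            }
            // Collect the extension rows that belong to this core record.
            List<Row> matches = new ArrayList<>();
            while (pending != null && pending.coreId.equals(c.coreId)) {
                matches.add(pending);
                pending = ext.hasNext() ? ext.next() : null;
            }
            sink.accept(c, matches);
        }
    }

    public static void main(String[] args) {
        List<Row> core = Arrays.asList(
            new Row("a", "occurrence a"), new Row("b", "occurrence b"), new Row("c", "occurrence c"));
        List<Row> ext = Arrays.asList(
            new Row("a", "image 1"), new Row("a", "image 2"), new Row("c", "image 3"));
        join(core.iterator(), ext.iterator(),
            (c, m) -> System.out.println(c.coreId + " -> " + m.size() + " extension row(s)"));
    }
}
```

Each input is read once from start to finish, which is why the sort has to happen first and why memory use stays flat regardless of archive size.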

The main issues I'm aware of for efficient DwC-A reading are 1) the sorting and 2) the deflate compression step, as it is applied to individual files and not chunks of the files. Both of these are inherently single-threaded operations and are where the processing overhead comes from. Sorting is better in Linux environments if there is access to a native gnuSort process.

These limitations are why GBIF converts DwC-A into Avro as a first step, after which we can run processing in parallel - e.g. in Spark. I suspect this is what you're referring to, but I don't think that is an issue with this library.

MattBlissett commented 4 years ago

(Where a DWCA contains extensions, extracting and sorting the data files in parallel would give some improvement to the overall speed.)

I think we could take some inspiration from Avro tools. At least cat, getmeta, getschema, random, tojson, totext seem like the kind of things that would be useful in a CLI.

java -jar avro-tools-1.8.2.jar
Version 1.8.2
 of Apache Avro
Copyright 2010-2015 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
----------------
Available tools:
          cat  extracts samples from files
      compile  Generates Java code for the given schema.
       concat  Concatenates avro files without re-compressing.
   fragtojson  Renders a binary-encoded Avro datum as JSON.
     fromjson  Reads JSON records and writes an Avro data file.
     fromtext  Imports a text file into an avro data file.
      getmeta  Prints out the metadata of an Avro data file.
    getschema  Prints out schema of an Avro data file.
          idl  Generates a JSON schema from an Avro IDL file
 idl2schemata  Extract JSON schemata of the types from an Avro IDL file
       induce  Induce schema/protocol from Java class/interface via reflection.
   jsontofrag  Renders a JSON-encoded Avro datum as binary.
       random  Creates a file with randomly generated instances of a schema.
      recodec  Alters the codec of a data file.
       repair  Recovers data from a corrupt Avro Data file
  rpcprotocol  Output the protocol of a RPC service
   rpcreceive  Opens an RPC Server and listens for one message.
      rpcsend  Sends a single RPC message.
       tether  Run a tethered mapreduce job.
       tojson  Dumps an Avro data file as JSON, record per line or pretty.
       totext  Converts an Avro data file to a text file.
     totrevni  Converts an Avro data file to a Trevni file.
  trevni_meta  Dumps a Trevni file's metadata as JSON.
trevni_random  Create a Trevni file filled with random instances of a schema.
trevni_tojson  Dumps a Trevni file as JSON.

jhpoelen commented 4 years ago

@MattBlissett @timrobertson100 Am pretty excited about all this. I think having a Swiss Army knife for dwca would be neat. Sort of like the ffmpeg for dwc.

Re: streaming vs. files - I am somewhat aware of the internals of the current dwca-io - like you say, it does a lot: expanding files, sorting, schema interpretation with some special magic (merging synonymous terms), and merging the various related files to populate a data model.

Re: avro tools inspiration - yes, this looks great. I think there are a lot of similarities between avro and dwca, in the way that both formats contain structured data.

Here's some rough ideas -

| dwca-cli | description | avro equivalent |
| --- | --- | --- |
| schema | prints meta.xml in some format | getschema |
| meta | prints eml.xml if available in some format | getmeta |
| occurrences | print occurrences if available | n/a |
| taxa | print checklist if available in some format | n/a |
| media | print media if available in some format | n/a |
| data | print entire populated schema in some format | n/a |
| ... | ... | ... |

You can stream dwca occurrences provided (1) the meta.xml occurs before the data files or (2) the schema is provided up front via cat archive.zip | dwca getschema.

Supported formats: tsv, json, avro ?

jhpoelen commented 2 years ago

Note that I ended up implementing my own cli for handling dwca ; ( I guess sometimes you just have to roll your own to fit a specific use case. With the cmd, you can now stream all of gbif using a command like the one described in https://github.com/bio-guoda/preston/issues/148 . Holler if you change your mind on building a cli, happy to give it a spin.

timrobertson100 commented 2 years ago

Thanks @jhpoelen. If your implementation is something that could be reused and you think it would be useful to others to have in this library, then PRs are always welcome.

jhpoelen commented 2 years ago

@timrobertson100 for sure.

For now, the dwca streaming functionality is part of the preston cli. A small part of your dwca-io library is re-used (e.g., reading meta.xml / record iterator). Works great! Would be neat to have a small focused library that only does that. Now, all these other dependencies are pulled in.

Is dwc-io still the library you use in the gbif infrastructure?

timrobertson100 commented 2 years ago

Is dwc-io still the library you use in the gbif infrastructure?

Yes, it is e.g. here. We turn everything to Avro in the first stage of processing in GBIF though.

Unrelated to this issue, but for background info - we're exploring Frictionless Data as a replacement for the limited DwC-A format.

jhpoelen commented 2 years ago

@timrobertson100 thanks for pointing out that you are no longer using the https://github.com/gbif/dwc-io module, but are using a copied version of it embedded in the "core" module of the https://github.com/gbif/pipelines instead.

Thanks for pointing out the "frictionless" data experiments. Does that mean that GBIF is departing from DwC? How are you planning to transition? Would I be correct to assume that this "frictionless" data experiment is related to the big splash you made earlier this month re: https://discourse.gbif.org/t/use-case-biotic-interactions-sottunga-island-melitaea-cinxia-population-study/3312 and other use cases?

jhpoelen commented 2 years ago

Also, just wondering - why didn't you refactor and re-use the https://github.com/gbif/dwca-io module in the pipelines project? And, how are you planning to keep them in sync?

timrobertson100 commented 2 years ago

I think you've misread that - pipelines use dwc-io so consistency is not an issue.

GBIF is not departing from Darwin Core as it's the core standard for much of what GBIF does. We're exploring richer data exchange formats as part of the work to diversify the data model and frictionless looks like a reasonable packaging format to explore.

muttcg commented 2 years ago

We also have a simple DWCA->AVRO CLI implementation here: https://github.com/gbif/pipelines/blob/dev/tools/archives-converters/src/main/java/org/gbif/converters/DwcaToAvroConverter.java#L28

jhpoelen commented 2 years ago

@timrobertson100 I am glad I misread that and thanks for clarifying. Great to see you are re-using existing libraries. I was confused by what I thought were cloned implementations of the DwcaReader classes. Am still hoping I can convince you and your colleagues to publish to maven central to make it easier to discover and use your valuable libraries (#55). Free coffee and cookies? An ice cream? What will it take?

And great to hear that you are extending support for additional schemas beyond dwc. I've very much enjoyed using W3C CSV for many years now. I am curious to see how you'll end up managing the schemas (e.g., versioning) and data (e.g., provenance).