LibreCat / Catmandu

Catmandu - a data processing toolkit
https://librecat.org

discuss design of importers #100

Closed vpeil closed 10 years ago

vpeil commented 10 years ago

While updating and creating some importers I ran into a problem. For example, Catmandu::Importer::Inspire returns exactly one item, which is the whole XML structure:

<collection>
  <records>
    <record>{data of one bibliographic record no.1}</record>
    <record>{data of one bibliographic record no.2}</record>
    <record>{data of one bibliographic record no.3}</record>
  </records>
</collection>

In this case, $importer->count is 1, but I want it to be 3, of course. Fixes also become more complicated if every path has to start with "collection". The importers ArXiv, CrossRef, EuropePMC... have the same problem.

What should the sub generator {} look like?

Any suggestions?

nichtich commented 10 years ago

What if the generator returns an array reference? I would expect this to be interpreted as a list of multiple items.

phochste commented 10 years ago

Well, same as for Stores that return a result set from a database? You construct a stack of records in memory and return them one by one:

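 sub {
     # needs "use feature 'state'"; an array ref is used because a state
     # array can only be list-initialized from Perl 5.28 on
     state $stack = [ parseResults() ];
     shift @$stack;    # shift rather than pop, so records keep their input order
 }

phochste commented 10 years ago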

Or you can use a pull XML parser and use its state to fetch the next record.
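A minimal sketch of that lazy variant, using core Perl only. The helper name record_generator is hypothetical, and the regex scanner is only a stand-in for a real pull parser: it would not cope with nested records, attributes or CDATA. The point is the shape of the closure, which keeps a cursor into the input and returns one record per call, then undef once the input is exhausted.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a generator closure over an XML string.
# A /g match in scalar context resumes at pos($xml), so the scalar's
# match position plays the role of the pull parser's internal state.
sub record_generator {
    my ($xml) = @_;
    return sub {
        # /c keeps the position after a failed match, so the generator
        # stays exhausted instead of starting over from the beginning
        if ($xml =~ m{<record>(.*?)</record>}gcs) {
            return { record => $1 };
        }
        return;    # undef in scalar context: no more records
    };
}

my $xml = <<'XML';
<collection>
  <records>
    <record>one</record>
    <record>two</record>
    <record>three</record>
  </records>
</collection>
XML

my $next = record_generator($xml);
my @seen;
while (defined(my $item = $next->())) {
    push @seen, $item->{record};
}
print join(',', @seen), "\n";    # prints "one,two,three"
```

Nothing is parsed until the first call, and only one record is held in memory at a time, which is the main difference from the stack approach above.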

nichtich commented 10 years ago

Please have a look at Catmandu::XML instead of implementing XML parsing in each new importer. Catmandu::XML uses a pull parser and already supports cutting one XML stream into multiple records:

catmandu convert XML --path record < collection.xml

{"record":"{data of one bibliographic record no.1}"}
{"record":"{data of one bibliographic record no.2}"}
{"record":"{data of one bibliographic record no.3}"}

You can use a Catmandu::Importer::XML that is fed each collection and returns the records one by one. You could also use XML::Struct directly, as Catmandu::Importer::XML is just a thin layer on top of it, but with Catmandu::Importer::XML you get new features of Catmandu::XML, such as optional XSLT processing, for free.

vpeil commented 10 years ago

Thank you, guys. This works fine: I use Catmandu::Importer::XML as suggested by @nichtich, then call ->to_array and use

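 sub {
     # needs "use feature 'state'"; an array ref is used because a state
     # array can only be list-initialized from Perl 5.28 on
     state $stack = [ parseResults() ];
     shift @$stack;    # shift rather than pop, so records keep their input order
 }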

as suggested by @phochste. See for example: https://github.com/LibreCat/Catmandu-Inspire/commit/69385213a673537906775a1df811dcce6bf72c86
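For readers who want to try the buffered pattern in isolation, here is a rough, self-contained sketch. It is not the code from the linked commit. parse_results is a hypothetical stand-in for the real parsing step (for instance the array produced by ->to_array), and a closure over a lexical array is used in place of a state variable so that no feature pragma is needed.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real parsing step; it only fabricates
# three placeholder records so that the sketch runs on its own.
sub parse_results {
    return map { +{ record => "record no.$_" } } 1 .. 3;
}

# Parse everything once, then hand out one record per call.
sub generator {
    my @buffer = parse_results();
    # shift keeps the input order and returns undef once the buffer is empty
    return sub { shift @buffer };
}

my $next  = generator();
my $count = 0;
while (defined(my $item = $next->())) {
    $count++;
    print "$item->{record}\n";
}
print "count: $count\n";    # prints "count: 3"
```

The trade-off is memory: the whole result set is held in the buffer, which is fine for one page of API results but less so for very large files.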

I'll update the other importers in the same manner soon.