WingLongitude / lontra-harvester

Lontra is a tool used as a Harvester to ingest biodiversity data
MIT License

Ability to automatically skip duplicates during harvest #25

Open dshorthouse opened 9 years ago

dshorthouse commented 9 years ago

I suggest building on the existing cli option for using a skip list: the harvester could produce a skip list and then use it on-the-fly.

cgendreau commented 9 years ago

The harvester cannot detect duplicates in a scalable way, as other tools (e.g. dwca-validator) can. The current errors are thrown by PostgreSQL, but INSERTs are run in batches, so the whole transaction is dropped when an error occurs. In other words, you cannot simply keep a reference to everything that was inserted, since that won't scale, and you can't rely on the PostgreSQL error, since you still want to use batch INSERTs. The only scalable and robust solution is for the cli to run the dwca-validator internally (with the proper parameters) to generate a JSON object, and then use that JSON object as an exclusion list.
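
To make the exclusion-list idea concrete, here is a rough sketch in Python (not the harvester's own code). It assumes the validator's output can be reduced to a JSON array of record IDs, which may not match what dwca-validator really emits. The names `load_exclusions` and `batches_without_excluded` and the `"id"` field are hypothetical. Every record whose ID is listed gets skipped, so whether one copy of a duplicate is kept depends on what the list contains.

```python
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Set


def load_exclusions(json_text: str) -> Set[str]:
    """Parse a JSON array of record IDs (assumed format) into a set."""
    return set(json.loads(json_text))


def batches_without_excluded(
    records: Iterable[Dict[str, str]],
    excluded: Set[str],
    batch_size: int,
) -> Iterator[List[Dict[str, str]]]:
    """Yield batches of records, skipping any whose "id" is excluded.

    Filtering happens before batching, so a batch INSERT never sees a
    record that is known to be a duplicate.
    """
    kept = (r for r in records if r["id"] not in excluded)
    while True:
        batch = list(islice(kept, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch could then go to a single batch INSERT. Records are streamed, so memory use is bounded by the size of the exclusion list plus one batch, not by the size of the archive.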