Open dshorthouse opened 9 years ago
The harvester cannot detect duplicates in a scalable way, as other tools (e.g. dwca-validator) can. Currently the errors are thrown by PostgreSQL, but INSERTs are run in batches, so when one is thrown the whole transaction is dropped. In other words, you can't simply keep a reference to everything that was inserted, since that won't scale, and you can't rely on the PostgreSQL error, since you still want to use batch INSERTs. The only scalable and robust solution I see is for the cli to run dwca-validator internally (with the proper parameters) to generate a JSON object and then use that JSON object as an exclusion list.
I suggest building on the existing cli option for a skip list: produce the skip list on the fly and then use it within the same run.
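
To make the idea concrete, here is a rough sketch in Python of how a skip list could be applied before records are batched for INSERT. The report shape (`{"duplicates": [...]}`), the `id` field, and both function names are assumptions made for illustration; they are not the actual dwca-validator output or the harvester's API.

```python
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Set


def load_skip_list(report_json: str) -> Set[str]:
    """Read the IDs to exclude from a validator report.

    Assumes a hypothetical report of the form {"duplicates": ["id1", ...]};
    the real dwca-validator output may be structured differently.
    """
    report = json.loads(report_json)
    return set(report.get("duplicates", []))


def batches_without_skipped(
    records: Iterable[Dict[str, str]],
    skip_ids: Set[str],
    batch_size: int,
) -> Iterator[List[Dict[str, str]]]:
    """Yield batches of records, leaving out any record whose id is skipped.

    Filtering happens before batching, so no batch handed to the database
    contains an ID from the skip list.
    """
    kept = (r for r in records if r["id"] not in skip_ids)
    while True:
        batch = list(islice(kept, batch_size))
        if not batch:
            return
        yield batch
```

Only the skip list is held in memory here, not the set of everything already inserted, and each yielded batch could still go to PostgreSQL as a single batch INSERT. Note that this sketch drops every record whose ID is listed; whether one occurrence should be kept is a separate decision.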