NHMDenmark / Mass-Digitizer

Common repo for the DaSSCo team
Apache License 2.0

Adding source name to the Specify Collection object table #461

Closed jlegind closed 1 month ago

jlegind commented 6 months ago

What is the issue ?

It would help in debugging and 'housekeeping' if the imported Digi app records had their original source file name attached in a separate field:

Source "NHMD_PinnedInsects_20231121_16_16_SS_original.csv"

Detailed description of the issue.

If there is a discrepancy between imported records in Specify and what is in the 4.Archive directory, then having the source path would be a massive help.

Why is it needed/relevant ?

We gain a certain amount of future proofing in that it addresses issues like the one above and anticipates unforeseen problems.

Give scenario(s) of why and when this could be relevant.

If a curator discovers something in Specify that is a little off the mark, we can go all the way back to the source to investigate. We have already agreed that the post-processing GREL scripts should have their own version as they evolve with business needs. Adding a source field ties neatly into this, as it makes forensics much easier.

Estimate level of effort required.

easy

What could be the challenges ?

There does not seem to be a way to automatically add the file name to a column in OpenRefine. That means it has to be added manually in the OpenRefine interface, which is a trivial task.

What documentation is required?

The documentation file "import_protocol_postProcessing.md" will need to be updated.

PipBrewer commented 6 months ago

It would be good if we could see this in Specify

FedorSteeman commented 5 months ago

After discussing this with @bhsi-snm: Not sure how easy it is to do in OpenRefine. Perhaps @jlegind can conjure up a little utility program for adding this source file column? Then I'll repurpose a field in Specify to map it to. Or perhaps we should investigate an OpenRefine option?

FedorSteeman commented 5 months ago

As far as I can see, there isn't really a way for GREL to get the file name, or rather the OpenRefine project name. The only way I can think of is adding it manually. I would also recommend treating this as a tabular remark field (cf. #444) so we don't occupy any customizable text fields with it.

jlegind commented 4 months ago

We already have a remarks field; the new column might be 'remark_source', which could hold a value such as: NHMD_PinnedInsects_20240119_15_40_RL_original.csv

jlegind commented 4 months ago

Question: Should remark_date be the date the export was made, or the date it was post-processed?
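For illustration only: if the export date were chosen, it might be recoverable from the file name itself. The sketch below assumes the names follow the pattern seen in this thread (institution, collection, `YYYYMMDD`, then what looks like hour and minute, initials, and `_original.csv`). Reading `16_16` as hour and minute is an assumption, and `export_datetime_from_filename` is a hypothetical helper, not part of the existing code.

```python
import re
from datetime import datetime

# Assumed naming pattern, inferred from the examples in this thread:
#   <institution>_<collection>_<YYYYMMDD>_<HH>_<MM>_<initials>_original.csv
FILENAME_PATTERN = re.compile(
    r"^(?P<institution>[^_]+)_(?P<collection>[^_]+)_"
    r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})_"
    r"(?P<hour>\d{2})_(?P<minute>\d{2})_"
    r"(?P<initials>[^_]+)_original\.csv$"
)


def export_datetime_from_filename(filename):
    """Return the export timestamp encoded in the file name, or None.

    Hypothetical helper: returns None when the name does not match the
    assumed pattern or the digits do not form a valid date and time.
    """
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    try:
        return datetime(
            int(match["year"]),
            int(match["month"]),
            int(match["day"]),
            int(match["hour"]),
            int(match["minute"]),
        )
    except ValueError:
        # e.g. month 13 or hour 25
        return None
```

If the post-processing date were preferred instead, the value would simply be the date on which the script runs, and no parsing would be needed.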

FedorSteeman commented 4 months ago

As a result of the implementation of #444 we already have a column "remark source", so I suggest you choose another name. As you can see, for tabular remarks, we need three columns:

For the specimen level remarks field, these fields are just prefixed "remark", so you get "remark source" and "remark date".

Actually, using the term "source" for the filename of the data is confusing here; maybe it's better to use "datafile".

So that means the following column names;

@bhsi-snm Do you approve of this proposal?

jlegind commented 4 months ago

Since we have code ready for monitoring a directory, I could extend it to add "datafile_source" and "datafile_date" to the CSV export. This circumvents OpenRefine.
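A minimal sketch of what such an extension could look like, assuming the export is a plain UTF-8 CSV with a header row. The function name `add_datafile_columns` and its parameters are hypothetical; this is not the actual monitoring script, only an illustration of appending the two columns outside OpenRefine.

```python
import csv
from datetime import date
from pathlib import Path


def add_datafile_columns(src, dest, datafile_date=None):
    """Copy the CSV at `src` to `dest`, appending two columns to every row.

    Hypothetical helper. "datafile_source" gets the name of the source
    file, and "datafile_date" gets `datafile_date` (today's date in ISO
    format when not given). Returns the number of data rows written.
    """
    src = Path(src)
    stamp = datafile_date or date.today().isoformat()

    with src.open(newline="", encoding="utf-8") as infile, \
            Path(dest).open("w", newline="", encoding="utf-8") as outfile:
        reader = csv.DictReader(infile)
        fieldnames = list(reader.fieldnames or [])
        # Append the new columns only if the export does not already have them.
        for name in ("datafile_source", "datafile_date"):
            if name not in fieldnames:
                fieldnames.append(name)

        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()

        count = 0
        for row in reader:
            row["datafile_source"] = src.name
            row["datafile_date"] = stamp
            writer.writerow(row)
            count += 1
    return count
```

Called as `add_datafile_columns("NHMD_PinnedInsects_20240119_15_40_RL_original.csv", "processed.csv")`, every row in the output would carry the original file name. The output file name here is made up, and the source is left untouched so the archived copy stays as exported.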

jlegind commented 3 months ago

The "datafile_source", "datafile_date" and "datafile_remarks" columns for the tabular remarks have been added through the monitoring script.

jlegind commented 3 months ago

See issue #492 on conditionally adding values in the remarks columns.

AstridBVW commented 1 month ago

The monitoring script was not entirely implemented before Jan left, so adding these columns has been made part of the post-processing GREL script instead (ticket #506).