NHMDenmark / Mass-Digitizer

Common repo for the DaSSCo team
Apache License 2.0

Review and refinement of export folder structure #493

Closed AstridBVW closed 2 weeks ago

AstridBVW commented 3 months ago

What is the issue?

The current folder structure for export files (import protocol) on the N-drive needs to be reviewed and refined.

Description

The current folder structure is hard to understand and navigate, and the documentation for the import protocol is hard to follow. The folder structure needs to be more intuitive. It could also be beneficial to include some automated processes to reduce the human errors that occur when multiple people work within the same directory.

The folder structure can be found here: N:\SCI-SNM-DigitalCollections\DaSSCo\DigiApp\Data

The current folder structure looks like this:

[Screenshot of the current folder structure, 2024-04-03]

Why is it needed/relevant?

The scale of DaSSCo is steadily growing, and the processes/workflows are getting more complex. Also, more people are becoming involved in the processing of DaSSCo data. We need the folder structure/import protocol to accommodate this and not be a cause of issues/confusion.

Estimate level of effort required.

Hard

What is the expected acceptable result?

An export folder structure that suits DaSSCo as well as possible and is easy to adapt to its future needs.

What could be the challenges?

We will need to figure out how best to implement automated processes as part of the folder structure, and whether they are even beneficial at this stage.

What tests are required?

The flow of the new folder structure needs to be tested to see if it is intuitive enough/easy enough to navigate. Any automated processes need to be tested thoroughly.

What documentation is required?

The documentation file "import_protocol_postProcessing.md" will need to be updated.

Associated issues

Closed issues

454 Pre-processed exports from the app are not currently being saved

349 Naming convention for digi app export files

295 Write protocol for import and data validation

489 Ensuring correct file names on Digi app exports

461 Adding source name to the Specify Collection object table

492 Tabular remarks - condition for setting the additional columns

AstridBVW commented 3 months ago

20240321, an initial mock-up of a new export folder structure was made and sent to @PipBrewer for feedback.

Mock-up version1: Mock-up folder structure v.1.zip

20240402, a new version of the mock-up has been made based on feedback from Pip.

Mock-up version2: Mock-up folder structure v.2.zip

The focus of the new folder structure is to make sure that files from each step of the process are saved to the archive (original, checked/corrected, processed, and the completely processed file imported to Specify). It is suggested to incorporate some automated processes through monitoring scripts to make sure copies are indeed saved. The first version of the mock-up focused on processing data from NHMD; the second version is adapted to fit data from more than one institution. Like the other folders, the archive in the new folder structure is divided into subfolders for each institution, but each institution subfolder is further divided into subfolders for each version of the DigiApp etc. Export files saved to the archive go into the appropriate institution and version subfolders.
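As a rough illustration of what saving a copy to the right institution and version subfolder could look like, here is a minimal Python sketch. It is not the monitoring script itself (none has been written yet); the function name `archive_copy`, the `Archive/<institution>/<app_version>/` layout, and the example folder names are assumptions for illustration only.

```python
import shutil
from pathlib import Path


def archive_copy(export_file, archive_root, institution, app_version):
    """Copy an export file into <archive_root>/<institution>/<app_version>/.

    The folder layout is an assumption based on the mock-up description,
    not the actual structure on the N-drive.
    """
    target_dir = Path(archive_root) / institution / app_version
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(export_file).name
    # Refuse to overwrite: an archived copy should never be silently replaced.
    if target.exists():
        raise FileExistsError(f"{target} is already archived")
    shutil.copy2(export_file, target)  # copy2 also preserves timestamps
    return target
```

Raising an error on an existing file, rather than overwriting it, is a deliberate choice in this sketch: the archive is meant to keep the file from each step as it was at that step.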

The files in the current folders will be looked through and sorted before implementation of any new folder structure.

PipBrewer commented 2 months ago

@AstridBVW This looks good to me. One question: will it be possible to determine which processing stage the files in the archive were at by looking at the file names?

AstridBVW commented 2 months ago

@PipBrewer Yes, based on the suffix being "original", "checked", "processed", "forSpecify", or "imported".
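A minimal sketch of how the stage could be read from a file name in Python, assuming the stage keyword appears as one of the underscore-separated parts of the name (as in `..._ABW_checked.csv` or `..._MJG_checked_corrected.csv`). The function name `processing_stage` is hypothetical and not part of the DigiApp.

```python
from pathlib import Path

# Stage suffixes as listed above, in processing order.
STAGES = ["original", "checked", "processed", "forSpecify", "imported"]


def processing_stage(filename):
    """Return the most advanced stage keyword in the file name, or None."""
    tokens = Path(filename).stem.split("_")
    # STAGES is ordered, so the last match is the furthest stage reached.
    found = [stage for stage in STAGES if stage in tokens]
    return found[-1] if found else None
```

Extra parts such as `corrected` are ignored, so `..._checked_corrected.csv` is still reported as `checked`.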

AstridBVW commented 2 weeks ago

The new folder structure was implemented on 20240422. The files in the old folders were looked through and sorted. One folder was left over, containing files that needed to be checked further ("Left_over_mess"). The files in this folder have now been checked, and the folder has been deleted.

Part of the new folder structure is a folder for a monitoring script. The monitoring script is not currently implemented. The folder remains, and we are working around it for the time being (i.e. we are not using it).

During the checking of the "Left_over_mess" folder, a file was discovered that had not been imported (NHMD_PinnedInsects_20240129_14_36_ABW_checked.csv). A copy of the file was found in the Archive but nowhere else, and it had not been post-processed. After a closer look, it seemed that there might be other older files in the Archive folder that also had not been imported/post-processed. After further investigation, it was discovered that these three export files had also not been post-processed or imported:

NHMD_PinnedInsects_20240126_15_30_MJG_checked_corrected.csv
NHMD_PinnedInsects_20240126_16_30_SS_checked.csv
NHMD_PinnedInsects_20240129_15_54_LG_checked_corrected.csv

They have now been moved to the "ReadyForOpenRefine" folder to be post-processed.
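A check like the one done by hand here could in principle be scripted. The Python sketch below lists exports that have no file with the "imported" suffix. It rests on an assumption that has not been verified against the real archive: that the part of the file name before the stage suffix stays the same across all stages of one export. The names `split_name` and `not_imported` are made up for this example.

```python
from collections import defaultdict
from pathlib import Path

STAGES = ("original", "checked", "processed", "forSpecify", "imported")


def split_name(filename):
    """Split a file name into (export key, set of stage keywords found)."""
    tokens = Path(filename).stem.split("_")
    for i, token in enumerate(tokens):
        if token in STAGES:
            # Everything before the first stage keyword identifies the export.
            return "_".join(tokens[:i]), {t for t in tokens[i:] if t in STAGES}
    return "_".join(tokens), set()


def not_imported(filenames):
    """Return the export keys for which no 'imported' file exists."""
    stages_by_export = defaultdict(set)
    for name in filenames:
        key, stages = split_name(name)
        stages_by_export[key] |= stages
    return sorted(k for k, s in stages_by_export.items() if "imported" not in s)
```

The input could be, for example, the file names collected from the archive with `Path.rglob("*.csv")`; any export key returned would still need a manual look before being moved for post-processing.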

AstridBVW commented 2 weeks ago

The documentation for the import protocol has now been updated: https://github.com/NHMDenmark/Mass-Digitizer/blob/main/documentation/import_protocol_postProcessing.md.