AuScope / ckan-docker

Scripts and images to run CKAN using Docker Compose

Bulk uploads validation - 2nd iteration #265

Closed kitchenprinzessin3880 closed 2 weeks ago

kitchenprinzessin3880 commented 1 month ago

(1) Could the file be validated as a whole and all error messages (if any) displayed at once?

(2) Validation

  1. The value specified in the parent_sample column must exist either in the sample_number column of the uploaded spreadsheet or in the selected sample collection
  2. Ensure these fields follow URL format: author_identifier
  3. depth_from (if specified) must be less than depth_to
  4. Acquisition dates
    • acquisition_start_date <= acquisition_end_date
    • If the parent sample is specified, validate the dates against the parent sample's dates, e.g. the start date of the sample must be the same as or later than the start date of its parent sample (the parent sample should exist before the sample)
  5. elevation must be a number
  6. Check metadata across sheets
    • author emails in the 'sample' sheet must match the email in the 'authors' sheet
    • related_resources_urls in the 'sample' sheet must match the related_resource_url in the 'related_resources' sheet
    • project_ids in the 'sample' sheet must match the project_identifier in the 'funding' sheet
  7. Ensure all vocabulary terms specified exist, e.g. sample_type = cuttings (this value does not exist in the spreadsheet)
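As an illustration of points (1) and (2), here is a minimal sketch of a validator that checks every row and returns all error messages together instead of stopping at the first one. It is not the project's implementation: the function names, the representation of rows as plain dicts, and the assumption that dates arrive as ISO strings (YYYY-MM-DD) are all hypothetical. The parent-date check and the vocabulary check are left out to keep it short.

```python
from datetime import date
from urllib.parse import urlparse


def present(value):
    """True when a cell holds a value (0 counts as a value)."""
    return value is not None and value != ""


def to_number(value):
    """Return the cell as a float, or None if it is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def to_date(value):
    """Return the cell as a date (ISO format assumed), or None."""
    try:
        return date.fromisoformat(str(value))
    except ValueError:
        return None


def is_url(value):
    """True when the cell parses as an http(s) URL with a host."""
    try:
        parts = urlparse(str(value))
    except ValueError:
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def validate_samples(rows, collection_sample_numbers=()):
    """Check every row and return all error messages at once."""
    errors = []
    known = {r.get("sample_number") for r in rows} | set(collection_sample_numbers)
    for i, row in enumerate(rows):  # i is the zero-based row number
        parent = row.get("parent_sample")
        if present(parent) and parent not in known:
            errors.append(f"Row {i}: parent_sample '{parent}' not found")

        ident = row.get("author_identifier")
        if present(ident) and not is_url(ident):
            errors.append(f"Row {i}: author_identifier is not a URL")

        elevation = row.get("elevation")
        if present(elevation) and to_number(elevation) is None:
            errors.append(f"Row {i}: elevation must be a number")

        d_from, d_to = row.get("depth_from"), row.get("depth_to")
        if present(d_from) and present(d_to):
            low, high = to_number(d_from), to_number(d_to)
            if low is None or high is None:
                errors.append(f"Row {i}: depth values must be numbers")
            elif not low < high:
                errors.append(f"Row {i}: depth_from must be less than depth_to")

        start = row.get("acquisition_start_date")
        end = row.get("acquisition_end_date")
        if present(start) and present(end):
            s, e = to_date(start), to_date(end)
            if s is None or e is None:
                errors.append(f"Row {i}: acquisition dates must be valid dates")
            elif s > e:
                errors.append(f"Row {i}: acquisition_start_date is after acquisition_end_date")
    return errors
```

Because the function only appends to a list, the caller can display the full list in one response, which is what request (1) asks for.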

This is an example of a file I used for the upload (it contains errors related to item 6 above): auscope-sample-template-v3-sample.xlsx. I can preview it, but when I upload it I receive the error below. The message is not useful because I don't know what led to the error.

Screenshot 2024-07-16 at 6 32 20 PM

NTaherifar commented 1 month ago

@kitchenprinzessin3880 @laughing0li

I have checked the validation part before the PR and encountered some scenarios that cause issues in submitting the samples. I wanted to report these to you so you can decide if additional validation is needed:

1- Sample Name Uniqueness: Check if the sample name (the combination) already exists. If it does, add an error indicating the name should be unique.

2- Depth Validation: Depth should be a valid number. Currently, it checks if "depth from" is less than "depth to," but it doesn't validate the numbers themselves or show the user if the values are invalid.

3- Date Validation: Similar to depth, the module checks if the start date is before the end date but does not show the user if the date format is incorrect or invalid.

4- Author Email Validation: If an author's email is misspelled, the app cannot find the author and does not include them. We need to show a message to the user that the email is not found in the author sheet. I checked an example where I misspelled the emails, and the app could not find the authors and even allowed sample creation with no authors. This needs a check because, in metadata, authors are mandatory and cannot be empty. The UI works fine and triggers errors, but for batch cases, I could create a sample without authors.

validation test.xlsx

Image

5- Funder Validation: The same issue as with authors.

6- Related Resource Validation: The same issue as with authors.

I have attached an Excel file showing an example where the author's email is incorrect but I was still able to submit the samples.
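The author, funder, and related-resource cases share one pattern: a value in the 'sample' sheet must match an entry in another sheet, and a miss should be reported instead of silently dropped. A minimal sketch of that check follows. The function names, the `;` separator for multi-valued cells, and the `author_emails` / `author_email` column names in the usage are assumptions for illustration, not the module's actual code.

```python
def split_values(cell, sep=";"):
    """Split a multi-valued cell into stripped, non-empty parts."""
    if cell is None:
        return []
    return [part.strip() for part in str(cell).split(sep) if part.strip()]


def check_references(sample_rows, column, known_values, sheet, required=False):
    """Report every value in `column` that is missing from `known_values`.

    `sheet` is only used in the message text. With required=True an
    empty cell is reported too, since the field is mandatory.
    """
    known = {str(value).strip() for value in known_values}
    errors = []
    for i, row in enumerate(sample_rows):  # i is the zero-based row number
        values = split_values(row.get(column))
        if required and not values:
            errors.append(f"Row {i}: {column} is required")
        for value in values:
            if value not in known:
                errors.append(f"Row {i}: '{value}' not found in the '{sheet}' sheet")
    return errors


# Hypothetical usage: column names are assumed, not taken from the template.
samples = [{"author_emails": "a@example.org; typo@example.org"}]
authors = [{"author_email": "a@example.org"}]
messages = check_references(
    samples,
    "author_emails",
    [a["author_email"] for a in authors],
    sheet="authors",
    required=True,
)
```

The same function could be called with project_ids against the 'funding' sheet and related_resources_urls against the 'related_resources' sheet, so a misspelled email no longer produces a sample with no authors.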

kitchenprinzessin3880 commented 3 weeks ago

test file - auscope-sample-template-v3-ex.xlsx

Screenshot 2024-08-20 at 11 00 41 AM

kitchenprinzessin3880 commented 3 weeks ago

@laughing0li

kitchenprinzessin3880 commented 3 weeks ago

I am getting a 502 error too when uploading a correctly formatted file.

Screenshot 2024-08-20 at 11 39 41 AM

laughing0li commented 3 weeks ago

Yes, I am investigating it.

NTaherifar commented 3 weeks ago

The 502 Bad Gateway issue has been fixed by adding a default initialization for the 'parent' field. This change ensures that the batch upload process handles missing parent fields, preventing errors during package creation. Optional fields used in validation must be included in the dictionary, even with None values.

Moving forward, we need to be careful about initializing fields that are not defined in the Excel file but are present in our metadata.
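A minimal sketch of the kind of default initialization described above, assuming each sample is built as a plain dict before package creation. The `OPTIONAL_DEFAULTS` mapping and the `with_defaults` helper are hypothetical names; only the 'parent' field comes from this thread.

```python
# Optional metadata fields that are not columns in the Excel file
# but are still expected to be present in the dictionary.
OPTIONAL_DEFAULTS = {"parent": None}


def with_defaults(sample, defaults=OPTIONAL_DEFAULTS):
    """Return a copy of `sample` with missing optional fields set to defaults.

    Values already present in `sample` take precedence.
    """
    return {**defaults, **sample}
```

Keeping the defaults in one mapping means a new optional field only needs to be added in one place.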

kitchenprinzessin3880 commented 3 weeks ago

> Moving forward, we need to be careful about initializing fields that are not defined in the Excel file but are present in our metadata.

@NTaherifar the parent field is included in the metadata schema. Can you please elaborate on the issue?

NTaherifar commented 3 weeks ago

@kitchenprinzessin3880

We have two fields for the parent sample: parent, which stores the package_id, and parent_sample, used in the Excel file (which can be the sample name or DOI). In the batch process, we first read the Excel file and create all samples without parent relationships. Afterward, we update the packages to establish these relationships.

We can't do this in one step because the parent might reference a sample in the Excel list that hasn't been created yet, so its package_id isn't available.

Since we use parent_sample during the first step of the batch creation process, the parent field wasn't included in the dictionary. However, CKAN expects the parent field to be provided, even if it's set to None.
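The two-step process can be sketched as follows. This is an illustration only: `create_package` and the in-memory `store` are hypothetical stand-ins for whatever CKAN calls the batch module actually makes, and the sketch assumes parent_sample refers to a sample_number within the same spreadsheet (references to existing samples in the collection, or by DOI, are not handled).

```python
import uuid


def create_package(store, data):
    """Hypothetical stand-in for package creation; returns the new id."""
    package_id = str(uuid.uuid4())
    store[package_id] = dict(data)
    return package_id


def batch_create(rows, store):
    """Create all samples first, then link each one to its parent."""
    ids_by_sample = {}

    # Pass 1: create every sample with 'parent' explicitly set to None,
    # because the parent's package id may not exist yet.
    for row in rows:
        data = {k: v for k, v in row.items() if k != "parent_sample"}
        data["parent"] = None
        ids_by_sample[row["sample_number"]] = create_package(store, data)

    # Pass 2: every id is now known, so resolve parent_sample to an id.
    for row in rows:
        parent_ref = row.get("parent_sample")
        parent_id = ids_by_sample.get(parent_ref)
        if parent_id is not None:
            store[ids_by_sample[row["sample_number"]]]["parent"] = parent_id

    return ids_by_sample
```

With this ordering a child row can appear above its parent row in the spreadsheet and still be linked correctly.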

kitchenprinzessin3880 commented 3 weeks ago

@NTaherifar that clarifies the issue, thank you. I can now upload the sample metadata.

Screenshot 2024-08-20 at 3 43 56 PM

NTaherifar commented 2 weeks ago

@kitchenprinzessin3880

Both the production and development versions have been deployed and are ready for review.

kitchenprinzessin3880 commented 2 weeks ago

@laughing0li the validation is now complete. Can you add the following note after the sentence "The following errors...": "Note: The row number starts from 0."

laughing0li commented 2 weeks ago

Test Passed