ebi-ait / checklist

Template repository for checklists
Apache License 2.0

Issue observed while testing sample validation in BioSamples, Webin-REST tests #76

Closed amnonkhen closed 4 weeks ago

amnonkhen commented 1 month ago

Opened by @dgupta on 21/8. As part of the preparation to deploy Webin REST in dev, which will connect to BioSamples in dev, the following issue has been observed:

Sample metadata:

```text
TITLE: FTP2_ITS
TAXON_ID: 410658
SCIENTIFIC_NAME: soil metagenome
DESCRIPTION: Fort Tryon Park 2. ITS
investigation type: metagenome
project name: nyc_parks_medians
sequencing method: Illumina
collection date: 2013-05-31
environmental package: soil
geographic location (latitude): 40.86622 DD
geographic location (longitude): -73.93168 DD
geographic location (country and/or sea): USA
geographic location (depth): 10.00 m
soil environmental package: soil
environment (biome): ENVO:urban biome
environment (feature): soil
environment (material): soil
depth: 0.1 m
geographic location (elevation): 1 m
broad-scale environmental context: soil
environmental medium: soil
ENA-CHECKLIST: ERC000022
```

Sample contains:

<SAMPLE_ATTRIBUTE>
    <TAG>geographic location (elevation)</TAG>
    <VALUE>1</VALUE>
    <UNITS>m</UNITS>
</SAMPLE_ATTRIBUTE>

elevation and geographic location (elevation) are synonyms; see the checklist:

<FIELD>
          <LABEL>elevation</LABEL>
          <SYNONYM>geographic location (elevation)</SYNONYM>
          <NAME>elevation</NAME>
          <DESCRIPTION>The elevation of the sampling site as measured by the vertical distance from mean sea level.</DESCRIPTION>
          <UNITS>
            <UNIT>m</UNIT>
          </UNITS>
          <FIELD_TYPE>
            <TEXT_FIELD>
              <REGEX_VALUE>([+-]?(0|((0\.)|([1-9][0-9]*\.?))[0-9]*)([Ee][+-]?[0-9]+)?)|((^not collected$)|(^not provided$)|(^restricted access$)|(^missing: control sample$)|(^missing: sample group$)|(^missing: synthetic construct$)|(^missing: lab stock$)|(^missing: third party data$)|(^missing: data agreement established pre-2023$)|(^missing: endangered species$)|(^missing: human-identifiable$))</REGEX_VALUE>
            </TEXT_FIELD>
          </FIELD_TYPE>
          <MANDATORY>mandatory</MANDATORY>
          <MULTIPLICITY>multiple</MULTIPLICITY>
        </FIELD>
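
To make the synonym relationship concrete, here is a minimal Python sketch. It is not the BioSamples or Webin implementation. It parses an abbreviated copy of the FIELD above (the regex is shortened to the numeric branch and two of the missing-value terms) and checks whether a submitted tag and value would satisfy the field. The helper names `load_field` and `matches_field` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated copy of the checklist FIELD quoted above. The REGEX_VALUE keeps
# only the numeric branch and two of the missing-value alternatives.
FIELD_XML = r"""
<FIELD>
  <LABEL>elevation</LABEL>
  <SYNONYM>geographic location (elevation)</SYNONYM>
  <NAME>elevation</NAME>
  <FIELD_TYPE>
    <TEXT_FIELD>
      <REGEX_VALUE>([+-]?(0|((0\.)|([1-9][0-9]*\.?))[0-9]*)([Ee][+-]?[0-9]+)?)|((^not collected$)|(^not provided$))</REGEX_VALUE>
    </TEXT_FIELD>
  </FIELD_TYPE>
</FIELD>
"""


def load_field(field_xml):
    """Return (canonical name, accepted tags, compiled value regex)."""
    field = ET.fromstring(field_xml)
    name = field.findtext("NAME")
    # The canonical name and every SYNONYM are treated as equivalent tags.
    accepted = {name} | {syn.text for syn in field.findall("SYNONYM")}
    pattern = re.compile(field.findtext("FIELD_TYPE/TEXT_FIELD/REGEX_VALUE"))
    return name, accepted, pattern


def matches_field(tag, value, field_xml=FIELD_XML):
    """True if the tag is the field name or a synonym and the value fits the regex."""
    _, accepted, pattern = load_field(field_xml)
    return tag in accepted and pattern.fullmatch(value) is not None
```

Under these assumptions, `matches_field("geographic location (elevation)", "1")` is true, so the attribute in the reported sample is acceptable for this field once synonyms are taken into account.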

Error in BioSamples:

2:34:07.431 [Test worker] INFO uk.ac.ebi.ena.sra.SRALoader - load processed in 678ms
12:34:07.432 [Test worker] INFO uk.ac.ebi.ena.sra.utils.Common - *|ERROR: 2024_08_21_12_34_05_619__EBI_SUB_SRA_TEST_ALIAS__1 failed validation due to should have required property 'local environmental context'
12:34:07.432 [Test worker] INFO uk.ac.ebi.ena.sra.utils.Common - *|ERROR: 2024_08_21_12_34_05_619__EBI_SUB_SRA_TEST_ALIAS__1 failed validation due to should have required property 'elevation'
12:34:07.432 [Test worker] INFO uk.ac.ebi.ena.sra.utils.Common - *|ERROR: Failed to submit samples to BioSamples

amnonkhen commented 1 month ago

Investigation:

  1. Get a JSON copy of the reported ENA XML document.
  2. Validate it against the JSON Schema checklist.
  3. Inspect the errors. They might be related to the synonym error messages issue #67.
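
The first investigation step can be sketched as follows. This is an illustration only: the `characteristics` layout with `text` and `unit` keys is assumed to approximate the BioSamples JSON representation, and the function names `sample_xml_to_characteristics` and `missing_required` are hypothetical. The second function is a stand-in for real JSON Schema validation. It only shows how a "should have required property" error can arise when the schema in use does not account for synonyms.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the reported sample, keeping two attributes.
SAMPLE_XML = """
<SAMPLE_SET>
  <SAMPLE alias="" center_name="">
    <TITLE>FTP2_ITS</TITLE>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>collection date</TAG>
        <VALUE>2013-05-31</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>geographic location (elevation)</TAG>
        <VALUE>1</VALUE>
        <UNITS>m</UNITS>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>
"""


def sample_xml_to_characteristics(sample_set_xml):
    """Convert the first SAMPLE's attributes to a JSON-like dict (assumed layout)."""
    sample = ET.fromstring(sample_set_xml).find("SAMPLE")
    characteristics = {}
    for attr in sample.iter("SAMPLE_ATTRIBUTE"):
        entry = {"text": attr.findtext("VALUE")}
        unit = attr.findtext("UNITS")
        if unit is not None:
            entry["unit"] = unit
        # A tag may occur more than once, so values are collected in a list.
        characteristics.setdefault(attr.findtext("TAG"), []).append(entry)
    return {"characteristics": characteristics}


def missing_required(document, required):
    """Naive required-property check that ignores synonyms."""
    return [name for name in required if name not in document["characteristics"]]
```

With this trimmed sample, `missing_required(doc, ["elevation", "collection date"])` reports `elevation` as missing even though `geographic location (elevation)` is present, which mirrors the error in the log above.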

theisuru commented 1 month ago

These schemas work correctly in the local environment. For some reason, wp-np2-44 contains a very old version of the schema. We will test this again once we have imported the revised checklist, as mentioned in #77.

theisuru commented 1 month ago

@dipayan1985 This was due to an old version of the schema. The schema has now been updated to the latest version and should work on the dev (wp-np2-44) environment.

dipayan1985 commented 1 month ago

While testing, I am observing a different issue now:

No validation errors expected: 
Actual validation ERROR: ERAM.1.0.30
message: 2024_09_02_14_43_59_541__EBI_SUB_SRA_TEST_ALIAS__1 failed validation due to Just one of the following properties must be specified: 'geographic location (depth)', 'depth', 'Depth'
origin: 
Actual validation ERROR: ERAM.1.0.29
message: Failed to submit samples to BioSamples
origin: 
Actual validation ERROR: ERAM.1.0.29
message: Failed to submit samples to BioSamples
origin: 
Actual validation ERROR: ERAM.EXCEPTION
message: An exception occurred: java.lang.RuntimeException: Failed to submit all samples to BioSamples

The sample is:

<?xml version = '1.0' encoding = 'UTF-8'?>
<SAMPLE_SET>
    <SAMPLE alias="" center_name="">
        <TITLE>FTP2_ITS</TITLE>
        <SAMPLE_NAME>
            <TAXON_ID>410658</TAXON_ID>
            <SCIENTIFIC_NAME>soil metagenome</SCIENTIFIC_NAME>
        </SAMPLE_NAME>
        <DESCRIPTION>Fort Tryon Park 2. ITS</DESCRIPTION>
        <SAMPLE_ATTRIBUTES>
            <SAMPLE_ATTRIBUTE>
                <TAG>investigation type</TAG>
                <VALUE>metagenome</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>project name</TAG>
                <VALUE>nyc_parks_medians</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>sequencing method</TAG>
                <VALUE>Illumina</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>collection date</TAG>
                <VALUE>2013-05-31</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>environmental package</TAG>
                <VALUE>soil</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>geographic location (latitude)</TAG>
                <VALUE>40.86622</VALUE>
                <UNITS>DD</UNITS>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>geographic location (longitude)</TAG>
                <VALUE>-73.93168</VALUE>
                <UNITS>DD</UNITS>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>geographic location (country and/or sea)</TAG>
                <VALUE>USA</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>geographic location (depth)</TAG>
                <VALUE>10.00</VALUE>
                <UNITS>m</UNITS>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>soil environmental package</TAG>
                <VALUE>soil</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>environment (biome)</TAG>
                <VALUE>ENVO:urban biome</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>environment (feature)</TAG>
                <VALUE>soil</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>environment (material)</TAG>
                <VALUE>soil</VALUE>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>depth</TAG>
                <VALUE>0.1</VALUE>
                <UNITS>m</UNITS>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>geographic location (elevation)</TAG>
                <VALUE>1</VALUE>
                <UNITS>m</UNITS>
            </SAMPLE_ATTRIBUTE>
            <SAMPLE_ATTRIBUTE>
                <TAG>ENA-CHECKLIST</TAG>
                <VALUE>ERC000022</VALUE>
            </SAMPLE_ATTRIBUTE>
        </SAMPLE_ATTRIBUTES>
    </SAMPLE>
</SAMPLE_SET>

I am not sure why the error says that just one of the following should be provided: 'geographic location (depth)', 'depth', 'Depth'. The existing behavior accepts multiple.

/cc @theisuru @amnonkhen @gabsie

dipayan1985 commented 1 month ago

As discussed with Colman and Peter, sample metadata should contain exactly one attribute per field, named either by the field name or by one of its synonyms. The Webin tests have been adapted and are passing, so this ticket can be marked as done.

/cc @theisuru @amnonkhen
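
The agreed rule (one attribute per field, by name or synonym) can be illustrated with a small Python sketch. It is a simplified stand-in, not the JSON Schema used by BioSamples. The depth group is taken from the error message above and the elevation group from the checklist FIELD; the function name `synonym_conflicts` is hypothetical.

```python
# Each group lists a field name together with its synonyms.
SYNONYM_GROUPS = [
    ("geographic location (depth)", "depth", "Depth"),
    ("elevation", "geographic location (elevation)"),
]


def synonym_conflicts(tags, synonym_groups=SYNONYM_GROUPS):
    """Return, per group, the tags used when more than one name of that group occurs."""
    conflicts = []
    for group in synonym_groups:
        used = [tag for tag in tags if tag in group]
        if len(used) > 1:
            conflicts.append(used)
    return conflicts


# Tags from the failing sample: both depth names are present.
failing_tags = [
    "collection date",
    "geographic location (depth)",
    "depth",
    "geographic location (elevation)",
]

# Tags after adapting the test data to use a single depth attribute.
adapted_tags = [
    "collection date",
    "depth",
    "geographic location (elevation)",
]
```

Here `synonym_conflicts(failing_tags)` returns the two depth tags as one conflicting group, while `synonym_conflicts(adapted_tags)` returns an empty list.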