FAIRDataPipeline / data-registry

The FAIR Data Registry is a Django website and REST API which is used by the FAIR Data Pipeline to store metadata about code runs and their inputs and outputs
BSD 2-Clause "Simplified" License

RO Crate #189

Closed antony-wilson closed 1 year ago

codecov[bot] commented 2 years ago

Codecov Report

Merging #189 (82e7db7) into main (7377772) will increase coverage by 0.97%. The diff coverage is 93.79%.

@@            Coverage Diff             @@
##             main     #189      +/-   ##
==========================================
+ Coverage   86.96%   87.94%   +0.97%     
==========================================
  Files          34       35       +1     
  Lines        2647     3077     +430     
==========================================
+ Hits         2302     2706     +404     
- Misses        345      371      +26     
Flag         Coverage Δ
unittests    87.94% <93.79%> (+0.97%) :arrow_up:

Impacted Files                           Coverage Δ
data_management/urls.py                  95.23% <ø> (ø)
data_management/rest/views.py            75.55% <66.00%> (-2.18%) :arrow_down:
data_management/rest/serializers.py      95.52% <83.33%> (-2.66%) :arrow_down:
data_management/rocrate.py               97.14% <97.14%> (ø)
data_management/models.py                84.06% <100.00%> (+0.86%) :arrow_up:
data_management/tests/init_prov_db.py    95.61% <100.00%> (+1.71%) :arrow_up:
data_management/tests/test_api.py        100.00% <100.00%> (ø)

sonarcloud[bot] commented 2 years ago

SonarCloud Quality Gate failed.

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 2 (rating E)
Code Smells: 5 (rating A)

Coverage: no information
Duplication: 0.0%

antony-wilson commented 2 years ago

The submission script and config file are now included in the zip file:

├── inputs
│   ├── data
│   │   └── 72e13dceb2a924f0babad5e1920b3191af0ebe50.csv
│   ├── model_config
│   │   └── config.yaml
│   └── submission_script
│       └── c2351d9bb49857728421e9344d88a45f9e88e835.toml
├── outputs
│   ├── a5ffd3479af8e37f9ea128a36b5aeb75240d1160.pdf
│   └── c2351d9bb49857728421e9344d88a45f9e88e835.toml
└── ro-crate-metadata.json
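
One way to sanity-check a zip laid out like the tree above is to confirm that every local file listed under hasPart in ro-crate-metadata.json is actually present in the archive. The following is a minimal sketch using only the Python standard library. The function names and the demo contents (including outputs/plot.png) are hypothetical, and this is not the registry's own code.

```python
import io
import json
import zipfile


def missing_parts(zip_bytes: bytes) -> list[str]:
    """Return local hasPart ids from ro-crate-metadata.json that are absent from the zip."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = set(zf.namelist())
        metadata = json.loads(zf.read("ro-crate-metadata.json"))
    # The root dataset is the entity whose @id is "./".
    root = next(e for e in metadata["@graph"] if e.get("@id") == "./")
    parts = [p["@id"] for p in root.get("hasPart", [])]
    # Remote parts (absolute URLs such as DOIs) are not expected inside the zip.
    local = [p for p in parts if "://" not in p]
    return [p for p in local if p not in names]


def build_demo_zip() -> bytes:
    """Build a tiny in-memory crate (hypothetical contents) with one file missing."""
    metadata = {
        "@graph": [
            {
                "@id": "./",
                "@type": "Dataset",
                "hasPart": [
                    {"@id": "inputs/model_config/config.yaml"},
                    {"@id": "outputs/plot.png"},  # listed but never written below
                    {"@id": "https://doi.org/10.1038/s41592-020-0856-2"},
                ],
            }
        ]
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("ro-crate-metadata.json", json.dumps(metadata))
        zf.writestr("inputs/model_config/config.yaml", "run_metadata: {}\n")
    return buffer.getvalue()


print(missing_parts(build_demo_zip()))  # ['outputs/plot.png']
```

A check like this could run in a test against the zip returned by the API, so that a file dropped from the archive is caught before the metadata is published.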

@RyanJField was there anything else to include?

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: no information
Duplication: 0.0%

antony-wilson commented 1 year ago

The inclusion of author/1 may be a bug. I think the reference should point to "@id": "https://orcid.org/000-0000-0000-0000",

RyanJField commented 1 year ago

The inclusion of author/1 may be a bug. I think the reference should point to "@id": "https://orcid.org/000-0000-0000-0000",

This is now the json I get:

{
    "@context": [
        "https://w3id.org/ro/crate/1.1/context",
        {
            "sha1": "http://xmlns.com/foaf/0.1/#term_sha1"
        }
    ],
    "@graph": [
        {
            "@id": "./",
            "@type": "Dataset",
            "datePublished": "2022-12-13T11:19:29.164067",
            "hasPart": [
                {
                    "@id": "outputs/14da2266b09360aa5cd36a9501a079aac9538634.png"
                },
                {
                    "@id": "inputs/model_config/config.yaml"
                },
                {
                    "@id": "inputs/submission_script/script.sh"
                },
                {
                    "@id": "inputs/data/1.0.0.csv"
                },
                {
                    "@id": "https://doi.org/10.1038/s41592-020-0856-2"
                }
            ],
            "license": {
                "@id": "https://creativecommons.org/licenses/by/4.0/"
            },
            "name": "RO Crate for SEIRS_model/results/figure/python",
            "publisher": "FAIR Data Pipeline"
        },
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "about": {
                "@id": "./"
            },
            "conformsTo": {
                "@id": "https://w3id.org/ro/crate/1.1"
            },
            "license": {
                "@id": "https://creativecommons.org/publicdomain/zero/1.0/"
            }
        },
        {
            "@id": "https://creativecommons.org/licenses/by/4.0/",
            "@type": "CreativeWork",
            "description": "Attribution 4.0 International",
            "identifier": "https://creativecommons.org/licenses/by/4.0/",
            "name": "CC BY 4.0"
        },
        {
            "@id": "https://creativecommons.org/publicdomain/zero/1.0/",
            "@type": "CreativeWork",
            "description": "CC0 1.0 Universal (CC0 1.0) Public Domain Dedication",
            "identifier": "https://creativecommons.org/publicdomain/zero/1.0/",
            "name": "CC0 Public Domain Dedication"
        },
        {
            "@id": "outputs/14da2266b09360aa5cd36a9501a079aac9538634.png",
            "@type": "File",
            "author": [
                {
                    "@id": "https://orcid.org/000-0000-0000-0000"
                }
            ],
            "description": "SEIRS output plot",
            "encodingFormat": "image/png",
            "name": "SEIRS_model/results/figure/python",
            "sha1": "14da2266b09360aa5cd36a9501a079aac9538634"
        },
        {
            "@id": "https://orcid.org/000-0000-0000-0000",
            "@type": "Person",
            "name": "Interface Test"
        },
        {
            "@id": "http://127.0.0.1:8000/api/code_run/1",
            "@type": "CreateAction",
            "agent": {
                "@id": "https://orcid.org/000-0000-0000-0000"
            },
            "description": "SEIRS Model python",
            "instrument": {
                "@id": "https://github.com/https://github.com/FAIRDataPipeline/pySimpleModel"
            },
            "name": "code run 1",
            "object": [
                {
                    "@id": "inputs/model_config/config.yaml"
                },
                {
                    "@id": "inputs/submission_script/script.sh"
                },
                {
                    "@id": "inputs/data/1.0.0.csv"
                }
            ],
            "result": {
                "@id": "outputs/14da2266b09360aa5cd36a9501a079aac9538634.png"
            },
            "startTime": "2022-12-13T11:17:39.498545+00:00"
        },
        {
            "@id": "https://github.com/https://github.com/FAIRDataPipeline/pySimpleModel",
            "@type": "SoftwareApplication",
            "author": [
                {
                    "@id": "https://orcid.org/000-0000-0000-0000"
                }
            ],
            "url": "https://github.com/https://github.com/FAIRDataPipeline/pySimpleModel"
        },
        {
            "@id": "inputs/model_config/config.yaml",
            "@type": [
                "File",
                "SoftwareSourceCode"
            ],
            "author": [
                {
                    "@id": "https://orcid.org/000-0000-0000-0000"
                }
            ],
            "description": "Working config.yaml location in datastore",
            "encodingFormat": "yaml",
            "name": "config.yaml",
            "sha1": "a010936d503444515e625cc4c7c5c842d031d9aa"
        },
        {
            "@id": "inputs/submission_script/script.sh",
            "@type": [
                "File",
                "SoftwareSourceCode"
            ],
            "author": [
                {
                    "@id": "https://orcid.org/000-0000-0000-0000"
                }
            ],
            "description": "Working script location in datastore",
            "encodingFormat": "text/x-sh",
            "name": "script.sh",
            "sha1": "f35c1cd83fbe1a458d71da1aae90ed2e8db2b031"
        },
        {
            "@id": "inputs/data/1.0.0.csv",
            "@type": "File",
            "author": [
                {
                    "@id": "https://orcid.org/000-0000-0000-0000"
                }
            ],
            "description": "Static parameters of the model",
            "encodingFormat": "text/csv",
            "name": "SEIRS_model/parameters",
            "sha1": "6294a5951677e6b8438cabf55234b7974adeaee3"
        },
        {
            "@id": "https://doi.org/10.1038/s41592-020-0856-2",
            "@type": "File",
            "datePublished": "2021-09-20 12:00:00+00:00",
            "name": "Static parameters of the model"
        },
        {
            "@id": "http://127.0.0.1:8000/api/data_extraction/1",
            "@type": "CreateAction",
            "description": "import/extract data from an external source",
            "name": "data extraction 1",
            "object": {
                "@id": "https://doi.org/10.1038/s41592-020-0856-2"
            },
            "result": {
                "@id": "inputs/data/1.0.0.csv"
            },
            "startTime": "2022-12-13T11:17:32.616245+00:00"
        }
    ]
}

The ORCID iD is fine if there is one, but it's an optional field. Might it need to fall back to the URL if no ORCID iD is set?

simleo commented 1 year ago

The person's @id does not necessarily have to be an ORCID URL. You can use an internal, possibly randomly generated identifier, e.g.:

        {
            "@id": "#3b6dd3e2-12f8-428f-833c-fa4314c9ae50",
            "@type": "Person",
            "name": "Interface Test"
        },

ro-crate-py automatically generates a random identifier if you don't specify one:

p = crate.add(Person(crate, properties={"name": "Interface Test"}))

In cases like this where there is no actual agent, of course, one can simply not add the agent property to the CreateAction altogether (it's not required).

Regarding properties for file checksums, they're not in the standard RO-Crate context. We're going to add them to the workflow-run ro-terms namespace soon though, see https://github.com/ResearchObject/ro-terms/issues/14. When that is done, crate authors will be able to use those properties by adding them as an extension to the context. E.g.:

EXTRA_TERMS = {
    "sha1": "https://w3id.org/ro/terms/workflow-run#sha1"
}
crate.metadata.extra_terms.update(EXTRA_TERMS)

Finally, note that the data extraction action is missing the instrument. It needs to point to the relevant software tool, like the code run action is doing.
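
A missing instrument like this can be spotted mechanically by scanning the @graph for CreateAction entities that have no instrument property. This is a minimal sketch with hypothetical helper names, run against an abbreviated copy of the two actions from the JSON above:

```python
def as_list(value):
    """Normalise a JSON-LD value that may be missing, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def actions_missing_instrument(graph):
    """Return the @id of every CreateAction entity that has no instrument."""
    return [
        entity["@id"]
        for entity in graph
        if "CreateAction" in as_list(entity.get("@type"))
        and "instrument" not in entity
    ]


# Abbreviated entities taken from the crate posted earlier in this thread.
graph = [
    {
        "@id": "http://127.0.0.1:8000/api/code_run/1",
        "@type": "CreateAction",
        "instrument": {
            "@id": "https://github.com/https://github.com/FAIRDataPipeline/pySimpleModel"
        },
    },
    {
        "@id": "http://127.0.0.1:8000/api/data_extraction/1",
        "@type": "CreateAction",
        "object": {"@id": "https://doi.org/10.1038/s41592-020-0856-2"},
        "result": {"@id": "inputs/data/1.0.0.csv"},
    },
]

print(actions_missing_instrument(graph))
# ['http://127.0.0.1:8000/api/data_extraction/1']
```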

antony-wilson commented 1 year ago

@RyanJField do we have anything that we could use as instrument to say how the DataProduct was extracted from the ExternalObject?

RyanJField commented 1 year ago

@RyanJField do we have anything that we could use as instrument to say how the DataProduct was extracted from the ExternalObject?

No, the pull command does not push a code run. I can add the functionality to do so. @richardreeve should the pull command push a code run to the registry, and should it only do so in the case of an external object?

richardreeve commented 1 year ago

Apologies for the delay here. There are two cases where we “convert” an external object into a data product.

  • In one case, the external object is the data product, so I’m not sure there is an instrument? The two are just the same entity - we can see this internally, because the external object is tagged as primary.
  • In the second case, the external object is tagged as supplementary internally, and then there is some unknown additional process involved in converting the external object into a data product. Often this is something like just retyping a table from a paper by hand, but in any case we do not record what is done. In that case, @simleo I’m not sure what a suitable generic instrument would be?

I don’t think in either case the code run would be appropriate, because there is never anything “run” to do the work, but I’m not sure what else we could say?

Edit: the only instrument I can think of that is generic enough is maybe something really unhelpful like a computer?

richardreeve commented 1 year ago

On the id front, it seems like we could just do as suggested for empty ORCIDs and replace any individual anonymous author with a randomly generated local id, so it’s the same id through the RO Crate - would that be possible @antony-wilson? That way we can distinguish different anonymous authors from one another.
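
The idea of one stable local id per anonymous author could be sketched roughly as below. This is only an illustration under stated assumptions: the AuthorIds class is hypothetical, the author key is assumed to be the registry's author URL, and the ORCID is assumed to be stored as a full URL.

```python
import uuid


class AuthorIds:
    """Hand out one @id per author: the ORCID URL if set, else a stable local id."""

    def __init__(self):
        # Maps an author key (assumed: the registry's author URL) to its local id.
        self._local = {}

    def id_for(self, author_key, orcid=None):
        if orcid:
            return orcid
        # Generate the random id once, then reuse it for every later reference,
        # so the same anonymous author has the same @id throughout the crate.
        if author_key not in self._local:
            self._local[author_key] = f"#{uuid.uuid4()}"
        return self._local[author_key]


ids = AuthorIds()
first = ids.id_for("http://127.0.0.1:8000/api/author/5")
again = ids.id_for("http://127.0.0.1:8000/api/author/5")
other = ids.id_for("http://127.0.0.1:8000/api/author/6")

print(first == again)  # True: same author, same id
print(first == other)  # False: different anonymous authors stay distinct
```

Keeping the mapping on one object per crate means the ids are consistent within a crate but are regenerated for the next one, which seems to match what is being asked for here.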

RyanJField commented 1 year ago

Apologies for the delay here. There are two cases where we “convert” an external object into a data product.

  • In one case, the external object is the data product, so I’m not sure there is an instrument? The two are just the same entity - we can see this internally, because the external object is tagged as primary.
  • In the second case, the external object is tagged as supplementary internally, and then there is some unknown additional process involved in converting the external object into a data product. Often this is something like just retyping a table from a paper by hand, but in any case we do not record what is done. In that case, @simleo I’m not sure what a suitable generic instrument would be?

I don’t think in either case the code run would be appropriate, because there is never anything “run” to do the work, but I’m not sure what else we could say?

I would suggest that the instrument is the CLI, as the CLI downloads the file and renames it... when pull is called the CLI generates a Job ID (based on the time and date) and config.yaml file for the pull.

antony-wilson commented 1 year ago

The process of going from an external object to an internal object is an activity, and I've named it data_extraction.

Re @richardreeve's two cases for external objects:

Case one: agreed, there is no need for a data_extraction activity.

Case two: we have a data_extraction activity that makes use of an instrument. From @RyanJField's comments, it sounds like the instrument is the CLI.

richardreeve commented 1 year ago

I agree it is the CLI that is physically moving the file from the remote store to the local one, but this is really about moving the data from the external source to the internal data product isn’t it? That isn’t being done by the CLI - it’s being done by a human somehow through an unknown instrument. I think the most we can say is that it’s a computer or something equally generic, surely?

richardreeve commented 1 year ago

In any event, are adding a computer instrument and a local id to replace the zeroed orcids both things we could potentially do this week so we could merge this PR and then raise an issue if we think it’s wrong later?

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: no information
Duplication: 0.0%

antony-wilson commented 1 year ago

The RO Crate is now using the CLI as the instrument for data_extraction activities.

Currently, if there is no ORCID iD, the code falls back to the local id, e.g. http://127.0.0.1:8000/api/author/5. Hopefully this is sufficient for now, and I'll add the random id stuff in the New Year.
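
For illustration, a data_extraction entity that points at the CLI as its instrument might be assembled along these lines. This is a sketch, not the code in this PR: the helper names are hypothetical, and the CLI URL is an assumed placeholder rather than the value the registry actually emits.

```python
# Assumed placeholder for the CLI's identifier; the real crate may use another URL.
CLI_URL = "https://github.com/FAIRDataPipeline/FAIR-CLI"


def software_entity(url):
    """Contextual entity describing the tool used as an instrument."""
    return {"@id": url, "@type": "SoftwareApplication", "url": url}


def data_extraction_action(action_id, name, source_id, result_id, start_time, instrument_id):
    """CreateAction linking an external object to the data product derived from it."""
    return {
        "@id": action_id,
        "@type": "CreateAction",
        "description": "import/extract data from an external source",
        "instrument": {"@id": instrument_id},
        "name": name,
        "object": {"@id": source_id},
        "result": {"@id": result_id},
        "startTime": start_time,
    }


cli = software_entity(CLI_URL)
action = data_extraction_action(
    action_id="http://127.0.0.1:8000/api/data_extraction/1",
    name="data extraction 1",
    source_id="https://doi.org/10.1038/s41592-020-0856-2",
    result_id="inputs/data/1.0.0.csv",
    start_time="2022-12-13T11:17:32.616245+00:00",
    instrument_id=cli["@id"],
)

print(action["instrument"])
```

Both dictionaries would need to be added to the @graph, since the instrument reference only resolves if the SoftwareApplication entity is present as well.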

richardreeve commented 1 year ago

Okay, if that’s the only thing, shall we merge this then and raise an issue about the non-uniqueness of the id? And do we need to do the same for the file checksums, or can we fix that now?

antony-wilson commented 1 year ago

I think merge now and update the checksum when https://github.com/ResearchObject/ro-terms/issues/14 is released

RyanJField commented 1 year ago

I think merge now and update the checksum when ResearchObject/ro-terms#14 is released

I agree this can now be merged.

richardreeve commented 1 year ago

Okay, does one of you want to merge it then? I'm happy.

simleo commented 1 year ago

In principle, an instrument can be anything (the expected value type is Thing), including a computer. However, in the context of software execution, it should point to a specific application, even if it's just a basic one for data transfer (e.g., cp, curl, ...). So I think that pointing to the CLI like you did was the right decision.

I've addressed https://github.com/ResearchObject/ro-terms/issues/14 in https://github.com/ResearchObject/ro-terms/pull/15. To avoid bloating the namespace, for now I've added only md5, sha1, sha256 and sha512. If you need some other variant, just open another issue.
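
As a rough illustration of how the four checksum properties might be produced on the crate author's side, the sketch below computes them with Python's hashlib and builds a matching context extension. The sha1 IRI is the one quoted earlier in this thread; the IRIs for md5, sha256 and sha512 are assumed to follow the same pattern, and the helper name is hypothetical.

```python
import hashlib

# The sha1 IRI appears earlier in this thread; the other three are assumed
# to follow the same pattern in the workflow-run namespace.
BASE = "https://w3id.org/ro/terms/workflow-run#"
ALGORITHMS = ("md5", "sha1", "sha256", "sha512")
EXTRA_TERMS = {name: BASE + name for name in ALGORITHMS}


def checksum_properties(data: bytes) -> dict:
    """Return a hex digest of `data` for each supported algorithm."""
    return {name: hashlib.new(name, data).hexdigest() for name in ALGORITHMS}


props = checksum_properties(b"")
print(props["sha1"])  # da39a3ee5e6b4b0d3255bfef95601890afd80709 (SHA-1 of empty input)
print(EXTRA_TERMS["sha1"])  # https://w3id.org/ro/terms/workflow-run#sha1
```

The resulting dictionary can be merged into a File entity's properties, with EXTRA_TERMS supplied as the context extension in the way shown above for sha1.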

By the way, the current draft of the Workflow Run RO-Crate profiles is now nicely formatted at https://www.researchobject.org/workflow-run-crate/profiles/ :slightly_smiling_face: