ResearchObject / ro-crate-py

Python library for RO-Crate
https://pypi.org/project/rocrate/
Apache License 2.0

Odd behaviour when documenting files that don't exist #195

Closed: multimeric closed this issue 1 month ago

multimeric commented 1 month ago

Let's say I have an ro-crate-metadata.json in a directory with no other files. Specifically, I want to document some_file.fastq, but it doesn't exist:

{
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@type": "CreativeWork",
            "@id": "ro-crate-metadata.json",
            "conformsTo": {
                "@id": "https://w3id.org/ro/crate/1.1"
            },
            "about": {
                "@id": "./"
            }
        },
        {
            "@id": "./",
            "@type": "Dataset"
        },
        {
            "@id": "some_file.fastq",
            "@type": "File"
        }
    ]
}

from rocrate.rocrate import ROCrate
crate = ROCrate(".")
print(crate.dereference("some_file.fastq"))
# None
list(crate.data_entities)
# []
for entity in crate.contextual_entities:
    print(entity.id)
# #some_file.fastq

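I think it's reasonably clear from the metadata file that I mean for some_file.fastq to be a data entity, because it has the File type. The behaviours that are strange are: dereferencing some_file.fastq returns None; the file is not listed in data_entities; it is instead listed as a contextual entity, with a # prepended to its id; and no error is raised even though the file doesn't exist.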

Maybe some of these behaviours could be changed?

multimeric commented 1 month ago

Okay, actually all of these still happen even when I do create the files in the crate directory. What actually happened is that I fell afoul of this part of the spec:

Where files and folders are represented as Data Entities in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the Root Data Entity using the hasPart property.

I know this isn't a validation tool, but I wonder if it would be possible to catch this case where the file exists but isn't referenced in the root entity, because it's currently difficult to infer this.
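
The case described here can be detected from the metadata alone. Below is a minimal sketch that uses only the standard library and is not part of ro-crate-py: the function name find_unlinked_data_entities is hypothetical, and it assumes the root data entity has the id ./ and that anything typed File or Dataset is intended as a data entity, which is only a heuristic. It follows hasPart links transitively, since the spec allows indirect linking.

```python
import json


def _as_list(value):
    """Normalize a JSON-LD value that may be absent, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_unlinked_data_entities(metadata, root_id="./"):
    """Return ids of File/Dataset entities not reachable from the root via hasPart.

    Hypothetical helper for illustration; not an ro-crate-py function.
    """
    by_id = {entity["@id"]: entity for entity in metadata["@graph"]}

    # Walk hasPart links transitively, starting from the root data entity.
    reachable, stack = set(), [root_id]
    while stack:
        current = stack.pop()
        if current in reachable:
            continue
        reachable.add(current)
        for part in _as_list(by_id.get(current, {}).get("hasPart")):
            stack.append(part["@id"] if isinstance(part, dict) else part)

    # Anything that looks like a data entity but was never reached is suspicious.
    unlinked = []
    for entity in metadata["@graph"]:
        types = set(_as_list(entity.get("@type")))
        if types & {"File", "Dataset"} and entity["@id"] not in reachable:
            unlinked.append(entity["@id"])
    return unlinked


metadata = json.loads("""
{
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@type": "CreativeWork", "@id": "ro-crate-metadata.json",
         "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
         "about": {"@id": "./"}},
        {"@id": "./", "@type": "Dataset"},
        {"@id": "some_file.fastq", "@type": "File"}
    ]
}
""")

print(find_unlinked_data_entities(metadata))
# ['some_file.fastq']
```

Adding "hasPart": [{"@id": "some_file.fastq"}] to the ./ entity makes the function return an empty list for this metadata.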

simleo commented 1 month ago

The spec says:

Data Entities can also be other types, for instance an online database. These SHOULD be of @type: "CreativeWork" and typically have a @id which is an absolute URI.

So in general one cannot infer that an entity is a data entity only because it's a File or Dataset. For this reason the library first reads all entities listed in the root dataset's hasPart as data entities, then it reads all other entities as contextual entities. It is true that File and Dataset are commonly used for data entities though, so in #199 I'm adding a warning that's triggered when an entity that "looks like" a data entity is being read as a contextual entity.
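
The two-step reading order described above can be sketched roughly as follows. This is an illustration written against plain dictionaries, not the library's actual code: split_entities, DATA_LIKE_TYPES, and the file name listed.fastq are all hypothetical, the sketch only looks at the root dataset's direct hasPart entries, and the warning text is made up rather than taken from #199.

```python
import warnings

# Assumption: types that make an entity "look like" a data entity.
DATA_LIKE_TYPES = {"File", "Dataset"}


def split_entities(graph, root_id="./", metadata_id="ro-crate-metadata.json"):
    """Split a JSON-LD graph into (data ids, contextual ids). Hypothetical helper."""
    by_id = {entity["@id"]: entity for entity in graph}
    parts = by_id[root_id].get("hasPart", [])
    if isinstance(parts, dict):  # a single reference instead of a list
        parts = [parts]

    # Step 1: everything listed in the root dataset's hasPart is a data entity.
    data_ids = [part["@id"] for part in parts]

    # Step 2: every other entity is read as a contextual entity.
    skip = set(data_ids) | {root_id, metadata_id}
    contextual_ids = []
    for entity in graph:
        if entity["@id"] in skip:
            continue
        contextual_ids.append(entity["@id"])
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if DATA_LIKE_TYPES & set(types):
            warnings.warn(
                f"{entity['@id']} looks like a data entity but is not listed "
                "in the root dataset's hasPart"
            )
    return data_ids, contextual_ids


graph = [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "listed.fastq"}]},
    {"@id": "listed.fastq", "@type": "File"},
    {"@id": "some_file.fastq", "@type": "File"},
]

# Capture warnings so the example can show how many were raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    data_ids, contextual_ids = split_entities(graph)

print(data_ids)        # ['listed.fastq']
print(contextual_ids)  # ['some_file.fastq']
print(len(caught))     # 1
```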

The leading # is automatically added to the id of contextual entities if it's relative, since it's considered to be relative to the RO-Crate itself.
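
As a rough illustration of that rule, the sketch below prefixes relative ids with # and leaves absolute URIs and ids that already start with # untouched. The function contextual_id is hypothetical, and treating "has a URI scheme" as the test for an absolute id is an assumption, not necessarily the check the library performs.

```python
from urllib.parse import urlsplit


def contextual_id(identifier):
    """Return the id a contextual entity would get under the rule above (sketch)."""
    # Assumption: an id with a URI scheme (https:, mailto:, ...) counts as absolute.
    if identifier.startswith("#") or urlsplit(identifier).scheme:
        return identifier
    # Relative ids are taken as relative to the crate itself, hence the leading '#'.
    return "#" + identifier


print(contextual_id("some_file.fastq"))              # #some_file.fastq
print(contextual_id("#alice"))                       # #alice
print(contextual_id("https://w3id.org/ro/crate/1.1"))  # https://w3id.org/ro/crate/1.1
```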

Regarding the fact that no error is raised when a file listed in the metadata is missing from the crate, see #73 and then #136.

multimeric commented 1 month ago

Thanks, I see how this is a tricky problem, but your PR looks like it will help here.