ResearchObject / ro-crate-py

Python library for RO-Crate
https://pypi.org/project/rocrate/
Apache License 2.0
46 stars 23 forks source link

file entitites overwritten without warnig #185

Open mashehu opened 1 month ago

mashehu commented 1 month ago

Hi,

I am currently working on adding RO-crates to the nf-core CLI tools: https://github.com/nf-core/tools/pull/2680

nf-core pipelines usually have a root main.nf file, which should be a ComputationalWorkflow entity, but they also have modules with their own nested main.nf e.g. modules/local/cat_additional_fasta/main.nf. Trying to add all the different nested files using crate.add_file(fn, properties={"programmingLanguage": {"@id": "#nextflow"}}) where fn is the file path, e.g., modules/local/cat_additional_fasta/main.nf results in only one main.nf entry. How do I get the full path as the entry id to avoid overwriting duplicates?

Additionally, a warning that an @id already exists before overwritting it, would be helpful.

elichad commented 1 month ago

@simleo is this a bug? I'm not familiar enough with the code, but I assumed that file entities would use the full file path as @id.

elichad commented 4 weeks ago

@mashehu and I agreed on the WRROC call that this is a bug.

As a workaround, you should be able to use the dest_path argument to manually set the intended file path within the crate, and therefore the @id:

crate.add_file(fn, dest_path=fn, properties={"programmingLanguage": {"@id": "#nextflow"}})

Also, you might like to look at the add_workflow() function which could save you a few lines elsewhere. Something like:

crate.add_workflow(                # sets @type and conformsTo according to Workflow RO-Crate spec
    fn, 
    dest_path=fn, 
    main=True,                     # sets the added workflow as main entity
    properties={ . . . },                
    lang="nextflow"                # adds the #nextflow entity automatically and connects it to programmingLanguage
    lang_version="X.Y.Z",          # sets version on #nextflow
) 
mashehu commented 3 weeks ago

Thanks @elichad, specifying dest_path works.

add_workflow() is indeed helpful, is it documented somewhere?

elichad commented 3 weeks ago

add_workflow() is not documented yet - I myself only learned about it very recently, and raised #186 to flag that we need to fix that gap in the documentation

simleo commented 1 week ago

mashehu and I agreed on the WRROC call that this is a bug.

As a workaround, you should be able to use the dest_path argument to manually set the intended file path within the crate, and therefore the @id

It's not a bug, and using dest_path is not a workaround, but the normal way to set the data entity's path / id within the crate, which is always relative to the crate's root directory (which is not known until the crate is written).

When dest_path is not specified, the default behavior is to place the data entity (file or directory) at the top level, thus setting the id to the basename. Using the source path as the destination path does not work in general:

If you want to check exactly what happens, take a look at FileOrDir's __init__ method and File's write method

A usage example can be found in repo2rocrate.

Usage of dest_path is demonstrated at the beginning of the README, in Creating an RO-Crate. I will review the documentation to see if I can better clarify how things work.