FAIRmat-NFDI / AreaA-data_modeling_and_schemas

The ELN custom schemas from synthesis experiments
https://fairmat-nfdi.github.io/AreaA-Documentation/

Sample Cut Task #6

Open aalbino2 opened 1 year ago

aalbino2 commented 1 year ago

It's not clear to me how we support the case where a user cuts a sample, generating new IDs for the children and then probably opening new workflows for each of them, but still wants to keep track of where that sample came from.

Any elucidation? @hampusnasstrom @Pepe-Marquez @budschi

aalbino2 commented 1 year ago

Should we place a parent_sample quantity somewhere? Or should we simply generate the IDs ad hoc so that one can find both parent and children? E.g. the parent sample ID is #xyz and the children are #xyz_1, etc.
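A minimal sketch of what such a parent_sample quantity could look like in NOMAD's Python metainfo (the Sample class and its quantities here are hypothetical, just to illustrate the idea):

```python
from nomad.datamodel.data import EntryData
from nomad.metainfo import Quantity, Reference, SectionProxy


class Sample(EntryData):
    # hypothetical sample section; SectionProxy allows the self-reference
    sample_id = Quantity(type=str, description='Unique ID, e.g. "xyz" or "xyz_1".')
    parent_sample = Quantity(
        type=Reference(SectionProxy('Sample')),
        description='The sample this one was cut from, if any.',
    )
```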

aalbino2 commented 1 year ago

How do you currently deal with the generation of children Samples? @hampusnasstrom could you please also invite Michael to this repo? I guess he might already have an implementation, thx

hampusnasstrom commented 1 year ago

If I understand correctly, @RoteKekse uses batches which reference all children, but I'm not sure if he has implemented anything for breaking samples. One way to deal with it would be to just let them be linked through a sample-breaking process, but for this we need to have a discussion on how we propagate properties. We could also put a repeating quantity of references to the children, but I guess only in that direction, to avoid a circular reference; see the sketch below.
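Such a one-directional repeating reference could look roughly like this (hypothetical names again; only the parent-to-children direction is stored, so no cycle arises):

```python
from nomad.datamodel.data import EntryData
from nomad.metainfo import Quantity, Reference, SectionProxy


class Sample(EntryData):
    # shape=['*'] makes this a repeating quantity: one reference per child
    children = Quantity(
        type=Reference(SectionProxy('Sample')),
        shape=['*'],
        description='Samples produced by breaking this one.',
    )
```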

aalbino2 commented 1 year ago

Yes, a quantity holding the reference to the parent is okay. I was thinking of a Task where you have the number N of children you want to generate as an integer quantity. Pressing save should generate N outputs with SampleIDs, references to the parent, and whatever else.
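As a sketch of that idea (all classes here are hypothetical; the Sample stand-in mirrors the sketch above), the Task could look roughly like this, with the actual creation of the child entries left to a helper such as the create_archive function discussed further down in this thread:

```python
from nomad.datamodel.data import EntryData
from nomad.metainfo import Quantity, Reference, SectionProxy


class Sample(EntryData):  # minimal stand-in for the sample section
    sample_id = Quantity(type=str)
    parent_sample = Quantity(type=Reference(SectionProxy('Sample')))


class SampleCutTask(EntryData):
    number_of_children = Quantity(type=int, description='N children generated on save.')
    parent = Quantity(type=Reference(SectionProxy('Sample')))

    def normalize(self, archive, logger):
        # normalize runs on save, so pressing save generates the N outputs
        super().normalize(archive, logger)
        if not self.number_of_children or not self.parent:
            return
        for i in range(1, self.number_of_children + 1):
            child = Sample(
                sample_id=f'{self.parent.sample_id}_{i}',
                parent_sample=self.parent,
            )
            # each child would then be written out as its own entry,
            # e.g. via a create_archive helper (see later in this thread)
```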

What do you guys think?

hampusnasstrom commented 1 year ago

Okay, that would be the opposite of what I wrote.

But yes, I think that would be good. Then we just have to decide if we propagate all properties (except for geometry I guess). We should probably also offer a way to indicate how the samples were broken.

aalbino2 commented 1 year ago

Good point, we need to know which properties to propagate. I would start with propagating only the SampleID, properly extended within the normalize function to be unique. Considering that different classes of samples can be split, and we don't know right now which properties they would have inside, which properties to copy into the children is something we can figure out later.

This issue is mainly meant to save users such as @lapmk @A-D-Fuchs @GEllrott @SebLoUniv from manually instantiating the many samples generated by a cut.

You guys could go check these YAML files I generated to put down an example of what you told me in our last meeting:

https://github.com/FAIRmat-NFDI/AreaA-data_modeling_and_schemas/tree/main/example_for_workflow

I will also try to address the other points listed in the notes that I put at the root level of the schema. Just try to upload these files and read the comments.

RoteKekse commented 1 year ago

Hey @aalbino2, there is a practical thing to consider here. Sample hierarchy can be confusing to users, especially if certain properties are not populated and are only there by reference. In https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1099 we already discussed mere duplication. Duplication has the advantage that it is simple; ideally, a new ID is given on duplication.

You could still maintain a link to the parent sample, since any processes not entered for the parent before breaking would otherwise be lost.

In principle, what @hampusnasstrom says is the cleanest, and for internal modelling this should be the goal. But I think a simple duplication feature, not even implemented as a NOMAD EntryData, has the best chance of not raising too much confusion. Since this is not implemented yet, it could be done via a process for now.

aalbino2 commented 1 year ago

Can you point to or show an example of the "cleanest way"? I don't understand it yet. Thx

RoteKekse commented 1 year ago

Hey Andrea, of course. A good overarching model for anything that is happening is entity-activity modelling; https://www.w3.org/TR/prov-dm/ gives a bit of background on that.

Entities are samples, solutions, instruments, data sets, etc. Activities are processes, measurements etc.

There is an alternating relation between entities and activities. Each entity can be an input or an output of some activity, and activities can have n inputs and m outputs. There is always a clear distinction between an entity before and after an activity. If properties/attributes are preserved through the activity, the activity needs to take care of that through the output.

This gives the most flexibility, but in practice it will lead to a bunch of entities which seem to be more or less the same, and this could be confusing to users.
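A rough illustration of that alternating structure in plain Python (not NOMAD classes, just the abstract model, with made-up names):

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    # a sample, solution, instrument, data set, ...
    name: str


@dataclass
class Activity:
    # a process or measurement: n inputs, m outputs
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


# cutting: one entity in, two new entities out; any preserved
# properties must be carried over by the activity to its outputs
wafer = Entity('wafer_001')
cut = Activity('cut', inputs=[wafer],
               outputs=[Entity('wafer_001_1'), Entity('wafer_001_2')])
```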

aalbino2 commented 1 year ago

Ok thanks, this is how this repo is structured; if you check out the readme you will see the very familiar image with the task-input-output backbone.

I haven't figured out yet how to handle the sample-breaking Task in the cleanest way!

I remember you implemented this generation of multiple archives from one parent somewhere, am I wrong?

Copying archives could be fine indeed; good that you asked for a more exposed button.

Even more, the people I tagged in the comments above indeed don't want to generate a forest of archives with new IDs after every task; they are more sample-centered and would like to stick to one input at the beginning of a workflow. But this is a matter involving their user-specific data model, and it clearly does not apply to a sample-breaking Task, as there we really generate new things that need IDs.

Should we implement sample breaking by copying archive files and giving new IDs? Doesn't look bad to me.

RoteKekse commented 1 year ago

Yes, I would recommend that. Just make copies and give each a new ID. You could consider attaching something to the old ID, so that it is clear from the ID that it is a child, e.g. id_parent becomes id_parent_id_child, and for each breaking you extend it if needed.
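That scheme could be as simple as (sketch):

```python
def child_id(parent_id: str, index: int) -> str:
    # 'xyz' -> 'xyz_1'; breaking 'xyz_1' again gives 'xyz_1_1', 'xyz_1_2', ...
    return f'{parent_id}_{index}'
```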

aalbino2 commented 1 year ago

Sure, do you have a normalize function where you copy archives, so I can borrow from there? Thx

RoteKekse commented 1 year ago

Kind of, it is something similar: https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/blob/eln_application_hzb_chemical_energy/nomad/datamodel/metainfo/eln/application_hzb/CE_NOME/__init__.py#L93 In there I create a batch (which is a subclass of sample); in the batch I say how many samples I want, then I query the next sample ID and create the samples.

If it is not clear, let me know and I can explain in more detail then.

RoteKekse commented 1 year ago

The create_archive function is at the top of the file.
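For reference, create_archive helpers in such NOMAD plugins typically look roughly like this (a sketch from memory, not the exact code at the link; the m_context method names may differ between NOMAD versions):

```python
import json


def create_archive(entity, archive, file_name):
    # write `entity` as a new raw *.archive.json file inside the same
    # upload, so NOMAD processes it into its own entry with its own ID
    if not archive.m_context.raw_path_exists(file_name):
        with archive.m_context.raw_file(file_name, 'w') as outfile:
            json.dump({'data': entity.m_to_dict(with_root_def=True)}, outfile)
        archive.m_context.process_updated_raw_file(file_name)
```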

aalbino2 commented 1 year ago

ok great! I'll take a look there and let you know if I have problems, thanks

aalbino2 commented 1 year ago

This is a task that I implemented a long time ago now; I would like confirmation/rejection feedback on it.

The branch is this one: https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/tree/1315-extension-of-workflow-classes

When we come together to discuss, I'll show you what it looks like.