Closed by glatard 4 years ago
@Martybird: is the dataset currently on the CONP portal the most up-to-date version? As I recall, the reviewers asked for the data to be uploaded to OpenNeuro, so the dataset there may be more up to date.
The versions on OpenNeuro and OSF are indeed the most up to date.
@akhanf Any chance you could update the dataset on CONP so that we have the latest version?
I will investigate on my end how we can link the raw and processed datasets on CONP and datalad.
Thank you!
@akhanf actually, if the dataset that was used to produce the registration outputs is different from the BigBrain dataset hosted on CONP, it would be best to create a separate CONP dataset with the raw data that was used to produce your results.
If that is needed, would you be able to create such a dataset for the CONP portal?
Let me know. Thank you.
@cmadjar @Martybird @akhanf - Dataset has been updated at the source (khanlab-datasets) and a PR has been opened to merge into conpdatasets (https://github.com/conpdatasets/BigBrainMRICoreg/pull/2)
Thank you, Jason!
@akhanf @Martybird Your PR was merged today, meaning the BigBrainMRICoreg dataset is now up to date on CONP.
Should the processed data in BigBrainMRICoreg be linked to the native data stored in the CONP BigBrain dataset or did you use different raw files to produce the processed data?
Let me know. Thank you!
Hi @cmadjar ,
The resulting data were all processed using the MINC2 format of the BigBrain dataset and then converted to NIfTI. The final repository contains both data types, but the spatial transformations are only available in MINC format.
Hi @Martybird,
Thank you for the quick reply.
So if I understand correctly, we could link the BigBrain dataset with the BigBrainMRICoreg dataset on the CONP portal as raw and derived datasets respectively?
Thank you!
@cmadjar Yes.
great! Thank you so much!
@Martybird did you use the version with 125 40-um blocks?
@glatard No, I didn't. I used the previous ICBM registration volume by Claude at 100um and 300um.
So BigBrainMRICoreg isn't derived from the BigBrain dataset we have on CONP. Do you know if the volume that you used is available online anywhere?
@glatard I haven't looked closely at the BigBrain dataset on CONP. The volumes are available on the official BigBrain website: https://bigbrain.loris.ca/main.php?test_name=brainvolumes&release=2015
Thanks to all who contributed, correct me if this is not complete but I think there are now:
We should now think about representing the "is derived from" relation in the datasets and the portal. The main use case would be for users to list the dataset(s) derived from a given parent dataset, or to list the parent dataset(s) of a given derived dataset.
DataLad already has a mechanism for that, through sub-datasets: parents of a derived dataset should be added as sub-datasets of the derived dataset. In this way, derived datasets can "declare" the parents that were used, without having to update the parent datasets themselves.
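As a rough illustration of the point above: DataLad sub-datasets are recorded as Git submodules, so the parents declared by a derived dataset can be read from its .gitmodules file. The sketch below is only a minimal example, not part of any CONP tool; the function name `list_submodules` and the example URL are hypothetical.

```python
import re


def list_submodules(gitmodules_text):
    """Return {submodule name: url} parsed from the text of a .gitmodules file."""
    parents = {}
    current = None
    for raw in gitmodules_text.splitlines():
        line = raw.strip()
        header = re.match(r'\[submodule "(.+)"\]$', line)
        if header:
            # Start of a new submodule section.
            current = header.group(1)
            parents[current] = None
        elif current is not None and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "url":
                parents[current] = value
    return parents


# Hypothetical .gitmodules content for a derived dataset with one parent.
EXAMPLE = """
[submodule "bigbrain-datalad"]
    path = bigbrain-datalad
    url = https://example.org/bigbrain-datalad.git
"""

print(list_submodules(EXAMPLE))
```

Because the parent is declared inside the derived dataset, nothing in the parent repository has to change.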
So I think the next steps on this topic should be:
About point 2, it should be noted that while listing parents would be straightforward, listing the derived datasets of a given parent requires going through all the datasets in the platform. This should probably be done asynchronously (using a cron job), to keep the interface responsive.
Tagging @jbpoline as this comes from a discussion we recently had together.
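The pass over all datasets mentioned above amounts to inverting a child-to-parents mapping once, so that children can later be looked up directly. A minimal sketch, assuming the parents of every dataset have already been collected into a dictionary; `build_children_index` and the sample mapping are illustrative only and do not reflect the portal's actual code.

```python
from collections import defaultdict


def build_children_index(parents_by_dataset):
    """Invert {dataset: [parents]} into {parent: [children]}."""
    children = defaultdict(list)
    for dataset, parents in parents_by_dataset.items():
        for parent in parents:
            children[parent].append(dataset)
    # Sort for stable output; datasets without children are simply absent.
    return {parent: sorted(kids) for parent, kids in children.items()}


# Illustrative input, using dataset names mentioned in this thread.
PARENTS = {
    "BigBrain_3DSurfaces": ["bigbrain-datalad"],
    "BigBrain_3DClassifiedVolumes": ["bigbrain-datalad"],
    "preventad-open-bids": ["preventad-open"],
    "bigbrain-datalad": [],
}

CHILDREN = build_children_index(PARENTS)
print(CHILDREN.get("bigbrain-datalad", []))
```

A periodic job could rebuild this index and store it, so that the dataset page only performs a lookup.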
@glatard: Your summary above is accurate.
There is nothing in DATS to specifically record an "isDerivedFrom" relationship, we would have to set that up ourselves under extraProperties, but that would be straightforward to do. Alternatively, the "YODA" data organisation principles mentioned in the DataLad handbook involve storing "isDerivedFrom" implicitly, in structures where a source dataset is always linked as a submodule of a derived dataset; that looks like it might be a bit fiddly to set up, but would only get more so with time.
Hi @emmetaobrien, yes, the YODA recommendation is what I thought we should do in point (1) of my post above (I didn't know it was part of YODA, thanks!). At this stage, I'm not sure if we should add this information in the DATS model, duplicating information in a possibly inconsistent way, or just rely on sub-datasets to store the derivation relation.
I lean slightly to storing the information in DATS because it would be more explicit and not require users to be familiar with YODA, but am open to arguments either way.
I think we should encourage putting parent datasets as sub-datasets no matter what, as it makes it way easier to access the parent data to understand and possibly reproduce the dataset. I'd say the question is whether we want to represent that in the DATS model in addition.
Discussion from the meeting today: we will use both YODA and a property in the DATS model. The DATS model will include URLs of the Git submodule(s) found in the dataset. @emmetaobrien will create an example for that.
There is now an example at https://github.com/conpdatasets/BigBrain_3DSurfaces, which has both the raw BigBrain dataset as a submodule and the field extraProperties->derivedFrom containing a link to the BigBrain dataset.
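For illustration, the derivedFrom field could be generated from the submodule URLs along these lines. This is a sketch under assumptions: the category/values layout follows the DATS.json excerpt quoted later in this thread, while the helper names `build_derived_from` and `add_derived_from` and the example URL are hypothetical.

```python
import json


def build_derived_from(parent_urls):
    """Build one extraProperties entry listing the parent dataset URLs."""
    return {
        "category": "derivedFrom",
        "values": [{"value": url} for url in parent_urls],
    }


def add_derived_from(dats, parent_urls):
    """Return a copy of dats whose extraProperties holds one derivedFrom entry."""
    updated = dict(dats)
    # Drop any previous derivedFrom entry so the field is not duplicated.
    props = [
        prop for prop in dats.get("extraProperties", [])
        if prop.get("category") != "derivedFrom"
    ]
    props.append(build_derived_from(parent_urls))
    updated["extraProperties"] = props
    return updated


# Hypothetical, heavily abbreviated DATS content.
dats = {"title": "Example derived dataset"}
updated = add_derived_from(dats, ["https://example.org/parent-dataset"])
print(json.dumps(updated, indent=2))
```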
Awesome! Does it require any update in the DATS schema? I think we could then PR that to the main dataset, and pass it on to @xlecours, @liamocn and @3design to represent this information in the portal.
If this seems OK to everyone, I will update the documentation of our DATS schema accordingly.
I have also just done an equivalent update to https://github.com/conpdatasets/BigBrain_3DClassifiedVolumes
The documentation has now been updated. https://github.com/CONP-PCNO/conp-documentation/pull/32
Thanks for all this @emmetaobrien! I guess this is now ready for a PR to conp-datasets? Derived information would then be ready to be displayed in the portal.
https://github.com/CONP-PCNO/conp-dataset/pull/275 is now ready for review.
Thanks @emmetaobrien for all the work in #275. The next step now would be to represent links between derived datasets and their parents in the portal. An easy way to do so would be to just have a list of parents and a list of child datasets on the dataset page. Maybe we could also add an icon on the front page to show that a dataset has x children.
I think we need @3design to chime in here and let us know how this information could be represented.
This is quite interesting. Is there a way we can show it in a kind of relationship flow chart (reminds me of a file browser structure)?
A typical relationship chart may be helpful:
parent 1
  L child 1
    L child 2 (current dataset)
Of course, it gets a bit more complex if you need to show relationships where multiple datasets are used for the derived:
parent 1, parent 2, parent 3
  L child 1, parent 2
    L child 2 (current dataset)
I think the above would be acceptable, especially if each dataset is clickable so that they can be found in the portal or linked outside.
It would be nice if it could be shown in a more dynamic way such as connected nodes (though this might be well beyond our current scope):
Indeed, a "file browser" structure might not work when a dataset has multiple parents. I had thought about an "email thread" kind of representation earlier but it has the same issue (derived datasets are a graph, not a tree). Your idea to have a more visual representation is awesome, I think it would be very impressive to give such a (clickable) high-level overview of relations between datasets. Here is another example of a possible visualization: https://brain-web.github.io/community/
Maybe we could break this down into the following 2 steps to make it more tractable?
The visual representation in your link is very nice. I think this would be very impressive as long as it's also still usable and valuable for the user.
I think your steps would be good. A basic view and a more relationship-based view. (we can give the user a choice to load the relationship view)
OK! So what do you think the design of step 1 could look like?
I've been looking at https://fairsharing.org and they are using a 'chip' based system for their tags, etc.
https://fairsharing.org/FAIRsharing.jptb1m
Maybe this can be adapted to our use:
Parent datasets: Chip 1, Chip 2, Chip 3
This doesn't show the whole history, but it does show one step up (parents).
What are your thoughts? Is it enough to show the parents and have them clickable so that the user will at least see what info is in the parent? Over time as our DB grows, this will become more meaningful as each parent will link to other datasets in our system (as well as outside).
As a side benefit, these traced connections may also be usable when generating the interactive graph.
Hi @3design, this sounds very good to me for a first step! I guess this could now be sent to @liamocn to implement? @liamocn, to summarize the discussion above, the goal would be to:
derivedFrom attributes. Example: https://github.com/emmetaobrien/BigBrain_3DClassifiedVolumes/blob/9245806c015cf389aba4e4959d6e130c0322568f/DATS.json
@emmetaobrien @3design @liamocn Can you guys work together to make that happen on the frontend before the end of June 2020?
I think there is a plan for @liamocn to start implementing on the portal.
Thank you!
@glatard @emmetaobrien Currently the value of the parent dataset is a URL. I can use this to create a link, but I have no way of labelling the link aside from with the url itself, as I can't reliably identify what the parent dataset actually is. Also this link will be an external one and will not link to that dataset on the portal, for the same reason. Is all this intended?
example of the content of the DATS.json file:
{
    "category": "derivedFrom",
    "values": [
        {
            "value": "https://github.com/conpdatasets/preventad-open/tree/acee97ba0ec6bb2398d69d519dd4be1cf710ac48"
        }
    ]
}
It is nice to be able to track the exact commit that the dataset is derived from, but maybe we could also add the parent dataset name so that Liam can link the datasets in the portal?
Maybe something like this?
{
    "category": "derivedFrom",
    "values": [
        {
            "parent_dataset_id": "preventad-open",
            "value": "https://github.com/conpdatasets/preventad-open/tree/acee97ba0ec6bb2398d69d519dd4be1cf710ac48"
        }
    ]
}
@liamocn if there was a "parent_dataset_id" or "parent_dataset_name", would that work for you? @glatard @emmetaobrien thoughts? Other suggestions?
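To make the proposal concrete, here is a hedged sketch of how a consumer could turn such an entry into a labelled link, falling back to the bare URL when no id is present. The function `parent_links` and the shape of its output are assumptions for illustration, not the portal's actual implementation.

```python
def parent_links(dats):
    """Collect label/url pairs from the derivedFrom entries of a DATS dict."""
    links = []
    for prop in dats.get("extraProperties", []):
        if prop.get("category") != "derivedFrom":
            continue
        for entry in prop.get("values", []):
            url = entry.get("value")
            dataset_id = entry.get("parent_dataset_id")
            links.append({
                # Without an id, the URL itself is the only available label.
                "label": dataset_id or url,
                "url": url,
                # Only entries with an id can be resolved to a portal page.
                "internal": dataset_id is not None,
            })
    return links


# DATS fragment taken from the example quoted in this thread.
dats = {
    "extraProperties": [
        {
            "category": "derivedFrom",
            "values": [
                {
                    "parent_dataset_id": "preventad-open",
                    "value": "https://github.com/conpdatasets/preventad-open/tree/acee97ba0ec6bb2398d69d519dd4be1cf710ac48",
                }
            ],
        }
    ]
}

print(parent_links(dats))
```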
Yes, that's sort of what I was thinking of. This also helps with identifying child datasets. Of course, if I can assume that all parent datasets are going to have URLs of the form https://github.com/conpdatasets/:id/... then I can slice the id out, but that will break if the assumption isn't true for all cases.
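The URL-slicing fallback described here could look roughly like the sketch below, which returns nothing when the URL does not match the assumed pattern instead of guessing. The function name `dataset_id_from_url` and its `org` parameter are hypothetical.

```python
from urllib.parse import urlparse


def dataset_id_from_url(url, org="conpdatasets"):
    """Return the repository name if url points into the given GitHub org, else None."""
    parsed = urlparse(url)
    segments = [segment for segment in parsed.path.split("/") if segment]
    if parsed.netloc.lower() != "github.com" or len(segments) < 2:
        return None
    if segments[0] != org:
        # The assumption does not hold for this parent, so do not guess.
        return None
    name = segments[1]
    if name.endswith(".git"):
        name = name[:-4]
    return name


print(dataset_id_from_url(
    "https://github.com/conpdatasets/preventad-open/tree/acee97ba0ec6bb2398d69d519dd4be1cf710ac48"
))
```

An explicit id in the DATS file avoids this guesswork, which is the trade-off under discussion here.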
@cmadjar: Your suggestion sounds very sensible to me. What's the most straightforward value to include there for lookup? Would just the dataset name suffice?
@emmetaobrien what is being parsed by the portal to populate the database is the name of the submodule in .gitmodules: https://github.com/CONP-PCNO/conp-dataset/blob/master/.gitmodules
So for PREVENT-AD BIDS, the parent dataset name would be preventad-open, but for a dataset derived from BigBrain one would need to add bigbrain-datalad as the parent dataset name.
I can send a PR to add that for the PREVENT-AD BIDS dataset. Maybe @emmetaobrien you can take care of the other ones?
@liamocn the DATS.json of preventad-open-bids has been updated with the new key so that you can work on the portal design again. Thank you!
@cmadjar : I have just updated the BigBrain derived datasets accordingly.
I pulled the code from @liamocn on the portal and we can now see the child datasets and the parent datasets on the appropriate datasets.
@glatard @emmetaobrien is there anything else I am not seeing that needs to be done for this issue? Thank you!
Purpose
Integrate derived BigBrain datasets in the portal, as a first use case of derived data support. Make sure that derived datasets can be linked to their parents (and conversely), in the portal and in DataLad.
Context
Derived datasets are an important use case. BigBrain is a good case to start from, as there are a lot of derivatives, and it is an open dataset.
Possible Implementation
Start with https://portal.conp.ca/dataset?id=projects/Khanlab/BigBrainMRICoreg, as it is already in CONP.
Proposed work-plan:
@cmadjar, @akhanf, @samirdas